Poking around in Speedometer

Written by Michael Stanton

My marching orders today were:

OPPORTUNITIES FOR SMALL IMPROVEMENT IN JQUERY SPEEDOMETER BNCHMRK. STOP.
WHY CALLS TO RUNTIME KEYED LOAD? WHY CALLS TO HANDLEAPICALL? WHY? STOP.

These days I’m a manager type, more comfortable droning on in sonorous tones with made-up words like “leveraging” and “embiggen,” but p’raps I can saddle up and look at the code again.


First! To use Jaro’s new ProfView tool! (v8/tools/profview/index.html). Hilariously, I was poking around in /tools and found profviz, an apparently abandoned tool that Jaro didn’t even know about. “As a manager, I encourage you to leverage existing resources for maximum effect.”

He just stared at me.

Okay! Back to work. First I run the Speedometer jQuery benchmark in Chrome like this:

out/Release/chrome --no-sandbox --js-flags="--prof"

I have to use --no-sandbox so that an isolate-v8.log file can be written to disk. And --js-flags is the place to pass V8 options.

I ran the test three times, shut Chrome down, then processed the log file. Oh jeez, there are four of them. Sigh. I’ll try the first one. Running in the chrome directory:

v8/tools/linux-tick-processor --preprocess isolate-0x20c3d10ee000-v8.log >
    mystery.json

I load that file up in ProfView, and see the following:

The first issue is from the function cleanData, where a keyed load ends up in the runtime from both an unoptimized and an optimized version of cleanData. Wha? Can’t we do better? Adding insult to injury, Runtime_KeyedGetProperty calls Blink’s indexedPropertyGetterCallback.

HEADQRTRS THOUGHT WE HAD A FAST PATH FOR THAT.
STOP.

Here is the code:

cleanData: function( elems, /* internal */ acceptData ) {
  var elem, type, id, data, i = 0,
    internalKey = jQuery.expando,
    cache = jQuery.cache,
    deleteExpando = jQuery.support.deleteExpando,
    special = jQuery.event.special;

  for ( ; (elem = elems[i]) != null; i++ ) {

    if ( acceptData || jQuery.acceptData( elem ) ) {
      id = elem[ internalKey ];
      data = id && cache[ id ];
      if ( data ) {
        if ( data.events ) {
           ...

Well, I see at least three keyed loads in the code above, and there are even more below. Hmm, ProfView can’t tell me which keyed load is the expensive one. I’ll use Chrome Developer Tools to take a snapshot of a run to try to get more hints…

The line data = id && cache[ id ] gets most of the ticks. It’s reasonable to conclude this is the heaviest load. In fact, though, it’s the lightest.

Time to visit my old friend, --trace-ic, and watch what happens to this keyed load IC. Things have changed in the old neighborhood…the output no longer goes to the console, but to the log file. M’okay.
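
In case you want to reproduce this, the run itself is the same as before, just with the tracing flag swapped in (a sketch; the isolate log name will differ from run to run):

out/Release/chrome --no-sandbox --js-flags="--trace-ic"

Then the log goes through the IC processor: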

v8/tools/ic-processor isolate-0x5acebee7000-v8.log > mystery-ic.txt

$ cat mystery-ic.txt | grep KeyedLoadIC | grep cleanData
KeyedLoadIC (0->1) at cleanData ../jquery.js:6572:27 0 (map 0x20d03abb5aa9)
KeyedLoadIC (0->.) at cleanData ../jquery.js:6576:16 jQuery19101482418194915336 (map 0x20d03abcffc1)
KeyedLoadIC (.->1) at cleanData ../jquery.js:6576:16 jQuery19101482418194915336 (map 0x20d03abb5529)
KeyedLoadIC (1->N) at cleanData ../jquery.js:6576:16 jQuery19101482418194915336 (map 0x20d03abd0281)
KeyedLoadIC (1->1) at cleanData ../jquery.js:6572:27 0 (map 0x20d03aba4121)
KeyedLoadIC (1->N) at cleanData ../jquery.js:6572:27 1 (map 0x20d03aba4121)
KeyedLoadIC (N->N) at cleanData ../jquery.js:6576:16 jQuery19101482418194915336 (map 0x20d03abcffc1)
KeyedLoadIC (N->N) at cleanData ../jquery.js:6576:16 jQuery19101482418194915336 (map 0x20d03abd5191)
KeyedLoadIC (N->N) at cleanData ../jquery.js:6576:16 jQuery19101482418194915336 (map 0x20d03abb55d9)
KeyedLoadIC (N->N) at cleanData ../jquery.js:6576:16 jQuery19101482418194915336 (map 0x20d03abb62e9)
KeyedLoadIC (N->N) at cleanData ../jquery.js:6576:16 jQuery19101482418194915336 (map 0x20d03abd53f9)
KeyedLoadIC (0->1) at cleanData ../jquery.js:6577:25 14 (map 0x20d03aba2ee9)
KeyedLoadIC (0->1) at cleanData ../jquery.js:6593:17 14 (map 0x20d03aba2ee9)
KeyedLoadIC (N->N) at cleanData ../jquery.js:6576:16 jQuery19101482418194915336 (map 0x20d03abb55d9)
KeyedLoadIC (N->N) at cleanData ../jquery.js:6576:16 jQuery19101482418194915336 (map 0x20d03abb62e9)
KeyedLoadIC (N->N) at cleanData ../jquery.js:6576:16 jQuery19101482418194915336 (map 0x20d03abd53f9)
KeyedLoadIC (N->N) at cleanData ../jquery.js:6576:16 jQuery19101482418194915336 (map 0x20d03abd0281)
KeyedLoadIC (N->N) at cleanData ../jquery.js:6576:16 jQuery19101482418194915336 (map 0x20d03abcffc1)
KeyedLoadIC (N->N) at cleanData ../jquery.js:6576:16 jQuery19101482418194915336 (map 0x20d03abb5529)
...

Hmm. It’s sad to see that for loop with elem = elems[i] go generic. That’s weird. I also notice that I can’t tell from this log output whether the IC is in the optimized or the unoptimized version of cleanData. The old --trace-ic output used to display this. I’ll just take care of that now, yallz.
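
If the state-transition notation in that log looks cryptic, here is a hypothetical little toy (mine, not the benchmark’s) you can run in d8 with --trace-ic to watch a single keyed load march through the same states: uninitialized, then monomorphic on the first map it sees, polymorphic as more maps arrive, and finally megamorphic when it gives up. As noted above, the trace lands in the log file, so it gets the same ic-processor treatment.

function keyedLoad( o, i ) { return o[ i ]; }

keyedLoad( [ 1, 2, 3 ], 0 );           // one receiver map seen so far: monomorphic
keyedLoad( [ 1.5, 2.5 ], 0 );          // a second map (double elements): polymorphic
keyedLoad( [ "a", "b" ], 0 );          // more maps...
keyedLoad( { 0: "not an array" }, 0 );
keyedLoad( new Uint8Array( 4 ), 0 );   // ...until the IC gives up and goes megamorphic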

Turning away from IC tracing for a moment, let’s look at the optimized code for cleanData.

out/Release/chrome --no-sandbox --js-flags="--trace-opt-code --trace-opt" > code.txt

Now I can look at the optimized code for cleanData and see when keyed load ICs are used. At the risk (certainty!) of overbroad generalization, if we use an IC in optimized code it means we’ve given up trying to optimize it. The code is “generic” from our point of view. Hopefully we can do better from this point.

    -- B4 start (loop up to 160) --
0x2e72378a7380   120  488b4dd8       REX.W movq rcx,[rbp-0x28]
0x2e72378a7384   124  48c1e120       REX.W shlq rcx, 32
    -- ../jquery.js:6572:27 --
0x2e72378a7388   128  488b75f8       REX.W movq rsi,[rbp-0x8]
0x2e72378a738c   12c  488b5518       REX.W movq rdx,[rbp+0x18]
0x2e72378a7390   130  48b8000000000e000000 REX.W movq rax,0xe00000000
0x2e72378a739a   13a  4c8bc0         REX.W movq r8,rax
0x2e72378a739d   13d  e89e8ef2ff     call 0x2e72377d0240  (KeyedLoadICTrampoline)
    ;; code: KEYED_LOAD_IC

Yuck, this is the for loop with elem = elems[i]. So sad that we are generic! I really want to know why. I ran Chrome under the debugger, first adjusting the GN args for out/Debug (gn args out/Debug) to include v8_optimized_debug = false so I could inspect variables. I found the place in the IC system where we update the IC state, using printf-style debugging to stop at the right place:


// in src/ic/ic.cc:

void KeyedLoadIC::UpdateLoadElement(Handle<HeapObject> receiver) {
  ...

  // Temporary hack: only announce ourselves when the host function is jQuery's
  // cleanData, so that a breakpoint on the PrintF stops exactly where I want.
  if (strcmp("cleanData",
             GetHostFunction()->shared()->DebugName()->ToCString().get()) == 0) {
    PrintF("This is your stop.\n");
  }
  ...

Debugging Chrome can be tricky; here is the command line I used:

gdb --args out/Debug/chrome --disable-gpu --single-process --no-sandbox
    http://browserbench.org/Speedometer/InteractiveRunner.html

Setting a breakpoint on the PrintF statement, I first hit it when we went monomorphic. Then we come in with a new map and stay monomorphic. On the third hit we go generic with the message “same map added twice.” I could inspect the receiver and the feedback vector with our friendly debugger helpers (available in v8/tools/gdbinit).

(gdb) frame 2
#2  0x00007f2a77e8a2d1 in v8::internal::__RT_impl_Runtime_KeyedLoadIC_Miss (args=...,
    isolate=0xe93d3eb9020) at ../../v8/src/ic/ic.cc:2439
2439      RETURN_RESULT_OR_FAILURE(isolate, ic.Load(receiver, key));
(gdb)

In this frame I can look at the input variables. The receiver:

(gdb) job *receiver
0x9cf0a77a209: [JSArray]
 - map = 0x3114690a4121 [FastProperties]
 - prototype = 0x3f6568a88141
 - elements = 0x9cf0a77a239 <FixedArray[1]> [FAST_ELEMENTS]
 - length = 1
 - properties = 0x61c9ec02241 <FixedArray[0]> {
    #length: 0x38b9c858c59 <AccessorInfo> (const accessor descriptor)
 }

Hmm, we have an elements backing store of length 1. What about our key?

(gdb) p key
$8 = {<v8::internal::HandleBase> = {location_ = 0x7f2a495e8920}, <No data fields>}
(gdb) job *key
 Smi: 0x1 (1)

Oh. The key is a Smi with the value 1. Sigh. We are reading past the end of the array. A quick glance at the KeyedLoadIC dispatcher shows that we MISS to the runtime in this case. The code lives in AccessorAssembler::HandleLoadICSmiHandlerCase(); I won’t go into it here, but it has a telling line of code: Label* out_of_bounds = miss; LOL. An out-of-bounds access means a trip to the runtime, so you see, V8 is not optimized for clever code like for ( ; (elem = elems[i]) != null; i++ ) {.
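
To make that concrete, here is a minimal sketch of what the loop does with the receiver we just inspected (a one-element array):

var elems = [ {} ];   // length 1, like the receiver above
var elem, i = 0;
for ( ; ( elem = elems[ i ] ) != null; i++ ) {
  // the body runs exactly once, for i = 0
}
// The loop only terminates after evaluating elems[ 1 ] and getting undefined.
// That load is out of bounds, and out of bounds means a MISS to the runtime,
// on every single call to cleanData.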

I locally changed the code to look more old-fashioned because I wanted to see the KeyedLoadIC go away.

cleanData: function( elems, /* internal */ acceptData ) {
  ...
  var len = elems.length;
  for ( ; i < len; i++ ) {
    elem = elems[i];
    ...

My change left the keyed load in a polymorphic state with two maps. Why wasn’t this enough to emit optimized loads? Time to debug TurboFan. We have code in JSNativeContextSpecialization::ReduceElementAccess() which builds polymorphic loads keyed on the receiver map. I stepped into this…

A look at the feedback vector made me realize it wouldn’t be so simple. One of the maps we have to handle has an indexed interceptor.

(gdb) jfv *nexus.vector_handle_
0x5932c9d0ff1: [FeedbackVector]
 - length: 66
 Slot #0 LoadProperty MONOMORPHIC
 ...
 Slot #16 LoadKeyed POLYMORPHIC
  [18]: 0xfda86d07429 <FixedArray[4]>
  [19]: 0x1428fdb04e21 <Symbol: uninitialized_symbol>
 Slot #18 Call MONOMORPHIC
 ...
(gdb) job 0xfda86d07429
0xfda86d07429: [FixedArray]
 - map = 0xfca2f802309 <Map(FAST_HOLEY_ELEMENTS)>
 - length: 4
           0: 0x1ec5780083f9 <WeakCell value= 0x3ba1419a4121 <Map(FAST_ELEMENTS)>>
           1: 72
           2: 0x1ae5bf10de49 <WeakCell value= 0x3ba1419b5b01 <Map(FAST_HOLEY_SMI_ELEMENTS)>>
           3: 0x29f74c85261 <Code HANDLER>

What this means is that we have a polymorphic keyed load with smi-like keys (the uninitialized symbol in [19] tells us this). The fixed array is a mapping from receiver map to handler. Sometimes the handler is code, sometimes it’s a smi telling us what to do. A look at that code object shows that it’s the indexed interceptor handler:

(gdb) jco 0x29f74c85261
0x29f74c85261: [Code]
kind = HANDLER
major_key = LoadIndexedInterceptorStub
compiler = turbofan
Instructions (size = 92)
... (I removed some stuff) ...
0x29f74c85318    38  48bb609731d0d17f0000 REX.W movq rbx,0x7fd1d0319760
    ;; external reference (Runtime::LoadElementWithInterceptor)
0x29f74c85322    42  b802000000     movl rax,0x2
0x29f74c85327    47  4889542410     REX.W movq [rsp+0x10],rdx
0x29f74c8532c    4c  e9cff1dfff     jmp 0x29f74a84500 ;; code: STUB, CEntryStub, minor: 8
// Call LoadElementWithInterceptor through the CEntryStub.

Back in JSNativeContextSpecialization::ReduceElementAccess(), I see the writing on the wall. We only do fast handling of polymorphic keyed element loads if all the Maps are simple fast element arrays. We call AccessInfoFactory::ComputeElementAccessInfos() which runs a test on each Map that succinctly expresses what we consider reasonable to inline:

bool CanInlineElementAccess(Handle<Map> map) {
  if (!map->IsJSObjectMap()) return false;
  if (map->is_access_check_needed()) return false;
  if (map->has_indexed_interceptor()) return false;  // Tears of a clown...
  ElementsKind const elements_kind = map->elements_kind();
  if (IsFastElementsKind(elements_kind)) return true;
  if (IsFixedTypedArrayElementsKind(elements_kind)) return true;
  return false;
}
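
And where does a map with an indexed interceptor come from? Not from jQuery itself; it’s the signature of a collection whose indexed access is answered by Blink, which lines up with the indexedPropertyGetterCallback we saw at the very beginning. A hedged sketch of the kind of receiver that trips the check above (the exact collection type flowing into cleanData may differ):

// A DOM collection keeps its elements on the Blink side, so elems[ i ] has to
// go through an indexed interceptor rather than a normal elements backing store.
var elems = document.getElementsByTagName( "div" );   // e.g. an HTMLCollection
var elem = elems[ 0 ];   // map->has_indexed_interceptor() is true, so no inlining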

So I know that the elem = elems[i] code is generic for two reasons: out of bounds access and the use of an indexed interceptor on one of the maps. I waddled over to Toon’s desk (they feed us a lot here, jus’ sayin), and asked him about the possibility of optimizing indexed interceptor access. He pointed out that even if we were able to avoid going through the IC handler that makes the runtime call, we are still on our way into C++ on the Blink side. For now then, I’ll quit trying to optimize this place in the code.

Moving on to id = elem[ internalKey ]

internalKey is a string like “jQuery1910…”. The receiver is a JS_API_OBJECT_TYPE.

(gdb) job *object
0x3c19691f4d91: [JS_API_OBJECT_TYPE]
 - map = 0xf0136cb4f51 [FastProperties]
 - prototype = 0x230e375a0009
 - elements = 0x1a845fd82241 <FixedArray[0]> [FAST_HOLEY_SMI_ELEMENTS]
 - embedder fields: 2
 - properties = 0x1a845fd82241 <FixedArray[0]> {}
 - embedder fields = {
    0x7fe301dfb650
    0x1046fdd4ef80
 }

A comment in objects.h sez “Like JS_OBJECT_TYPE but created from Api functions.” Cool.

Anyway, in the IC system we don’t find the property on the object, so we prepare for a non-existent load. We create a Tuple3, a struct with three values: a holder cell, a smi-handler, and a validity cell. This information makes it possible to handle the load as efficiently as possible while safeguarding against structural changes that would invalidate the approach.

The smi-handler value is 0x7, which means it’s a handler meant to “load” non-existent properties. Generally, this means returning undefined to the caller. The validity cell protects against this ever changing.
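
In jQuery terms, this is simply the first time cleanData meets a node that never had data attached to it. A minimal sketch (the expando string is lifted from the trace above, and a plain object stands in for the wrapper):

var internalKey = "jQuery19101482418194915336";   // jQuery.expando
var elem = {};                                    // stand-in for a freshly wrapped DOM node
var id = elem[ internalKey ];                     // the property doesn't exist
// id is undefined, so id && cache[ id ] short-circuits and cleanData moves on.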

How did I find all this out? By poking around in the feedback vector. I did this after the first miss, so the IC is in monomorphic state at this point; it’s only later that we go megamorphic. Here is the feedback vector slot for our keyed load:

(gdb) jfv nexus_->vector()
0x671f59d0e21: [FeedbackVector]
 - length: 64
....
 Slot #20 LoadKeyed MONOMORPHIC
   [22]: 0x231997de8b99 <String[27]: jQuery191044337671379997645>
   [23]: 0x2a9781fd401 <FixedArray[2]>

and here is the data with the Tuple3 structure, as pulled out of the FixedArray in [23] above:

(gdb) job 0x2a9781fd401
0x2a9781fd401: [FixedArray]
 - map = 0x1a6253c82309 <Map(FAST_HOLEY_ELEMENTS)>
 - length: 2
 0: 0x231997de25f9 <WeakCell value= 0xad329834ef9 <Map(FAST_HOLEY_SMI_ELEMENTS)>>
 1: 0x671f59d63e1 <Tuple3>
(gdb) job 0x671f59d63e1
0x671f59d63e1: [Tuple3]
 - value1: 0x3e7913e02201 <null>
 - value2: 7
 - value3: 0x231997de25d1 <Cell value= 0>

So this load is quite complicated already, and it soon goes megamorphic because we come in with different receiver maps.

Anything else to do?

The last keyed load I’m interested in is for the line data = id && cache[ id ]. cache is the jQuery cache object, just an ordinary object with some 150 holey array elements. This keyed load is monomorphic and uses a simple element handler. TurboFan should definitely optimize this. In the vector we store the map of the jQuery cache, and a smi value to represent what we should do. It encodes whether or not the object is a JSArray (it is not), and what the ElementsKind is (FAST_HOLEY_ELEMENTS).
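
A rough sketch of the shape of that object (ids and contents invented; the point is that small integer keys on a plain object land in holey element storage):

var cache = {};                          // jQuery.cache: just a plain object
cache[ 1 ] = { data: {}, events: {} };   // small integer keys go into element storage
cache[ 2 ] = { data: {} };
// ...roughly 150 of these, with gaps, hence FAST_HOLEY_ELEMENTS...
var id = 2;
var data = id && cache[ id ];            // one receiver map, one simple element handler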

Happily, this load has been optimized in TurboFan.

Unhappily for me, I don’t have an A-HA! discovery in this block to crow over. Performance investigations are sometimes like that.

As for the Builtin_HandleApiCall issue in the unoptimized JavaScript function curCSS in the output from Jaro’s tool, Benedikt did some follow-up there, recognizing cases where we could use a stub to make the API call less expensive. A story for another day…