This is wrong. The notion of predictability in modern branch predictors is much ...

seanmcdirmid · on Aug 10, 2015

Interesting. I do know that some BPUs can disable branch prediction if it fails a few times (to say, given i % 2 == 0, it would disable branch prediction on the third or fourth iteration). I wasn't aware that they would actually go and find continuity in non-continuous branch behavior based on modulo alone, which isn't that common.

So let's run an experiment, shall we. Given Count = x.Length = 10000000 in C# (.NET, but the GC will never run):

Code:

        for (var i = 0; i < Count; i += 1)
          if (i % 2 == 0)
            x[i] = 0;
          else x[i] = 1;

Time: 55-80 milliseconds

Now the same functionality in two non-branching loops:

        for (var i = 0; i < Count; i += 2)
          x[i] = 0;
        for (var i = 1; i < Count; i += 2)
          x[i] = 1;

About 25-28 milliseconds. Now, to make it interesting, we do zero for the first half and one for the second half:

        for (var i = 0; i < Count; i += 1)
          if (i < Count / 2)
            x[i] = 0;
          else x[i] = 1;

40-42 milliseconds. Definitely, a coherent branch is much faster than an incoherent one, while no branching wins the day as always. (ok, not true, see edit).

Edit: so I was probably running into some jitter problems. Running the tests 20 times gives me 41 for the first, 21 for the second, and 39 for the last. So the first and last are comparable in time. Sorry for the sloppy benchmarking!

Edit 2: so if I take the Count/2 out of the loop, my numbers for the third test become fast again: 42 for the first, 19 for the second, 26 for the third (average time milliseconds for doing the loops 10 times). The comparison is still hardly fair since I'm not lifting the modulo out of the first loop, however. If I use a boolean that is continuously negated instead the modulo, the results are 29 for the first, which is much closer to the third.

caf · on Aug 10, 2015

Running the C equivalent of your first example under cachegrind with --branch-sim=yes, it shows 10,000,000 conditional branches at the if (i % 2 == 0) line, with 3 mispredicts.

nkurz · on Aug 10, 2015

Cachegrind simulates caching on a generic CPU, and thus does not tell you how the code will actually run on any particular modern CPU. In this case, the results agree, but I'm of the strong opinion that cachegrind's time has passed. Unless you want a 'theoretical' rather than real answer, you are almost always better off today using the CPUs built in performance counters. That said, I do appreciate that you are offering real numbers rather than guesswork like I am.

caf · on Aug 10, 2015

I think that basic branch-sims are still useful as a kind of lowest common denominator - if the simple-minded branch sim can predict it OK, it'll almost certainly go OK on a wide variety of real CPUs. Do you disagree?

Benchmarking on a particular CPU does give you more precise results, but at the risk of them being less generally applicable.

nkurz · on Aug 10, 2015

Sure, I think it's fair to say that if cachegrind finds the branching to be predictable, that any modern processor will do fine on it. But if cachegrind predicts poor performance, I wouldn't suggest changing your code unless you've discovered that the actual performance is poor on a real processor. Overall, I think cachegrind's branch prediction is nowadays less accurate than the rule of thumb that "if there is a pattern the CPU will find it". Since using performance counters is so simple and accurate (at least on x86/x64), I'm not sure that there is a benefit to using the slower, less-accurate older method of emulation.

You response encouraged me to check whether the branch predictor in cachegrind had been updated since I last looked at it. It doesn't look like it. It's still pretty simple, about 20 lines of actual code: https://github.com/fredericgermain/valgrind/blob/master/cach...

I greatly appreciated the honesty in one of the comments: "TODO: use predictor written by someone who understands this stuff."

It is a valid question might whether the consistency from processor to processor is any better than cachegrind or a rule of thumb. My experience was that for Nehalem/Sandy Bridge/Haswell things were more same than different (and only got better), but I don't know about other lines.

seanmcdirmid · on Aug 10, 2015

Here is a more exhaustive test that throws in a random boolean array (bb) for a total mispredict scenario (always read, sometimes used)

First, modulo oscillation:

      var b = true;
      for (var i = 0; i < x.Length; i += 1)
      {
        var cc = bb[i];
        if (b)
          x[i] = 0;
        else
          x[i] = 1;
        b = !b;
      }

Time: 36 milliseconds (averaged over 10 runs)

Case 2, no branch:

    for (var i = 0; i < x.Length; i += 2)
    {
      var c = bb[i];
      x[i] = 0;
    }
    for (var i = 1; i < x.Length; i += 2)
    {
      var c = bb[i];
      x[i] = 1;
    }

Time: 22 milliseconds

Case 3, coherent branching:

    var kk = x.Length / 2;
    for (var i = 0; i < x.Length; i += 1)
    {
      var c = bb[i];
      if (i < kk)
        x[i] = 0;
      else x[i] = 1;
    }

Time: 31 milliseconds

Case 4, totally random branching:

    for (var i = 0; i < x.Length; i += 1)
        if (bb[i])
          x[i] = 0;
        else x[i] = 1;

Time: 75 milliseconds (the slowness could come from bad branch prediction, or the fact that a load needs to complete before the condition can be verified!)

caf · on Aug 10, 2015

My point is just to reiterate that the 0,1,0,1,0,1,... pattern of i % 2 == 0 is actually exactly the kind of pattern that branch predictors excel at, which you seemed to be skeptical of.

seanmcdirmid · on Aug 10, 2015

Right, I was wrong. I'm wondering what the use case is for that kind of silicon... interpreters?