Wednesday, May 21, 2014

Making dispatch_once() Fast

I had assumed that dispatch_once() was implemented as a basic atomic compare-and-swap, but the source for dispatch_once_f contains an interesting comment:

Normally, a barrier on the read side is used to workaround the weakly ordered memory model. But barriers are expensive and we only need to synchronize once! After func(ctxt) completes, the predicate will be marked as “done” and the branch predictor will correctly skip the call to dispatch_once*().

A far faster alternative solution: Defeat the speculative read-ahead of peer CPUs.

Modern architectures will throw away speculative results once a branch mis-prediction occurs. Therefore, if we can ensure that the predicate is not marked as being complete until long after the last store by func(ctxt), then we have defeated the read-ahead of peer CPUs.

In other words, the last “store” by func(ctxt) must complete and then N cycles must elapse before ~0l is stored to *val. The value of N is whatever is sufficient to defeat the read-ahead mechanism of peer CPUs.

On some CPUs, the most fully synchronizing instruction might need to be issued.

N is determined by dispatch_atomic_maximally_synchronizing_barrier(), which has different assembly language implementations for different architectures.

Update (2014-05-28): Greg Parker explains a consequence of this optimization:

dispatch_once_t must not be an instance variable.

The implementation of dispatch_once() requires that the dispatch_once_t is zero, and has never been non-zero. The previously-not-zero case would need additional memory barriers to work correctly, but dispatch_once() omits those barriers for performance reasons.

Instance variables are initialized to zero, but their memory may have previously stored another value. This makes them unsafe for dispatch_once() use.

Update (2014-06-06): Mike Ash:

While the comment in the dispatch_once source code is fascinating and informative, it doesn’t quite delve into the detail that some would like to see. Since this is one of my favorite hacks, for today’s article I’m going to discuss exactly what’s going on there and how it all works.

1 Comment RSS · Twitter

[…] Logan Collins (compare with Apple’s): […]

Leave a Comment