dispatch_once 的秘密

Mike Ash Friday Q&A 中文译文:dispatch_once 的秘密

作者 TommyWu
封面圖片: dispatch_once 的秘密

译文 · 原文: Friday Q&A 2014-06-06: Secrets of dispatch_once · 作者 Mike Ash

原文:https://www.mikeash.com/pyblog/friday-qa-2014-06-06-secrets-of-dispatch_once.html 发布:2014-06-06 作者:Mike Ash 译者:MiMo(mimo-v2.5-pro);代码块保留英文原样


读者 Paul Kim 指出了 Michael Tsai 博客上一篇关于提升 dispatch_once 速度的文章。虽然 dispatch_once 源码中的注释既引人入胜又极具信息量,但它并未完全深入某些人期望了解的细节层面。由于这是我钟爱的技巧之一,在今天的文章中,我将详细探讨其背后的原理和工作机制。

API dispatch_once 的行为正如其名:它确保某段代码执行且仅执行一次。

该函数接收两个参数。第一个是用于追踪” 一次性” 状态的谓词(predicate)。第二个是首次调用时需要执行的块(block)。其调用形式如下:

static dispatch_once_t predicate;
dispatch_once(&predicate, ^{
// some one-time task
});

这是一个用于懒加载初始化共享状态的优秀 API(应用程序接口),无论是全局字典、单例实例、缓存,还是任何其他需要在首次使用时进行一些设置的东西。

在单线程世界中,这个调用会比较无趣,可以用一个简单的 if 语句替代。然而,我们生活在多线程世界中,而 dispatch_once 是线程安全的。它保证从多个线程同时多次调用 dispatch_once 时,只会执行一次 block(代码块),并且所有线程都会在执行完成之前等待,直到 dispatch_once 返回。即使这样,自己实现也并非太难,但 dispatch_once 还极其快速,而这才是真的难以实现。

单线程实现 我们首先来看一下这个函数的一个简单化的单线程实现。这在实际应用中几乎没什么用,但它有助于具体理解其语义。请注意,dispatch_once_t 只是 long 类型的 typedef(类型别名),初始化为零,其他值的含义则由实现来定义。函数如下:

void SimpleOnce(dispatch_once_t *predicate, dispatch_block_t block) {
if(!*predicate) {
block();
*predicate = 1;
}
}

实现很简单:如果谓词值为零,就调用 block 并将其设为一。后续调用会看到这个一而不会再次调用 block。这正是我们需要的,只是它在多线程环境中完全不安全。两个线程可能同时进入 if 语句,导致两者都调用 block,这就有问题了。不幸的是,正如常见情况,要让这段代码线程安全意味着巨大的性能损失。

性能
当讨论 dispatch_once 的性能时,实际上有三种不同场景需要考虑:

  • 首次对某个给定谓词调用 dispatch_once,这会执行 block。
  • 首次调用之后、但 block 执行完成之前的调用。此时,调用者必须等待 block 完成才能继续。
  • 首次调用之后、且 block 已执行完毕的调用。无需等待,可以立即继续。

场景 #1 的性能基本不重要,只要不是慢得离谱就行。毕竟它只发生一次。

场景 #2 的性能同样无关紧要。这种情况可能发生多次,但仅发生在 block 执行期间。在大多数情况下,它根本不会发生。即使发生,通常也只发生一次。即使在极端的压力测试中,大量线程同时调用 dispatch_once 且 block 执行时间很长,调用次数也仍会限制在数千次以内。这些调用无论如何都必须等待 block 执行完成,因此即使它们在此过程中消耗了一些不必要的 CPU 时间,也并非什么大问题。

场景 #3 的性能则极其重要。此类调用可能在程序执行过程中发生数百万甚至数十亿次。我们希望能够使用 dispatch_once 来保护那些其结果被广泛使用的只计算一次的操作。理想情况下,在某个计算上使用 dispatch_once,其开销不应超过预先显式进行计算并通过某个全局变量返回结果。换句话说,一旦进入场景 #3,我们确实希望以下两段代码的性能完全相同:

id gObject;
void Compute(void) {
gObject = ...;
}
id Fetch(void) {
return gObject;
}
id DispatchFetch(void) {
static id object;
static dispatch_once_t predicate;
dispatch_once(&predicate, ^{
object = ...;
});
return object;
}

经过内联和优化后,SimpleOnce 函数已接近实现这一目标。在我的测试中,该函数在我的计算机上执行仅需约半纳秒。这将成为线程安全版本的黄金标准。


实现线程安全的标准方法是将所有对共享数据的访问置于锁的保护之下。但这里存在一个难题,因为 dispatch_once_t 仅是一个长整型(long),没有为锁预留空间,因此难以将锁与所保护的数据相邻放置。

我们可以直接修改 API 以接受包含锁和标志的结构体。但为了保持与 dispatch_once 相同的函数签名,并且因为这只是演示代码,我决定采用单个全局锁。代码使用一个静态 pthread_mutex_t(POSIX 线程互斥锁)来保护对 predicate(断言)的所有访问。在实际程序中,若存在多个不同的断言,这种做法将非常糟糕,因为不相关的断言仍需相互等待。对于仅测试单个断言的快速示例而言,这是可行的。代码与之前相同,只是在外部增加了锁保护:

void LockedOnce(dispatch_once_t *predicate, dispatch_block_t block) {
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_lock(&mutex);
if(!*predicate) {
block();
*predicate = 1;
}
pthread_mutex_unlock(&mutex);
}

这段代码是线程安全的,但遗憾的是速度慢了很多。在我的电脑上,每次调用耗时约 30 纳秒,而前一个版本仅需半纳秒。锁本身的速度相当快,但在需要计较纳秒的场合下就不是这样了。

自旋锁(Spinlock)

自旋锁是一种通过尽量减少锁自身开销来实现锁机制的方式。其名称源于自旋锁在需要等待时会在锁上” 自旋”—— 即反复轮询检查锁是否已被释放。普通的锁会与操作系统协调,让等待的线程进入睡眠状态,并在锁释放时唤醒它们。这种方式节省了 CPU 时间,但额外的协调工作并非没有代价。通过减少这种协调,自旋锁能在锁未被持有时节省时间,但代价是当多个线程同时尝试获取锁时效率较低。

OS X 通过OSSpinLock系列函数提供了便捷的自旋锁 API。使用OSSpinLock实现LockedOnce只需替换几个名称:

void SpinlockOnce(dispatch_once_t *predicate, dispatch_block_t block) {
static OSSpinLock lock = OS_SPINLOCK_INIT;
OSSpinLockLock(&lock);
if(!*predicate) {
block();
*predicate = 1;
}
OSSpinLockUnlock(&lock);
}

这是一个相当大的改进,在我的电脑上每次调用耗时约 6.5 纳秒,而 pthread_mutex 版本每次调用需要 30 纳秒。然而,这仍然远低于不安全版本的半纳秒耗时。

原子操作(Atomic operations)是即使在没有锁的情况下也始终保证线程安全的底层 CPU 操作。(技术上,它们在硬件层面使用了锁,但那是实现细节。)它们是构建锁机制的基础组件。当锁的开销过大时,直接使用原子操作可能是提升性能的一种途径。无锁的多线程编程可能极其棘手,因此除非确实必要,否则不建议采用这种做法。在这个案例中,我们讨论的是一个可能被广泛使用的操作系统库,所以它可能符合条件。

原子操作的基本构件是比较并交换(compare and swap)。这是一个单一操作,其功能等价于:

BOOL CompareAndSwap(long *ptr, long testValue, long newValue) {
if(*ptr == testValue) {
*ptr = newValue;
return YES;
}
return NO;
}

用文字描述的话,这个操作检测内存中的某个位置是否包含特定值,如果是,则将其替换为新值。它返回值是否匹配。由于「比较并交换」是以原子 CPU 指令实现的,因此可以保证:如果多个线程同时尝试对同一内存位置执行相同的比较并交换操作,只有一个会成功。

这个版本函数的实现策略是为谓词赋三个值。0 表示该位置从未被访问。1 表示对应的代码块正在执行中,任何调用者都应等待。2 表示代码块已执行完成,所有调用者均可返回。

使用比较并交换操作来检查是否为 0,如果是则原子性地将状态转换为 1。如果成功,则该线程是第一个发起调用的,它将接着执行代码块。执行完成后,它会将谓词设置为 2 以表示已完成。

如果比较并交换操作失败,线程将进入一个循环,不断检查是否变为 2,直到成功。这将导致它等待其他线程完成代码块的执行。

第一步是将谓词指针(predicate pointer)转换为 volatile 指针,以让编译器相信该值可能在函数执行过程中被其他线程修改:

void AtomicBuiltinsOnce(dispatch_once_t *predicate, dispatch_block_t block) {
volatile dispatch_once_t *volatilePredicate = predicate;

接下来是 compare and swap(比较交换)操作。GCC 和 Clang 都提供了各种以 __sync 开头的内置函数来实现原子操作(atomic operations)。也有更新的以 __atomic 开头的函数,但在这个实验中我沿用了自己熟悉的那些。这个调用对 predicate 执行了一次原子比较交换(atomic compare and swap),测试其是否为 0,如果匹配则将其设置为 1:

if(__sync_bool_compare_and_swap(volatilePredicate, 0, 1)) {

该函数在操作成功时返回 true。此处的成功意味着谓词为零且本次调用是首次调用,这意味着接下来的任务是调用块。

block();

一旦 block 执行完毕,就应该将谓词(predicate)设置为 2,以告知所有等待的线程以及未来的调用者这一事实。然而在此之前,我们需要设置一个内存屏障(memory barrier),以确保所有线程都能看到正确的读写顺序。这一点稍后会详细说明。内建函数 __sync_synchronize 就是用于执行内存屏障的:

__sync_synchronize();

这样就可以安全地设置谓词了:

*volatilePredicate = 2;
} else {

如果条件谓词(predicate)不为 0,则进入一个循环等待其变为 2。如果它已经是 2,该循环将立即终止。如果它是 1,则它将停留在此循环中,不断重新测试谓词的值,直到它变为 2:

while(*volatilePredicate != 2)
;

在返回之前,这里也需要一个内存屏障(memory barrier)来与上方的屏障匹配。同样,稍后会详细说明:

__sync_synchronize();
}
}

这行得通,而且应该是安全的。(无锁线程代码(Lockless threaded code)本身就足够复杂,我花了比现在更多的思考才能完全确认其安全性。但这属于相当简单的场景,况且我们在此并非要构建生产级别的实现。)

性能如何呢?结果证明并不理想。在我的电脑上,每次调用约需 20 纳秒,远高于自旋锁版本。

提前退出
这段代码存在一个明显的优化点:常见情况是谓词包含 2,但代码会先检测 0。通过首先检测 2 并提前退出,可以加快常见情况的处理。实现方式很简单:在函数顶部增加对 2 的检测,若匹配则在执行内存屏障(memory barrier)后返回:

void EarlyBailoutAtomicBuiltinsOnce(dispatch_once_t *predicate, dispatch_block_t block) {
if(*predicate == 2) {
__sync_synchronize();
return;
}
volatile dispatch_once_t *volatilePredicate = predicate;
if(__sync_bool_compare_and_swap(volatilePredicate, 0, 1)) {
block();
__sync_synchronize();
*volatilePredicate = 2;
} else {
while(*volatilePredicate != 2)
;
__sync_synchronize();
}
}

相比代码的第一版,这已是一个不错的改进,每次调用大约耗时 11.5 纳秒。然而,这仍然比半纳秒的目标要慢得多,甚至比自旋锁(spinlock)代码还要慢。

内存屏障(memory barriers)并非零开销,这解释了为什么这段代码比目标慢得多。至于为何它比自旋锁更慢,是因为存在不同种类的内存屏障。__sync_synchronize 生成 mfence 指令,这是最严格的一种,能够处理诸如 SSE4 流式读写(streaming reads / writes)之类的特殊场景,而 OSSpinLock 则使用一种更廉价、适用于普通代码的屏障。我们可以微调此代码中使用的具体屏障类型来获得更好的性能,但很明显其成本仍将高于我们的预期,因此我将跳过这部分。

不安全的提前退出 让我们再看一个版本的代码。它与前一个版本完全相同,只是去掉了内存屏障:

void UnsafeEarlyBailoutAtomicBuiltinsOnce(dispatch_once_t *predicate, dispatch_block_t block) {
if(*predicate == 2)
return;
volatile dispatch_once_t *volatilePredicate = predicate;
if(__sync_bool_compare_and_swap(volatilePredicate, 0, 1)) {
block();
*volatilePredicate = 2;
} else {
while(*volatilePredicate != 2)
;
}
}

不出所料,其性能表现与 SimpleOnce(简单单次)同样快速,每次调用仅需约半纳秒。由于 *predicate == 2 是迄今最常见的场景,几乎每次调用都只执行该检查并返回。在代码块(block)已被调用过的情况下,它执行的工作量与 SimpleOnce 相同。

然而,正如其名所示,缺乏内存屏障(memory barriers)使得这种做法并不安全。这是为什么呢?

分支预测,乱序执行,与你

我们常以为 CPU 是简单机器。我们告诉它做一件事,它就执行。接着我们告诉它做下一件事,它也照做。如此循环,直到我们厌倦或断电。

曾几何时,这确是事实。旧式 CPU 确实是如上所述般工作的简单机器:取一条指令,执行它;再取下一条指令,再执行它。

不幸的是,这种方法虽然简单、廉价且易于实现,速度却不够快。摩尔定律(Moore’s Law)赐予了我们日益增多的晶体管,使我们得以将其堆砌进 CPU 中。8086 大约由 29,000 个晶体管构成,而一款采用英特尔 Haswell 架构的 CPU 则包含了超过十亿个晶体管。

市场渴望更出色的计算机游戏(以及少数希望更快执行各种枯燥商业任务的人,尽管这些任务本身并不值得详述),这促使 CPU 制造商将技术进步用于提升性能。现代 CPU 是数十年工作的结晶,旨在将更多晶体管转化为更快的计算机。

要让 CPU 运行得更快,需要运用许多技巧。其中之一便是流水线技术(pipelining)。观察执行单条 CPU 指令所需的过程,会发现它包含许多微小的步骤:

1. Load the instruction from memory.
2. Decode the instruction. (That is, figure out which instruction it is, and figure out what the operands are.)
3. Load the input data.
4. Perform the operation on the input data.
5. Save the output.

在早期的 CPU 上,工作流程大致如下:

load instruction
decode
load data
operation
save data
load instruction
decode
load data
operation
save data
...

但只要拥有足够资源,实际上可以并行执行其中许多操作。在流水线 CPU 上,工作流程可能会更接近于这样:

load instruction
decode load instruction
load data decode load instruction
operation load data decode
save data operation load data
save data operation
save data

这要快得多!随着晶体管数量增加,系统可以变得更加复杂,能同时并行执行许多指令。

更进一步地,如果看起来能提升速度,指令可以完全乱序执行。与上面的简化示例不同,现实中的指令往往需要更多步骤,且步骤数量差异很大。例如,对主内存的读写操作可能消耗大量时间。通过执行指令流中后续的其他任务,CPU 可以从事有效工作而非闲置。正因如此,CPU 可能发现按乱序(而非指令流中的对应顺序)发出内存读写操作更为有利。

所有这些机制的最终结果是:你的代码并不总是按照表面看起来的顺序执行。如果你写下:

x = 1;
y = 2;

你的 CPU 可能会先执行对 y 的写入。编译器在某些情况下也可能对这类语句进行重排序,但即使你消除了编译器层面的重排序,CPU 本身依然可能这样做。如果你的系统中有不止一个 CPU(而如今我们几乎总是如此),那么其他的 CPU 会观察到这些乱序的写入。即使写入操作是按顺序执行的,其他 CPU 也可能执行乱序的读取。综合来看,另一个正在读取 xy 的线程,可能会在看到 y 的值为 2 的同时,仍然看到 x 的旧值。

有时你绝对需要这些值按照正确的顺序被写入和读取,这时就轮到 memory barrier(内存屏障)登场了。在上述代码中添加一个内存屏障,可以确保 x 被先写入:

x = 1;
memory_barrier();
y = 2;

类似地,读取时设置屏障可确保读操作按正确顺序执行:

use(x);
memory_barrier();
use(y);

然而,由于内存屏障(memory barrier)存在的唯一意义就是挫败 CPU 试图加速的努力,因此其本身存在固有的性能代价。

这一点在 dispatch_once 和懒初始化(lazy initialization)中尤为关键,因为它们在执行过程中会进行多次顺序读写操作,而这些操作的顺序至关重要。例如,典型的懒对象初始化模式如下所示:

static SomeClass *obj;
static dispatch_once_t predicate;
dispatch_once(&predicate, ^{ obj = [[SomeClass alloc] init]; });
[obj doStuff];

如果在检查predicate之前读取了obj,那么此代码读取它时它仍可能为nil—— 恰好在另一线程将最终值写入该变量并设置predicate之前。随后此代码可能读取到predicate表示作业已完成,进而使用未初始化的nil继续执行。甚至有可能:此代码读取了obj的正确值,但从该对象分配的内存中读取到未初始化的数据,从而在尝试发送doStuff消息时导致崩溃。

因此,dispatch_once需要内存屏障(memory barrier)。但正如我们所见,内存屏障相对较慢,若能避免,我们不愿在常见场景中付出这种代价。

分支预测与推测执行
流水线化的乱序执行模型对线性指令序列效果很好,但遇到条件分支时就会出现问题。CPU 无法在评估分支条件之前确定从哪里开始获取更多指令 —— 而分支条件通常依赖于紧邻的前序指令。CPU 必须暂停、等待先前工作完成,评估分支条件后才能继续执行。这称为流水线停顿(pipeline stall),会造成显著的性能损失。

为了弥补这种情况,CPU 会进行推测执行(speculative execution)。当遇到条件分支时,CPU 会预测分支可能的走向。现代 CPU 拥有复杂的分支预测(branch prediction)硬件,通常能以超过 90% 的准确率猜中结果。它们会直接从预测的分支开始执行指令,而不是等待分支条件被评估。如果预测正确,就继续执行下去;如果预测错误,就丢弃所有推测执行的结果,并从另一条分支重新开始。

这正是 dispatch_once 读取端面临的情况,也是我们希望尽可能提升性能的关键所在。代码中对断言(predicate)的值进行条件分支判断。CPU 应当预测该分支会走常见路径 —— 即跳过代码块执行并立即返回。在此分支的推测执行(speculative execution)期间,CPU 可能会在其他线程初始化相关值之前,就提前从内存加载后续所需的数据。倘若这个预测最终正确,CPU 就会提交这次使用了未初始化数值的推测执行。

不对称的屏障
写入屏障(write barrier)通常需要保持对称性。写入端设置屏障以确保写入操作按正确顺序完成,读取端设置屏障则确保读取操作按正确顺序进行。然而,我们在此特定场景下的性能需求是完全不对称的。我们可以容忍写入端出现较大性能损耗,但希望读取端尽可能高效。

关键在于阻止导致问题的推测执行(speculative execution)。当分支预测(branch prediction)错误时,推测执行的结果会被丢弃,这也就意味着丢弃了内存中那些可能未初始化的值。如果 dispatch_once 能够在初始化值对所有 CPU 可见之后,强制产生一次分支误预测,问题就得以避免。

在谓词(predicate)上的条件分支之后,最早可能出现的推测性内存读取,与该条件分支实际被求值之间,存在一个关键时间间隔。该间隔的确切长度取决于具体 CPU 的设计,但通常最多也就低两位数的 CPU 周期。

如果写入端在写出初始化值之后,再向谓词写入最终值之前,至少等待这么长的时间,一切都会顺利。不过要确保这一点相当棘手,因为那些疯狂的乱序执行(out-of-order execution)机制会再次发挥作用。

在 Intel CPU 上,dispatch_once(一次性分发函数)滥用 cpuid 指令(用于获取 CPU 信息的指令)来达到这个目的。cpuid 指令的存在是为了获取关于 CPU 身份和能力的信息,但它也强制序列化指令流,并且执行需要相当长的时间,在一些 CPU 型号上需要数百个周期,这足以完成这项工作。如果你查看 dispatch_once 的实现,你会发现在读取侧没有任何屏障:

DISPATCH_INLINE DISPATCH_ALWAYS_INLINE DISPATCH_NONNULL_ALL DISPATCH_NOTHROW
void
_dispatch_once(dispatch_once_t *predicate, dispatch_block_t block)
{
if (DISPATCH_EXPECT(*predicate, ~0l) != ~0l) {
dispatch_once(predicate, block);
}
}
#define dispatch_once _dispatch_once

这段内容位于头文件中,并且总是内联到调用方。DISPATCH_EXPECT 是一个宏,它告诉编译器生成代码,进而指示 CPU:当 *predicate~0l 的分支是更可能执行的路径。这可以提升分支预测(branch prediction)的成功率,从而改善性能。归根结底,这只是一个普通的 if 语句,没有任何类型的内存屏障。性能测试证实了这一点:调用真正的 dispatch_once 与 0.5ns 的目标相符。

写入操作可以在实现中找到。在调用 block 之后、执行任何其他操作之前,dispatch_once 使用了如下这个宏:

dispatch_atomic_maximally_synchronizing_barrier();

在 Intel 平台上,该宏会生成 cpuid 指令,而在针对其他 CPU 架构时则会生成相应的汇编代码。

结论
多线程是一个奇异而复杂的领域,现代 CPU 设计使其更甚 —— 它们常常在你的背后进行诸多操作。内存屏障(memory barriers)允许你在真正需要特定顺序执行时告知硬件,但这需要付出代价。dispatch_once 的特殊性使其能够以非传统的方式绕过该问题:通过在相关内存写入操作间等待足够长时间,它确保读取者始终能看到一致的数据视图,而无需在每次访问时都承担内存屏障的开销。

今天的内容就到这里。下次再来探索更多精彩内容,大概会是关于 Swift 的讨论 —— 毕竟这是当前的热门话题。如果你想在这里看到某个主题(除了 “谈谈 Swift” 之外,这个主题不错但目前已经收到足够多的请求),请告诉我!


#Original (English)

Source: https://www.mikeash.com/pyblog/friday-qa-2014-06-06-secrets-of-dispatch_once.html

Reader Paul Kim pointed out an entry on Michael Tsai’s blog about making dispatch_once fast. While the comment in the dispatch_once source code is fascinating and informative, it doesn’t quite delve into the detail that some would like to see. Since this is one of my favorite hacks, for today’s article I’m going to discuss exactly what’s going on there and how it all works.

The APIThe behavior of dispatch_once is in the name. It does something once and only once.

It takes two parameters. The first is a predicate that tracks the “once”. The second is a block to execute on the first call. Calls look like this:

static dispatch_once_t predicate;
dispatch_once(&predicate, ^{
// some one-time task
});

It’s a great API for lazily initializing shared state, whether it’s a global dictionary, a singleton instance, a cache, or anything else that needs a bit of setup the first time through.

In a single-threaded world, this call would be kind of boring and could be replaced with a simple if statement. However, we live in a multithreaded world, and dispatch_once is thread safe. It’s guaranteed that multiple simultaneous calls to dispatch_once from multiple threads will only execute the block once, and all threads will wait until execution is complete before dispatch_once returns. Even that is not too hard to accomplish on your own, but dispatch_once is also extremely fast, and that is really tough to pull off.

Single-Threaded ImplementationLet’s first look at a simplistic single-threaded implementation of the function. This is just about useless in a practical sense, but it’s good to be have a concrete view of the semantics. Note that dispatch_once_t is just a typedef for long, initialized to zero, and with the meaning of other values left up to the implementation. Here’s the function:

void SimpleOnce(dispatch_once_t *predicate, dispatch_block_t block) {
if(!*predicate) {
block();
*predicate = 1;
}
}

The implementation is easy: if the predicate contains zero, call the block and set it to one. Subsequent calls will see the one and not call the block again. Exactly what we want, aside from being completely unsafe in a multithreaded environment. Two threads could hit the if statement simultaneously, causing them both to call the block, and that’s bad. Unfortunately, as is often the case, making this code thread safe means a substantial performance hit.

PerformanceWhen talking about the performance of dispatch_once, there are really three different scenarios to consider:

  • The first ever call to dispatch_once with a given predicate, which executes the block.

  • Calls to dispatch_once after the first call, but before the block finishes executing. Here, callers have to wait for the block before they proceed.

  • Calls to dispatch_once after the first call and after the block has executed. No waiting is required and they can immediately proceed.

The performance of scenario #1 is largely unimportant, as long as it’s not absurdly slow. It only happens once, after all.

The performance of scenario #2 is similarly unimportant. It could happen potentially several times, but it only happens while the block is executing. In most cases, it never happens at all. If it does, it probably only happens once. Even in a torture test where lots of threads call dispatch_once simultaneously and the block takes a long time to execute, the number of calls is still going to be limited to thousands. Those calls all have to wait for the block anyway, so it’s not that big of a deal if they burn some unnecessary CPU time while they do it.

The performance of scenario #3 is enormously important. Calls of this nature could happen potentially millions or billions of times as a program executes. We want to be able to use dispatch_once to protect one-time calculations whose results are used all over the place. Ideally, sticking dispatch_once on a computation should cost no more than explicitly doing the computation ahead of time and returning the result from some global variable. In other words, once you hit scenario #3, we really want these two chunks of code to perform identically:

id gObject;
void Compute(void) {
gObject = ...;
}
id Fetch(void) {
return gObject;
}
id DispatchFetch(void) {
static id object;
static dispatch_once_t predicate;
dispatch_once(&predicate, ^{
object = ...;
});
return object;
}

When inlined and optimized, the SimpleOnce function comes close to achieving this. In my testing, on my computer, it takes about half a nanosecond to execute. That, then, will be the gold standard for thread-safe versions.

LocksThe standard way to make something thread safe is to surround all access to shared data with a lock. It’s a little tricky here, because there isn’t a good place to put the lock next to the data it’s protecting. dispatch_once_t is just a long, and there’s no room for a lock.

We could just modify the API to take a structure that contains a lock and a flag. But in the interests of retaining the same signature as dispatch_once, and because this is just illustrative code, I decided to go with a single global lock instead. The code uses a static pthread_mutex_t to protect all accesses to the predicate. In a real program, with many different predicates, this would be a terrible idea, as unrelated predicates would still wait on each other. For a quickie example where I’m only testing with one predicate anyway, it’s fine. The code is the same as before, except with a lock around it:

void LockedOnce(dispatch_once_t *predicate, dispatch_block_t block) {
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_lock(&mutex);
if(!*predicate) {
block();
*predicate = 1;
}
pthread_mutex_unlock(&mutex);
}

This code is thread safe, but unfortunately it’s much slower. On my computer, it takes about 30 nanoseconds per call, compared to half a nanosecond for the previous version. Locks are pretty fast, but not when nanoseconds count.

SpinlocksA spinlock is a way to implement a lock that tries to minimize overhead in the lock itself. The name comes from how a spinlock implementation “spins” on the lock when it has to wait, repeatedly polling it to see if it’s been released. A normal lock will coordinate with the OS to sleep waiting threads, and wake them up when the lock is released. This saves CPU time, but the additional coordination doesn’t come for free. By cutting down on that, a spinlock saves time in the case where the lock isn’t held, at the expense of less efficient behavior when multiple threads try to take the lock at the same time.

OS X provides a convenient API for spinlocks with the OSSpinLock functions. Implementing LockedOnce with OSSpinLock is just a matter of changing out a few names:

void SpinlockOnce(dispatch_once_t *predicate, dispatch_block_t block) {
static OSSpinLock lock = OS_SPINLOCK_INIT;
OSSpinLockLock(&lock);
if(!*predicate) {
block();
*predicate = 1;
}
OSSpinLockUnlock(&lock);
}

This is a considerable improvement, clocking in at about 6.5 nanoseconds per call on my computer, versus 30 nanoseconds per call for the pthread_mutex version. However, it’s still well short of the half-nanosecond time of the unsafe version.

Atomic OperationsAtomic operations are low-level CPU operations that are always thread safe even without locks. (Technically, they use locks at the hardware level, but that’s an implementation detail.) They are what you use to implement locks in the first place. When locks carry too much overhead, directly using atomic operations can be a way to improve performance. Threaded programming without locks can be extremely tricky, so it’s not a good idea to do it unless you really, really need it. In this case, we’re talking about an OS library that could get a lot of use, so it probably qualifies.

The building block of atomic operations is the compare and swap. This is a single operation that performs the equivalent of:

BOOL CompareAndSwap(long *ptr, long testValue, long newValue) {
if(*ptr == testValue) {
*ptr = newValue;
return YES;
}
return NO;
}

In words, it tests to see if a location in memory contains a particular value, and replaces it with a new value if so. It returns whether the value matched. Because compare and swap is implemented as an atomic CPU instruction, you’re guaranteed that if multiple threads all try to do the same compare-and-swap on a memory location, only one will succeed.

The implementation strategy for this version of the function is to assign three values to the predicate. 0 means that it’s never been touched. 1 indicates that the block is currently being executed, and any callers should wait for it. 2 indicates that the block has completed and all callers can return.

A compare and swap will be used to check for 0 and atomically transition to 1 in that case. If that succeeds, then that thread is the first one to make the call, and it will then run the block. After running the block, it will set the predicate to 2 to indicate that it’s done.

If the compare and swap fails, then it will just enter a loop, checking for 2 repeatedly until it succeeds. This will cause it to wait for the other thread to finish executing the block.

The first step is to convert the predicate pointer into a volatile pointer, to convince the compiler that the value could be changed by other threads in the middle of the function:

void AtomicBuiltinsOnce(dispatch_once_t *predicate, dispatch_block_t block) {
volatile dispatch_once_t *volatilePredicate = predicate;

Then comes the compare and swap. Gcc and clang both provide various builtin functions beginning with __sync that implement atomic operations. There are also newer ones beginning with __atomic, but for this experiment I stuck with what I knew. This call does an atomic compare and swap on predicate, testing for 0 and setting it to 1 if it matches:

if(__sync_bool_compare_and_swap(volatilePredicate, 0, 1)) {

The function returns true if the operation succeeded. In this case, that means that the predicate was 0 and this call is the first one. That means that the next task is to call the block:

block();

Once the block completes, it’s time to set the predicate to 2 to indicate that fact to any waiting threads, and to any future callers. However, before that, we need a memory barrier to ensure that everyone sees the proper ordering of reads and writes. More on that in a little bit. The __sync_synchronize builtin function performs a memory barrier:

__sync_synchronize();

Then the predicate can safely be set:

*volatilePredicate = 2;
} else {

If the predicate was not 0, then enter a loop to wait for it to become 2. If it’s already 2, this loop will terminate immediately. If it’s 1, then it will sit on the loop, constantly retesting the predicate value, until it becomes 2:

while(*volatilePredicate != 2)
;

Before returning, there needs to be a memory barrier here as well, to match the barrier above. Again, more on this in a little bit:

__sync_synchronize();
}
}

This works, and should be safe. (Lockless threaded code is tricky enough that I don’t want to declare that outright without more thought than I’ve put into it already. But this is a fairly simple scenario, and in any case we’re not actually trying to build something production-worthy here.)

How’s performance? Turns out, not so good. On my computer, it takes about 20 nanoseconds per call, quite a bit higher than the spinlock version.

Early BailoutThere’s an obvious optimization that can be applied to this code. The common case is when predicate contains 2, but the code first tests for 0. By first testing for 2 and bailing out early, the common case can be made faster. The code for this is simple: add a check for 2 at the top of the function, and if it succeeds, return after a memory barrier:

void EarlyBailoutAtomicBuiltinsOnce(dispatch_once_t *predicate, dispatch_block_t block) {
if(*predicate == 2) {
__sync_synchronize();
return;
}
volatile dispatch_once_t *volatilePredicate = predicate;
if(__sync_bool_compare_and_swap(volatilePredicate, 0, 1)) {
block();
__sync_synchronize();
*volatilePredicate = 2;
} else {
while(*volatilePredicate != 2)
;
__sync_synchronize();
}
}

This is a decent improvement over the first version of the code, at about 11.5 nanoseconds per call. However, this is still much slower than the goal of half a nanosecond, and it’s even slower than the spinlock code.

Memory barriers aren’t free, and that explains why this code is so much slower than the goal. As for why it’s slower than spinlocks, there are different kinds of memory barriers available. __sync_synchronize generates an mfence instruction, which is the most paranoid one possible, working with such exotica as SSE4 streaming reads/writes, while OSSpinLock sticks with a cheaper one suitable for normal code. We could fiddle around with the precise barrier used in this code to get better performance, but it’s apparent that the cost is still going to be higher than we want, so I’ll skip over that.

Unsafe Early BailoutLet’s look at one more version of the code. It’s identical to the previous one, except it gets rid of the memory barriers:

void UnsafeEarlyBailoutAtomicBuiltinsOnce(dispatch_once_t *predicate, dispatch_block_t block) {
if(*predicate == 2)
return;
volatile dispatch_once_t *volatilePredicate = predicate;
if(__sync_bool_compare_and_swap(volatilePredicate, 0, 1)) {
block();
*volatilePredicate = 2;
} else {
while(*volatilePredicate != 2)
;
}
}

Unsurprisingly, this performs just as fast as SimpleOnce at about half a nanosecond per call. Since *predicate == 2 is by far the common case, nearly every call just performs that check and returns. This performs the same amount of work as SimpleOnce in the case where the block has already been called.

However, as the name indicates, the lack of memory barriers makes it unsafe. Why?

Branch Prediction, Out of Order Execution, and YouWe imagine that our CPUs are simple machines. We tell them to do one thing, and they do it. Then we tell them to do the next thing, and they do it. This repeats until we get bored or the power goes out.

Once upon a time, this was actually true. Old CPUs really were simple machines that worked exactly like this. They fetched an instruction, and they executed it. Then they fetched the next instruction, and they executed it.

Unfortunately, while this approach is simple, cheap, and easy, it’s also not very fast. Moore’s Law has blessed us with an ever-growing number of transistors to pile into CPUs. The 8086 was built from about 29,000 transistors. An Intel Haswell-architecture CPU contains well over a billion transistors.

The market, desiring better computer games (as well as a few people wanting to more quickly perform various boring business tasks that aren’t really worth mentioning) demanded that CPU makers use these advances for better performance. The modern CPU is the product of decades of work to turn more transistors into faster computers.

There are a lot of tricks that go into making CPUs faster. One of those is pipelining. When you look at what goes into executing a single CPU instruction, there are a lot of little steps:

1. Load the instruction from memory.
2. Decode the instruction. (That is, figure out which instruction it is, and figure out what the operands are.)
3. Load the input data.
4. Perform the operation on the input data.
5. Save the output.

On an early CPU, the sequence of work would look something like:

load instruction
decode
load data
operation
save data
load instruction
decode
load data
operation
save data
...

But as long as you have the resources for it, you can actually perform many of these actions in parallel. On a pipelined CPU, the sequence of work can end up looking more like this:

load instruction
decode load instruction
load data decode load instruction
operation load data decode
save data operation load data
save data operation
save data

This is a lot faster! With enough transistors, things can become even more complex, with many instructions executing in parallel simultaneously.

Going even beyond this, instructions can be executed completely out of order if it looks like this will speed things up. Unlike the simplified example above, real-world instructions tend to require more steps and there is a lot of variance in how many steps they take. For example, reads and writes to main memory can take an enormous amount of time. By executing other work that comes afterwards in the instruction stream, the CPU can engage in productive work instead of sitting idle. Because of this, the CPU may find it advantageous to issue memory reads and writes out of order from how the corresponding instructions appear in the instruction stream.

The end result of all this stuff is that your code doesn’t always run in the order it looks like it runs in. If you write:

x = 1;
y = 2;

Your CPU can perform the write to y first. The compiler can also reorder statements like this in some cases, but even if you eliminate that, the CPU can still do it. If you have more than one CPU in your system (as we pretty much always do these days) then other CPUs will see these out of order writes. Even if the writes are performed in order, other CPUs can perform out of order reads. Put it all together, and another thread that’s reading x and y can see y = 2 while still seeing the old value for x.

Sometimes you absolutely need these values to be written and read in the proper order, and that’s where memory barriers come in. Adding a memory barrier to the above code ensures that x is written first:

x = 1;
memory_barrier();
y = 2;

Similarly, a barrier when reading ensures that reads are performed in the proper order:

use(x);
memory_barrier();
use(y);

However, because the memory barrier’s entire purpose in life is defeating the CPU’s attempts to go faster, there is an inherent performance penalty.

This comes into play with dispatch_once and lazy initialization because there are multiple reads and writes done in sequence and their order is extremely important. For example, the typical lazy object initialization pattern looks like:

static SomeClass *obj;
static dispatch_once_t predicate;
dispatch_once(&predicate, ^{ obj = [[SomeClass alloc] init]; });
[obj doStuff];

If obj was read before predicate, then it’s possible that this code could read it when it still contains nil, right before another thread writes the final value out to the variable and sets predicate. This code could then read predicate as indicating that the job is done, and thus proceed with using the uninitialized nil. It’s even conceivable that this code could read the correct value for obj but read uninitialized values from the memory allocated for the object, causing a crash when trying to send doStuff.

Thus, dispatch_once needs a memory barrier. But as we’ve seen, memory barriers are relatively slow, and we don’t want to pay that price in the common case if we can avoid it.

Branch Prediction and Speculative ExecutionThe pipelined, out-of-order model works great for a linear sequence of instructions, but trouble arises for conditional branches. The CPU doesn’t know where to start fetching more instructions until the branch condition can be evaluated, which typically depends on the immediately preceding instructions. The CPU has to stop, wait for the previous work to finish, then evaluate the branch condition and resume execution. This is called a pipeline stall, and can cause a significant performance penalty.

In order to compensate for that, CPUs engage in speculative execution. When they see a conditional branch, they make a guess as to which way they think the branch will go. Modern CPUs have sophisticated branch prediction hardware that can typically guess correctly over 90% of the time. They begin executing instructions from the guessed branch, instead of just waiting for the branch condition to be evaluated. If it turns out that the guess was correct, then it just keeps going. If the guess was wrong, it throws away all of the speculative execution and restarts on the other branch.

It is specifically this scenario that’s at play on the read side of dispatch_once, the side that we want to make as fast as possible. There is a conditional branch on the value of predicate. The CPU should predict that the branch will take the common path, which bypasses running the block and immediately returns. During speculative execution of this branch, the CPU might load subsequent values from memory before they have been initialized by another thread. If that guess ended up being correct, then it commits the speculative execution which used uninitialized values.

Unbalancing BarriersWrite barriers generally need to be symmetrical. There’s one on the write side to ensure that writes are done in the proper order, and one on the read side to ensure that reads are done in the proper order. However, our performance needs in this particular case are completely asymmetrical. We can tolerate a huge slowdown on the write side, but we want the read side to be as fast as possible.

The trick is to defeat the speculative execution that causes the problem. When the branch prediction is incorrect, the results of the speculative execution are discarded, and that means discarding the potentially uninitialized values in memory. If dispatch_once can force a branch mispredict after the initialized values become visible to all CPUs, the problem is avoided.

There’s a critical interval of time between the earliest possible speculative memory read after the conditional branch on predicate and when the conditional branch is actually evaluated. The exact length of that interval will depend on the design of the specific CPU, but it will generally be in the low dozens of CPU cycles at most.

If the write side waits at least that much time between writing out the initialized values and writing the final value to predicate, all is well. It gets a bit tricky to ensure, though, since all of that crazy out-of-order execution comes into play once again.

On Intel CPUs, dispatch_once abuses the cpuid instruction for this purpose. The cpuid instruction exists to get information about the identity and capabilities of the CPU, but it also forcibly serializes the instruction stream and takes a decent amount of time to execute, hundreds of cycles on some CPU models, which is enough to do the job.

If you look at the implementation of dispatch_once, you’ll see no barriers whatsoever on the read side:

DISPATCH_INLINE DISPATCH_ALWAYS_INLINE DISPATCH_NONNULL_ALL DISPATCH_NOTHROW
void
_dispatch_once(dispatch_once_t *predicate, dispatch_block_t block)
{
if (DISPATCH_EXPECT(*predicate, ~0l) != ~0l) {
dispatch_once(predicate, block);
}
}
#define dispatch_once _dispatch_once

This is in the header, and always inlined into the caller. DISPATCH_EXPECT is a macro that tells the compiler to emit code that tells the CPU that the branch where *predicate is ~0l is the more likely path. This can improve the success of branch prediction, and thus improve performance. Ultimately, this is just a plain if statement with no barriers of any kind. Performance testing bears this out: calls to the real dispatch_once match the 0.5ns goal.

The write side can be found in the implementation. After calling the block and before doing anything else, dispatch_once uses this macro:

dispatch_atomic_maximally_synchronizing_barrier();

On Intel, the macro generates a cpuid instruction, and it generates the appropriate assembly when targeting other CPU architectures.

ConclusionMultithreading is a strange and complicated place, made all the more so by modern CPU designs which can do many things behind your back. Memory barriers allow you to inform the hardware when you really need things to happen in a certain order, but they come at a cost. The unique requirements of dispatch_once allow it to work around the problem in an unconventional way. By waiting a sufficiently long time between the relevant memory writes, it ensures that readers always see a consistent picture without needing to pay the cost of a memory barrier on every access.

That’s it for today. Come back next time for more exciting stuff, probably about Swift since that’s all the talk these days. If you have a topic you’d like to see covered here, and it’s something besides “talk about Swift” (which is a fine topic but has sufficient requests at this point), please send it in!