剖析 ARM64 上的 objc_msgSend

文章發布時間 2017年6月30日

作者 TommyWu

標籤

译文 · 原文： Friday Q&A 2017-06-30: Dissecting objc_msgSend on ARM64 · 作者 Mike Ash

原文：https://www.mikeash.com/pyblog/friday-qa-2017-06-30-dissecting-objc_msgsend-on-arm64.html 发布：2017-06-30　作者：Mike Ash 译者：MiMo（mimo-v2.5-pro）；代码块保留英文原样

我们回来了！在 WWDC 期间，我在 CocoaConf Next Door 演讲，其中一个讲座涉及对 objc_msgSend 的 ARM64 实现的解析。我认为将其写成文章将是回归 Friday Q & A 博客的一个好方式。

概述：每个 Objective-C 对象都有一个类，每个 Objective-C 类都有一个方法列表。每个方法都有一个「selector（选择子）」、一个指向「implementation（方法实现）」的「函数指针（function pointer）」，以及一些「元数据（metadata）」。objc_msgSend 的任务是接受传入的对象和 selector，查找相应方法的函数指针，然后跳转到该函数指针。

查找方法可能非常复杂。如果在一个类上没有找到方法，那么它需要继续在父类中搜索。如果完全没有找到方法，那么它需要调用「运行时（runtime）」的「消息转发（message forwarding）」代码。如果这是首次向特定类发送消息，那么它必须调用该类的「+initialize 方法（+initialize method）」。

在常见情况下，查找方法还需要极快，因为它针对每个方法调用都会执行。当然，这与复杂的查找过程相冲突。

Objective-C 解决这一矛盾的方法就是方法缓存。每个类都有一个缓存，其中以选择子（selector）和函数指针（即 Objective-C 中的 IMP）的配对形式存储方法。这些缓存以哈希表（hash table）的形式组织，因此查找速度很快。当查找一个方法时，运行时系统会首先查询缓存。如果该方法不在缓存中，则会遵循那个缓慢而复杂的过程，然后将结果放入缓存中，以便下次查找时能快速完成。

objc_msgSend 是用汇编语言编写的。这样做的原因有两个：其一，在 C 语言中无法编写一个既能保持未知参数不变又能跳转到任意函数指针的函数。该语言根本不具备表达这种行为所需的特性。其二，objc_msgSend 的运行速度至关重要，因此它的每一条指令都是手动编写的，以确保能达到最快速度。

自然，你不想用汇编语言编写整个复杂的消息查找（method lookup）过程。这也没必要，因为一旦开始这个过程，无论怎么做都会变慢。消息发送（message send）代码可以分为两部分：objc_msgSend 本身的快速路径（fast path）是用汇编编写的，而慢速路径（slow path）则用 C 语言实现。汇编部分会在缓存（cache）中查找方法，如果找到就跳转到该方法。如果方法不在缓存中，则调用 C 语言代码来处理后续事宜。

因此，objc_msgSend 本身执行以下操作：

获取传入对象的类（class）。
获取该类的方法缓存（method cache）。
使用传入的选择子（selector）在缓存中查找方法。
如果不在缓存中，则调用 C 语言代码。
跳转到该方法的 IMP（方法实现）。

它是如何做到所有这些的呢？让我们看看！

逐条指令讲解 objc_msgSend 根据具体情形有几条不同的执行路径。它包含特殊代码来处理诸如向 nil 发消息、tagged pointer（标记指针）以及哈希表碰撞等情况。我将从最常见、最直接的情形开始分析：当消息被发送给一个非 nil、非标记指针的对象，且方法在缓存（cache）中被直接找到而无需扫描时的情况。在讲解过程中，我会指出各个分支点的位置；待完成常见路径的讲解后，我将回溯并分析所有其他路径。

我将列出每条指令或一组指令，随后描述其功能及原因。请注意根据描述内容向上查找对应的具体指令。

每条指令前都标注了其相对于函数起始位置的偏移量。这个偏移量用作计数器，便于识别跳转目标。

ARM64 架构拥有 31 个宽度为 64 位的整数寄存器。它们用 x0 到 x30 的记号来表示。也可以通过 w0 到 w30 访问每个寄存器的低 32 位，就像访问一个单独的寄存器一样。寄存器 x0 到 x7 用于向函数传递前八个参数。这意味着 objc_msgSend 在 x0 中接收 self 参数，在 x1 中接收 selector（选择子）参数 _cmd。

我们开始吧！

1
    0x0000 cmp     x0, #0x0
2
    0x0004 b.le    0x6c

这段代码对 self 与 0 进行有符号比较，如果值小于等于零则跳转到其他位置。值为零即 nil，因此这处理了向 nil 发送消息的特殊情况。这也处理了标记指针（tagged pointers）。在 ARM64 上，标记指针通过将指针的最高位设为 1 来表示。（这与 x86-64 形成了有趣对比，后者使用的是最低位。）如果最高位被置位，那么当解释为有符号整数时，该值为负数。对于 self 是普通指针的常见情况，此分支不会被跳转执行。

1
    0x0008 ldr    x13, [x0]

这段代码通过加载 x0 寄存器所指向的 64 位数据来加载 self 的 isa，因为 x0 寄存器存储着 self 实例。现在 x13 寄存器中便包含了 isa 指针。

1
    0x000c and    x16, x13, #0xffffffff8

ARM64 架构支持使用非指针类型的 isa（isa 指针）。传统上，isa 指向对象所属的类（class），但非指针 isa 利用了其中的空闲位（spare bits），将其他信息也一并压缩存储到 isa 中。该指令通过逻辑 AND 操作来屏蔽所有这些附加位，最终将实际的类指针保留在寄存器 x16 中。

1
    0x0010 ldp    x10, x11, [x16, #0x10]

这是我最喜欢的 objc_msgSend 指令。它将类的缓存信息（cache information）加载到 x10 和 x11 寄存器中。ldp 指令（load pair，加载一对）从内存中加载两个寄存器大小的数据到前两个参数指定的寄存器中。第三个参数描述从何处加载数据，此处是从 x16 偏移 16 字节的位置开始，x16 指向的是存放缓存信息的类结构区域。缓存本身看起来像这样：

1
    typedef uint32_t mask_t;
2

3
    struct cache_t {
4
        struct bucket_t *_buckets;
5
        mask_t _mask;
6
        mask_t _occupied;
7
    }

在执行ldp指令之后，x10寄存器中存储的是_buckets的值，而x11寄存器的高 32 位存储着_occupied，低 32 位则存储着_mask。

_occupied用于表示哈希表中包含的条目数量，在objc_msgSend函数中并无实际作用。_mask则至关重要：它以一种便于进行 AND 操作的掩码形式，描述了哈希表的大小。其值总是 2 的幂次减 1，用二进制表示则类似于末尾带有可变数量 1 的000000001111111。为了计算 selector（选择子）的查找索引，并在搜索哈希表时实现环绕效果，必须使用这个值。

1
    0x0014 and    w12, w1, w11

此指令用于计算传入的 _cmd 参数对应选择子的初始哈希表索引。寄存器 x1 存储 _cmd，因此 w1 包含 _cmd 的低 32 位。如前所述，w11 存储 _mask 值。该指令将两者进行按位与操作，结果存入 w12。此运算等效于计算 _cmd % table_size，但避免了开销较大的模运算。

1
    0x0018 add    x12, x10, x12, lsl #4

索引本身还不够。要开始从表中加载数据，我们需要实际加载地址。该指令通过将表索引与表指针相加来计算此地址。它先将表索引左移 4 位，这相当于将其乘以 16，因为每个表桶（bucket）占 16 字节。此时 x12 寄存器包含待搜索的第一个桶的地址。

1
    0x001c ldp    x9, x17, [x12]

我们的朋友 ldp 再次登场。这次它从 x12 中的指针加载数据，该指针指向要搜索的桶（bucket）。每个桶包含一个 selector（选择子）和一个 IMP（方法实现）。现在 x9 存储着当前桶的选择子，而 x17 存储着方法实现。

1
    0x0020 cmp    x9, x1
2
    0x0024 b.ne   0x2c

这些指令将桶（bucket）中的选择子（selector）与寄存器 x1 中的 _cmd 进行比较。如果两者不相等，说明当前桶中没有我们正在寻找的选择子对应的条目，此时第二条指令会跳转到偏移地址 0x2c 处，该处代码负责处理不匹配的桶。如果选择子匹配成功，则说明我们已找到目标条目，程序将顺序执行下一条指令。

1
    0x0028 br    x17

这段代码会无条件跳转到 x17 寄存器中存储的地址，该地址存放着从当前哈希桶（bucket）加载得到的 IMP（方法实现指针）。从这里开始，执行将继续进入目标方法的实际实现，至此 objc_msgSend 的快速路径（fast path）执行完毕。所有参数寄存器都未被修改，因此目标方法将直接接收到所有传入的参数，就如同直接调用该方法一样。

当所有缓存都命中且条件完美契合时，在现代硬件上这条路径可以在不到 3 纳秒内执行完毕。

这是快速路径的情况，那么代码的其余部分是怎样的呢？让我们继续查看哈希桶不匹配时的代码。

1
    0x002c cbz    x9, __objc_msgSend_uncached

x9 寄存器包含从桶（bucket）中加载的选择子（selector）。这条指令将其与零进行比较，若为零则跳转至 __objc_msgSend_uncached。零选择子表示这是一个空桶，而空桶意味着搜索失败。目标方法不在缓存中，此时需要回退到执行更全面查找的 C 代码。__objc_msgSend_uncached 负责处理这一情况。否则，若桶不匹配但非空，则继续搜索。

1
    0x0030 cmp    x12, x10
2
    0x0034 b.eq   0x40

该指令比较当前桶地址（x12 寄存器中的值）与哈希表起始地址（x10 寄存器中的值）。若两者相等，则跳转至处理代码，将搜索位置重新定位到哈希表末尾。我们尚未提及，此处执行的哈希表搜索实际上是反向进行的 —— 搜索过程会检查递减的索引值，直到触及表头，随后从表尾重新开始循环。我不确定为何采用这种反向搜索而非更常见的正向递增地址并环绕回表头的方式，但可以合理推测：这种设计最终能获得更优性能。

偏移量 0x40 处处理环绕情况。否则，程序将继续执行下一条指令。

1
    0x0038 ldp    x9, x17, [x12, #-0x10]!

另一条 ldp 指令，再次加载一个缓存桶。这次，它从偏移量 0x10 处加载到当前缓存桶的地址。地址引用末尾的感叹号是一个有趣的特性，它表示一次寄存器写回（register write-back），这意味着寄存器将被更新为新计算出的值。在这个例子中，它在加载新缓存桶的同时，有效地执行了 x12 -= 16 操作，从而使 x12 指向那个新的缓存桶。

1
    0x003c b      0x20

现在新的桶已经加载完毕，执行可以继续回到检查当前桶是否匹配的代码。这段程序会跳转回上方标记为 0x0020 的指令处，并使用新获取的值重新运行整个检查流程。如果它持续找到不匹配的桶，这段代码将一直循环执行，直到发现匹配项、遇到空桶，或者触及哈希表的起始位置。

1
    0x0040 add    x12, x12, w11, uxtw #4

这是当搜索回绕时的目标。x12 包含一个指向当前 bucket（桶）的指针，在这种情况下也是第一个 bucket。w11 包含 table mask（表掩码），即表的大小。这将两者相加，同时将 w11 左移 4 位，乘以 16。结果是 x12 现在指向表的末尾，搜索可以从那里恢复。

1
    0x0044 ldp    x9, x17, [x12]

现在熟悉的 ldp 指令将新的 bucket 加载到 x9 和 x17 中。

1
    0x0048 cmp    x9, x1
2
    0x004c b.ne   0x54
3
    0x0050 br     x17

这段代码用于检查 bucket 是否匹配，并跳转到该 bucket 的 IMP（方法实现）。这是对上方 0x0020 处代码的重复。

1
    0x0054 cbz    x9, __objc_msgSend_uncached

就像之前一样，如果缓存桶是空的，那么这就是一次缓存未命中，程序会继续执行用 C 语言实现的完整查找代码。

1
    0x0058 cmp    x12, x10
2
    0x005c b.eq   0x68

再次进行回绕（wraparound）检查，如果已第二次到达表格起点，则跳转至 0x68。此处将跳转至用 C 实现的综合查找代码：

1
    0x0068 b      __objc_msgSend_uncached

这种情况本不应发生。哈希表（Hash tables）会随着条目增加而扩容，永远不会达到 100% 满载。当哈希表过于饱和时，由于冲突变得过于频繁，其效率会显著降低。

那么这里为什么需要这段代码？源代码中的注释解释道：

克隆扫描循环以避免缓存损坏时挂起。慢速路径可能稍后检测到任何损坏并停止。

（译注：该注释重复出现，可能是代码历史遗留或强调说明）

我怀疑这种情况并不常见，但显然苹果公司的工程师们曾遇到过内存损坏导致缓存被错误条目填满的情况，而跳转到 C 代码（C code）可以改善诊断信息。

对于未遭遇此类损坏的代码，此检查的存在应影响甚微。若没有它，原始循环可被复用，从而节省少量指令缓存空间（instruction cache space），但效果微乎其微。这个环绕处理程序（wraparound handler）本身并非常见情况。它仅会被那些被排序到哈希表起始位置附近的选择子（selectors）触发，并且仅在发生冲突且所有前置条目均被占用时才会被调用。

1
    0x0060 ldp    x9, x17, [x12, #-0x10]!
2
    0x0064 b      0x48

循环的剩余部分与之前相同。将下一个桶加载到 x9 和 x17 寄存器中，更新 x12 中的桶指针，然后跳回循环顶部。

至此，objc_msgSend 函数主体部分结束。剩余的是针对 nil 和标记指针（tagged pointer）的特殊情况处理。

标记指针处理器 你会记得最开始的指令检测了这两种情况，并跳转到偏移量 0x6c 处进行处理。我们从那里继续：

1
    0x006c b.eq    0xa4

我们到达这里是因为 self 小于等于零。小于零表示一个 tagged pointer（标记指针），而零表示 nil。这两种情况的处理方式完全不同，因此代码首先检查 self 是否为 nil。如果 self 等于零，那么该指令会跳转到 0xa4，即 nil 处理逻辑所在的位置。否则，它就是一个 tagged pointer，执行将继续到下一条指令。

在继续之前，我们简要讨论一下 tagged pointer 的工作原理。Tagged pointer 支持多个类。标记指针的高四位（在 ARM64 上）指示该 “对象” 属于哪个类。它们本质上是 tagged pointer 的 isa。当然，四位不足以容纳一个类指针。相反，有一个特殊的表存储了可用的 tagged pointer 类。一个 tagged pointer “对象” 的类是通过查找该表中与高四位对应的索引来找到的。

这还不是全部。Tagged 指针（至少在 ARM64 上）也支持扩展类（extended classes）。当最高的四位全部设置为 1 时，接下来的八位将被用作索引，用于访问一个扩展的 tagged 指针类表。这使得运行时能够支持更多的 tagged 指针类，代价是这些类的可用存储空间变少了。

我们继续。

1
    0x0070 mov    x10, #-0x1000000000000000

这会将 x10 设置为一个整数值，其高位四个比特置 1，其余所有比特置零。这个值将用作掩码，以从 self 中提取标签位（tag bits）。

1
    0x0074 cmp    x0, x10
2
    0x0078 b.hs   0x90

这段代码用于检查一个扩展 tagged 指针（extended tagged pointer）。如果self的值大于或等于寄存器x10中的值，那么就意味着高 4 位已被全部置位。在那种情况下，程序将跳转至地址0x90以处理扩展类（extended classes）。否则，将使用主要的 tagged 指针表。

1
    0x007c adrp   x10, _objc_debug_taggedpointer_classes@PAGE
2
    0x0080 add    x10, x10, _objc_debug_taggedpointer_classes@PAGEOFF

这段代码展示了如何加载 _objc_debug_taggedpointer_classes（主带标签指针表）的地址。ARM64 架构需要两条指令来加载一个符号的地址。这是类 RISC 架构上的标准技术。在 ARM64 上，指针宽度为 64 位，而指令宽度仅为 32 位。一条指令无法容纳整个指针值。

x86 架构则不存在这个问题，因为它采用可变长度指令。它可以使用一条 10 字节的指令，其中两字节标识指令本身和目标寄存器，八字节存放指针值。

在具有固定长度指令的机器上，你需要分块加载值。此处仅需两块：adrp 指令加载值的高位部分，随后 add 指令加上低位部分。

1
    0x0084 lsr    x11, x0, #60

「tagged class index（标签类索引）」位于 x0 的高四位中。要将其用作索引，需要右移 60 位，使其成为 0 到 15 范围内的整数。这条指令执行该移位操作，并将索引放入 x11 中。

1
    0x0088 ldr    x16, [x10, x11, lsl #3]

这段代码利用 x11 中的索引值，从 x10 指向的表中加载对应条目。此时 x16 寄存器已包含这个标记指针（tagged pointer）的类信息。

1
    0x008c b      0x10

当类指针存入 x16 后，我们可以分支回主代码。从偏移量 0x10 开始的代码假定类指针已加载至 x16 寄存器，并从此处执行派发（dispatch）。因此，标签指针（tagged pointer）处理器可以直接分支回该代码，而无需在此处重复逻辑。

1
    0x0090 adrp   x10, _objc_debug_taggedpointer_ext_classes@PAGE
2
    0x0094 add    x10, x10, _objc_debug_taggedpointer_ext_classes@PAGEOFF

扩展标签类处理器的结构看起来类似。这两条指令加载了指向扩展表的指针。

1
    0x0098 ubfx   x11, x0, #52, #8

此指令用于加载扩展类索引。它从 self 的第 52 位开始提取 8 位并存入 x11。

1
    0x009c ldr    x16, [x10, x11, lsl #3]

就像之前一样，该索引用于在表中查找类并将其加载到 x16。

1
    0x00a0 b      0x10

现在类地址已保存在 x16 寄存器中，它可以分支跳回到主代码。

至此几乎全部讲完，剩下的就是 nil 处理器。

nil 处理器

终于到了 nil 处理器。完整代码如下：

1
    0x00a4 mov    x1, #0x0
2
    0x00a8 movi   d0, #0000000000000000
3
    0x00ac movi   d1, #0000000000000000
4
    0x00b0 movi   d2, #0000000000000000
5
    0x00b4 movi   d3, #0000000000000000
6
    0x00b8 ret

与其余代码完全不同的 nil 处理程序，既不进行类查找（class lookup），也不执行方法派发（method dispatch）。对于 nil 对象，它唯一的工作就是向调用者返回 0。

这个任务有点复杂，因为 objc_msgSend 不知道调用者期望何种返回值。该方法返回的是一个整数、两个整数、一个浮点值，还是什么也不返回？

幸运的是，即便返回值寄存器未被用于特定调用的返回，也可以安全地覆盖它们。整数返回值存储在 x0 和 x1 中，浮点返回值存储在向量寄存器 v0 至 v3 中。对于较小的结构体返回值，则会使用多个寄存器。

这段代码清除了 x1 以及 v0 到 v3。d0 到 d3 寄存器对应着相应的 v 寄存器的低半部分，向 d 寄存器存储会清除 v 寄存器的高半部分，因此四条 movi 指令的效果就是清除了那四个寄存器。完成这些操作后，控制权将返回给调用者。

你或许会疑惑为何这段代码不清除 x0。答案很简单：x0 在此场景中保存着 self（即 nil），所以它已经是零值！由于 x0 已包含我们所需的值，省略这条清除指令可以节省一次操作。

那么对于无法存入寄存器的大型结构体返回值呢？这需要调用方进行一些配合。大型结构体的返回机制是：由调用方预先分配足够容纳返回值的内存，然后将该内存地址存入 x8 寄存器传递。被调用函数随后向该内存区域写入返回值。objc_msgSend 无法清除这块内存，因为它不知道返回值的具体大小。为此，编译器会生成在调用 objc_msgSend 前将内存清零的代码。

至此，nil 处理流程以及整个 objc_msgSend 的讲解就结束了。

结论

深入探索框架内部机制总是充满趣味。特别是 objc_msgSend，堪称一件艺术品，阅读其源码令人愉悦。

今天的内容就到这里。下次我们将继续探索更多精妙的底层细节。周五问答系列由读者提问驱动，如果你有希望在此探讨的话题，请随时投稿！

#Original (English)

Source: https://www.mikeash.com/pyblog/friday-qa-2017-06-30-dissecting-objc_msgsend-on-arm64.html

We’re back! During the week of WWDC, I spoke at CocoaConf Next Door, and one of my talks involved a dissection of objc_msgSend’s ARM64 implementation. I thought that turning it into an article would make for a nice return to blogging for Friday Q&A.

OverviewEvery Objective-C object has a class, and every Objective-C class has a list of methods. Each method has a selector, a function pointer to the implementation, and some metadata. The job of objc_msgSend is to take the object and selector that’s passed in, look up the corresponding method’s function pointer, and then jump to that function pointer.

Looking up a method can be extremely complicated. If a method isn’t found on a class, then it needs to continue searching in the superclasses. If no method is found at all, then it needs to call into the runtime’s message forwarding code. If this is the very first message being sent to a particular class, then it has to call that class’s +initialize method.

Looking up a method also needs to be extremely fast in the common case, since it’s done for every method call. This, of course, is in conflict with the complicated lookup process.

Objective-C’s solution to this conflict is the method cache. Each class has a cache which stores methods as pairs of selectors and function pointers, known in Objective-C as IMPs. They’re organized as a hash table so lookups are fast. When looking up a method, the runtime first consults the cache. If the method isn’t in the cache, it follows the slow, complicated procedure, and then places the result into the cache so that the next time can be fast.

objc_msgSend is written in assembly. There are two reasons for this: one is that it’s not possible to write a function which preserves unknown arguments and jumps to an arbitrary function pointer in C. The language just doesn’t have the necessary features to express such a thing. The other reason is that it’s extremely important for objc_msgSend to be fast, so every last instruction of it is written by hand so it can go as fast as possible.

Naturally, you don’t want to write the whole complicated message lookup procedure in assembly langauge. It’s not necessary, either, because things are going to be slow no matter what the moment you start going through it. The message send code can be divided into two parts: there’s the fast path in objc_msgSend itself, which is written in assembly, and the slow path implemented in C. The assembly part looks up the method in the cache and jump to it if it’s found. If the method is not in the cache, then it calls into the C code to handle things.

Therefore, when looking at objc_msgSend itself, it does the following:

Get the class of the object passed in.
Get the method cache of that class.
Use the selector passed in to look up the method in the cache.
If it’s not in the cache, call into the C code.
Jump to the IMP for the method.

How does it do all of that? Let’s see!

Instruction by Instructionobjc_msgSend has a few different paths it can take depending on circumstances. It has special code for handling things like messages to nil, tagged pointers, and hash table collisions. I’ll start by looking at the most common, straight-line case where a message is sent to a non-nil, non-tagged pointer and the method is found in the cache without any need to scan. I’ll note the various branching-off points as we go through them, and then once we’re done with the common path I’ll circle back and look at all of the others.

I’ll list each instruction or group of instructions followed by a description of what it does and why. Just remember to look up to find the instruction any given piece of text is discussing.

Each instruction is preceded by its offset from the beginning of the function. This serves as a counter, and lets you identify jump targets.

ARM64 has 31 integer registers which are 64 bits wide. They’re referred to with the notation x0 through x30. It’s also possible to access the lower 32 bits of each register as if it were a separate register, using w0 through w30. Registers x0 through x7 are used to pass the first eight parameters to a function. That means that objc_msgSend receives the self parameter in x0 and the selector _cmd parameter in x1.

Let’s begin!

1
    0x0000 cmp     x0, #0x0
2
    0x0004 b.le    0x6c

This performs a signed comparison of self with 0 and jumps elsewhere if the value is less than or equal to zero. A value of zero is nil, so this handles the special case of messages to nil. This also handles tagged pointers. Tagged pointers on ARM64 are indicated by setting the high bit of the pointer. (This is an interesting contrast with x86-64, where it’s the low bit.) If the high bit is set, then the value is negative when interpreted as a signed integer. For the common case of self being a normal pointer, the branch is not taken.

1
    0x0008 ldr    x13, [x0]

This loads self’s isa by loading the 64-bit quantity pointed to by x0, which contains self. The x13 register now contains the isa.

1
    0x000c and    x16, x13, #0xffffffff8

ARM64 can use non-pointer isas. Traditionally the isa points to the object’s class, but non-pointer isa takes advantage of spare bits by cramming some other information into the isa as well. This instruction performs a logical AND to mask off all the extra bits, and leaves the actual class pointer in x16.

1
    0x0010 ldp    x10, x11, [x16, #0x10]

This is my favorite instruction in objc_msgSend. It loads the class’s cache information into x10 and x11. The ldp instruction loads two registers’ worth of data from memory into the registers named in the first two arguments. The third argument describes where to load the data, in this case at offset 16 from x16, which is the area of the class which holds the cache information. The cache itself looks like this:

1
    typedef uint32_t mask_t;
2

3
    struct cache_t {
4
        struct bucket_t *_buckets;
5
        mask_t _mask;
6
        mask_t _occupied;
7
    }

Following the ldp instruction, x10 contains the value of _buckets, and x11 contains _occupied in its high 32 bits, and _mask in its low 32 bits.

_occupied specifies how many entries the hash table contains, and plays no role in objc_msgSend. _mask is important: it describes the size of the hash table as a convenient AND-able mask. Its value is always a power of two minus 1, or in binary terms something that looks like 000000001111111 with a variable number of 1s at the end. This value is needed to figure out the lookup index for a selector, and to wrap around the end when searching the table.

1
    0x0014 and    w12, w1, w11

This instruction computes the starting hash table index for the selector passed in as _cmd. x1 contains _cmd, so w1 contains the bottom 32 bits of _cmd. w11 contains _mask as mentioned above. This instruction ANDs the two together and places the result into w12. The result is the equivalent of computing _cmd % table_size but without the expensive modulo operation.

1
    0x0018 add    x12, x10, x12, lsl #4

The index is not enough. To start loading data from the table, we need the actual address to load from. This instruction computes that address by adding the table index to the table pointer. It shifts the table index left by 4 bits first, which multiplies it by 16, because each table bucket is 16 bytes. x12 now contains the address of the first bucket to search.

1
    0x001c ldp    x9, x17, [x12]

Our friend ldp makes another appearance. This time it’s loading from the pointer in x12, which points to the bucket to search. Each bucket contains a selector and an IMP. x9 now contains the selector for the current bucket, and x17 contains the IMP.

1
    0x0020 cmp    x9, x1
2
    0x0024 b.ne   0x2c

These instructions compare the bucket’s selector in x9 with _cmd in x1. If they’re not equal then this bucket does not contain an entry for the selector we’re looking for, and in that case the second instruction jumps to offset 0x2c, which handles non-matching buckets. If the selectors do match, then we’ve found the entry we’re looking for, and execution continues with the next instruction.

1
    0x0028 br    x17

This performs an unconditional jump to x17, which contains the IMP loaded from the current bucket. From here, execution will continue in the actual implementation of the target method, and this is the end of objc_msgSend’s fast path. All of the argument registers have been left undisturbed, so the target method will receive all passed in arguments just as if it had been called directly.

When everything is cached and all the stars align, this path can execute in less than 3 nanoseconds on modern hardware.

That’s the fast path, how about the rest of the code? Let’s continue with the code for a non-matching bucket.

1
    0x002c cbz    x9, __objc_msgSend_uncached

x9 contains the selector loaded from the bucket. This instruction compares it with zero and jumps to __objc_msgSend_uncached if it’s zero. A zero selector indicates an empty bucket, and an empty bucket means that the search has failed. The target method isn’t in the cache, and it’s time to fall back to the C code that performs a more comprehensive lookup. __objc_msgSend_uncached handles that. Otherwise, the bucket doesn’t match but isn’t empty, and the search continues.

1
    0x0030 cmp    x12, x10
2
    0x0034 b.eq   0x40

This instruction compares the current bucket address in x12 with the beginning of the hash table in x10. If they match, it jumps to code that wraps the search back to the end of the hash table. We haven’t seen it yet, but the hash table search being performed here actually runs backwards. The search examines decreasing indexes until it hits the beginning of the table, then it starts over at the end. I’m not sure why it works this way rather than the more common approach of increasing addresses that wrap to the beginning, but it’s a safe bet that it’s because it ends up being faster this way.

Offset 0x40 handles the wraparound case. Otherwise, execution proceeds to the next instruction.

1
    0x0038 ldp    x9, x17, [x12, #-0x10]!

Another ldp, once again loading a cache bucket. This time, it loads from offset 0x10 to the address of the current cache bucket. The exclamation point at the end of the address reference is an interesting feature. This indicates a register write-back, which means that the register is updated with the newly computed value. In this case, it’s effectively doing x12 -= 16 in addition to loading the new bucket, which makes x12 point to that new bucket.

1
    0x003c b      0x20

Now that the new bucket is loaded, execution can resume with the code that checks to see if the current bucket is a match. This loops back up to the instruction labeled 0x0020 above, and runs through all of that code again with the new values. If it continues to find non-matching buckets, this code will keep running until it finds a match, an empty bucket, or hits the beginning of the table.

1
    0x0040 add    x12, x12, w11, uxtw #4

This is the target for when the search wraps. x12 contains a pointer to the current bucket, which in this case is also the first bucket. w11 contains the table mask, which is the size of the table. This adds the two together, while also shifting w11 left by 4 bits, multiplying it by 16. The result is that x12 now points to the end of the table, and the search can resume from there.

1
    0x0044 ldp    x9, x17, [x12]

The now-familiar ldp loads the new bucket into x9 and x17.

1
    0x0048 cmp    x9, x1
2
    0x004c b.ne   0x54
3
    0x0050 br     x17

This code checks to see if the bucket matches and jumps to the bucket’s IMP. It’s a duplicate of the code at 0x0020 above.

1
    0x0054 cbz    x9, __objc_msgSend_uncached

Just like before, if the bucket is empty then it’s a cache miss and execution proceeds into the comprehensive lookup code implemented in C.

1
    0x0058 cmp    x12, x10
2
    0x005c b.eq   0x68

This checks for wraparound again, and jumps to 0x68 if we’ve hit the beginning of the table a second time. In this case, it jumps into the comprehensive lookup code implemented in C:

1
    0x0068 b      __objc_msgSend_uncached

This is something that should never actually happen. The table grows as entries are added to it, and it’s never 100% full. Hash tables become inefficient when they’re too full because collisions become too common.

Why is this here? A comment in the source code explains:

Clone scanning loop to miss instead of hang when cache is corrupt. The slow path may detect any corruption and halt later.

I doubt that this is common, but evidently the folks at Apple have seen memory corruption which caused the cache to be filled with bad entries, and jumping into the C code improves the diagnostics.

The existence of this check should have minimal impact on code that doesn’t suffer from this corruption. Without it, the original loop could be reused, which would save a bit of instruction cache space, but the effect is minimal. This wraparound handler is not the common case anyway. It will only be invoked for selectors that get sorted near the beginning of the hash table, and then only if there’s a collision and all the prior entries are occupied.

1
    0x0060 ldp    x9, x17, [x12, #-0x10]!
2
    0x0064 b      0x48

The remainder of this loop is the same as before. Load the next bucket into x9 and x17, update the bucket pointer in x12, and go back to the top of the loop.

That’s the end of the main body of objc_msgSend. What remains are special cases for nil and tagged pointers.

Tagged Pointer HandlerYou’ll recall that the very first instructions checked for those and jumped to offset 0x6c to handle them. Let’s continue from there:

1
    0x006c b.eq    0xa4

We’ve arrived here because self is less than or equal to zero. Less than zero indicates a tagged pointer, and zero is nil. The two cases are handled completely differently, so the first thing the code does here is check to see whether self is nil or not. If self is equal to zero then this instruction branches to 0xa4, which is where the nil handler lives. Otherwise, it’s a tagged pointer, and execution continues with the next instruction.

Before we move on, let’s briefly discuss how tagged pointers work. Tagged pointers support multiple classes. The top four bits of the tagged pointer (on ARM64) indicate which class the “object” is. They are essentially the tagged pointer’s isa. Of course, four bits isn’t nearly enough to hold a class pointer. Instead, there’s a special table which stores the available tagged pointer classes. The class of a tagged pointer “object” is found by looking up the index in that table which corresponds to the top four bits.

This isn’t the whole story. Tagged pointers (at least on ARM64) also support extended classes. When the top four bits are all set to 1 the next eight bits are used to index into an extended tagged pointer class table. This allows the runtime to support more tagged pointer classes, at the cost of having less storage for them.

Let’s continue.

1
    0x0070 mov    x10, #-0x1000000000000000

This sets x10 to an integer value with the top four bits set and all other bits set to zero. This will serve as a mask to extract the tag bits from self.

1
    0x0074 cmp    x0, x10
2
    0x0078 b.hs   0x90

This checks for an extended tagged pointer. If self is greater than or equal to the value in x10, then that means the top four bits are all set. In that case, branch to 0x90 which will handle extended classes. Otherwise, use the primary tagged pointer table.

1
    0x007c adrp   x10, _objc_debug_taggedpointer_classes@PAGE
2
    0x0080 add    x10, x10, _objc_debug_taggedpointer_classes@PAGEOFF

This little song and dance loads the address of _objc_debug_taggedpointer_classes, which is the primary tagged pointer table. ARM64 requires two instructions to load the address of a symbol. This is a standard technique on RISC-like architectures. Pointers on ARM64 are 64 bits wide, and instructions are only 32 bits wide. It’s not possible to fit an entire pointer into one instruction.

x86 doesn’t suffer from this problem, since it has variable-length instructions. It can just use a 10-byte instruction, where two bytes identify the instruction itself and the target register, and eight bytes hold the pointer value.

On a machine with fixed-length instructions, you load the value in pieces. In this case, only two pieces are needed. The adrp instruction loads the top part of the value, and the add then adds in the bottom part.

1
    0x0084 lsr    x11, x0, #60

The tagged class index is in the top four bits of x0. To use it as an index, it has to be shifted right by 60 bits so it becomes an integer in the range 0-15. This instruction performs that shift and places the index into x11.

1
    0x0088 ldr    x16, [x10, x11, lsl #3]

This uses the index in x11 to load the entry from the table that x10 points to. The x16 register now contains the class of this tagged pointer.

1
    0x008c b      0x10

With the class in x16, we can now branch back to the main code. The code starting with offset 0x10 assumes that the class pointer is loaded into x16 and performs dispatch from there. The tagged pointer handler can therefore just branch back to that code rather than duplicating logic here.

1
    0x0090 adrp   x10, _objc_debug_taggedpointer_ext_classes@PAGE
2
    0x0094 add    x10, x10, _objc_debug_taggedpointer_ext_classes@PAGEOFF

The extended tagged class handler looks similar. These two instructions load the pointer to the extended table.

1
    0x0098 ubfx   x11, x0, #52, #8

This instruction loads the extended class index. It extracts 8 bits starting from bit 52 in self into x11.

1
    0x009c ldr    x16, [x10, x11, lsl #3]

Just like before, that index is used to look up the class in the table and load it into x16.

1
    0x00a0 b      0x10

With the class in x16, it can branch back into the main code.

That’s nearly everything. All that remains is the nil handler.

nil HandlerFinally we get to the nil handler. Here it is, in its entirety.

1
    0x00a4 mov    x1, #0x0
2
    0x00a8 movi   d0, #0000000000000000
3
    0x00ac movi   d1, #0000000000000000
4
    0x00b0 movi   d2, #0000000000000000
5
    0x00b4 movi   d3, #0000000000000000
6
    0x00b8 ret

The nil handler is completely different from the rest of the code. There’s no class lookup or method dispatch. All it does for nil is return 0 to the caller.

This task is a bit complicated by the fact that objc_msgSend doesn’t know what kind of return value the caller expects. Is this method returning one integer, or two, or a floating-point value, or nothing at all?

Fortunately, all of the registers used for return values can be safely overwritten even if they’re not being used for this particular call’s return value. Integer return values are stored in x0 and x1 and floating point return values are stored in vector registers v0 through v3. Multiple registers are used for returning smaller structs.

This code clears x1 and v0 through v3. The d0 through d3 registers refer to the bottom half of the corresponding v registers, and storing into them clears the top half, so the effect of the four movi instructions is to clear those four registers. After doing this, it returns control to the caller.

You might wonder why this code doesn’t clear x0. The answer to that is simple: x0 holds self which in this case is nil, so it’s already zero! You can save an instruction by not clearing x0 since it already holds the value we want.

What about larger struct returns that don’t fit into registers? This requires a little cooperation from the caller. Large struct returns are performed by having the caller allocate enough memory for the return value, and then passing the address of that memory in x8. The function then writes to that memory to return a value. objc_msgSend can’t clear this memory, because it doesn’t know how big the return value is. To solve this, the compiler generates code which fills the memory with zeroes before calling objc_msgSend.

That’s the end of the nil handler, and of objc_msgSend as a whole.

ConclusionIt’s always interesting to dive into framework internals. objc_msgSend in particular is a work of art, and delightful to read through.

That’s it for today. Come back next time for more squishy goodness. Friday Q&A is driven by reader input, so if you have something you’d like to see discussed here, send it in!