Concurrent Memory Deallocation in the Objective-C Runtime

Published: May 29, 2015

Author: TommyWu


Translation · Original: Friday Q&A 2015-05-29: Concurrent Memory Deallocation in the Objective-C Runtime · By Mike Ash

Original: https://www.mikeash.com/pyblog/friday-qa-2015-05-29-concurrent-memory-deallocation-in-the-objective-c-runtime.html Published: 2015-05-29 Author: Mike Ash Translator: MiMo (mimo-v2.5-pro); code blocks are kept as in the English original

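The Objective-C runtime is at the heart of much Mac and iOS code. At the heart of the runtime is the objc_msgSend function, and at the heart of that is the method cache. Today I'm going to explore how Apple manages resizing and deallocating method cache memory in a thread-safe manner without impacting performance, using a technique you probably won't find in textbooks on thread safety.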

Message Sending in Concept

objc_msgSend works by looking up the appropriate method implementation for the message being sent, and then jumping to it. Conceptually, the lookup works like this:

    IMP lookUp(id obj, SEL selector) {
        Class c = object_getClass(obj);

        while(c) {
            for(int i = 0; i < c->numMethods; i++) {
                Method m = c->methods[i];
                if(m.selector == selector) {
                    return m.imp;
                }
            }

            c = c->superclass;
        }

        return _objc_msgForward;
    }

Some names have been changed to protect the innocent. If you'd like to see the real code, check out the Objective-C runtime source code:

http://www.opensource.apple.com/source/objc4/

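Method Cache

Most Objective-C code sends messages all over the place. If the full message search were performed for each one, it would be unbelievably slow.

The solution is a cache. Each class has a hash table attached to it which maps selectors to method implementations. The hash table is built for maximum read efficiency, and objc_msgSend uses carefully tuned assembly language code to perform the lookup quickly. This gets a message send in the cached case down to single-digit nanoseconds. The first send of any given message is still unbelievably slow, but after that it's fast.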
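To make the idea of a selector-to-implementation hash table concrete, here is a minimal sketch in C. It is not the runtime's actual data structure: the type names (sel_t, imp_t, cache_t), the hash function, and the linear-probing strategy are all assumptions chosen for illustration, and the real lookup is hand-written assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the runtime's SEL and IMP types. */
typedef const char *sel_t;      /* selectors are compared by pointer identity */
typedef void (*imp_t)(void);

typedef struct {
    sel_t sel;                  /* NULL marks an empty slot */
    imp_t imp;
} bucket_t;

typedef struct {
    bucket_t *buckets;
    uint32_t  mask;             /* capacity - 1; capacity is a power of two */
} cache_t;

/* Two do-nothing implementations, used only as distinguishable values. */
static void demo_imp_a(void) {}
static void demo_imp_b(void) {}

/* Hash the selector's address into a slot index. */
static uint32_t cache_hash(sel_t sel, uint32_t mask) {
    return (uint32_t)((uintptr_t)sel >> 2) & mask;
}

/* Returns the cached implementation, or NULL on a miss. */
static imp_t cache_lookup(const cache_t *cache, sel_t sel) {
    uint32_t i = cache_hash(sel, cache->mask);
    for (uint32_t probes = 0; probes <= cache->mask; probes++) {
        const bucket_t *b = &cache->buckets[i];
        if (b->sel == sel)  return b->imp;   /* hit */
        if (b->sel == NULL) return NULL;     /* empty slot: not cached */
        i = (i + 1) & cache->mask;           /* linear probing */
    }
    return NULL;                             /* table full, selector absent */
}

/* Stores an entry; returns 1 on success, 0 if the table is full. */
static int cache_insert(cache_t *cache, sel_t sel, imp_t imp) {
    assert(sel != NULL);
    uint32_t i = cache_hash(sel, cache->mask);
    for (uint32_t probes = 0; probes <= cache->mask; probes++) {
        bucket_t *b = &cache->buckets[i];
        if (b->sel == NULL || b->sel == sel) {
            b->sel = sel;
            b->imp = imp;
            return 1;
        }
        i = (i + 1) & cache->mask;
    }
    return 0;
}
```

A caller would try cache_lookup first and fall back to the full class-hierarchy search only on a miss, inserting the result so that the next send of the same selector takes the fast path.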

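When we think of a cache, it's usually something with a limited size that's intended to speed up repeated accesses to recently used resources. For example, you might cache images that you load from the network so that two fetches in quick succession don't hit the network twice. You don't want to use too much memory, though, so you might cap the number of images kept in the cache and throw away the oldest image when a new one arrives after the cache fills up.

This is a fine approach for many problems, but it can have unfortunate performance implications. For example, if you set your image cache to store 40 images and your application ends up constantly cycling through 41 images, your cache suddenly becomes completely useless.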
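The 40-versus-41 case can be checked with a small simulation. The sketch below assumes a cache that evicts the oldest entry first (FIFO) and keys requested in a fixed round-robin order; the function name count_hits and the MAX_CAPACITY limit are made up for this example.

```c
#include <assert.h>

#define MAX_CAPACITY 64

/* Counts cache hits when `distinct` keys are requested round-robin for
   `rounds` full passes against a cache that holds `capacity` keys and
   evicts the oldest entry when full. */
static int count_hits(int capacity, int distinct, int rounds) {
    int slots[MAX_CAPACITY];
    int used = 0;     /* number of occupied slots */
    int oldest = 0;   /* index of the entry to evict next */
    int hits = 0;

    assert(capacity >= 1 && capacity <= MAX_CAPACITY);

    for (int r = 0; r < rounds; r++) {
        for (int key = 0; key < distinct; key++) {
            int found = 0;
            for (int i = 0; i < used; i++) {
                if (slots[i] == key) { found = 1; break; }
            }
            if (found) { hits++; continue; }

            if (used < capacity) {
                slots[used++] = key;             /* still room: append */
            } else {
                slots[oldest] = key;             /* evict the oldest entry */
                oldest = (oldest + 1) % capacity;
            }
        }
    }
    return hits;
}
```

With a capacity of 40, cycling through 40 keys hits on every access after the first pass, while cycling through 41 keys never hits at all: the entry evicted to make room is always the one that will be requested next.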

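For our own apps we can test and tune the caches to avoid this, but the Objective-C runtime doesn't have that option. Because the method cache is so critical to performance, and because each entry is relatively small, the runtime doesn't impose any size limit on the caches and expands them as necessary to hold every message that has been sent.

Note that the caches do sometimes get flushed. Any time something happens that might make the cached data stale, such as loading new code into the process or modifying a class's method lists, the affected caches are destroyed and allowed to refill.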

Resizing, Deallocation, and Threads

Resizing the cache is pretty simple in concept. It looks something like this:

    bucket_t *newCache = malloc(newSize);
    copyEntries(newCache, class->cache);
    free(class->cache);
    class->cache = newCache;

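The Objective-C runtime actually takes a small shortcut here: it doesn't even copy the old entries into the new cache! It's just a cache, after all, and there's no requirement to preserve the data it contains. Entries refill as messages are sent. So it's really just:

    free(class->cache);
    class->cache = malloc(newSize);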

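In a single-threaded environment, this would be all you need, and this article would be short. But the Objective-C runtime has to support multithreaded code, which means all of this code has to be thread safe. Any given class's cache can be accessed from multiple threads simultaneously, so the code must tolerate that scenario.

As written here, it doesn't. There's a window after freeing the old cache and before assigning the new one in which another thread might access an invalid cache pointer. That thread could see garbage data, or crash outright if the underlying memory has been unmapped.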

How can we solve this problem? The typical approach to protecting shared data like this is to use a lock. The code would then look like:

    lock(class->lock);
    free(class->cache);
    class->cache = malloc(newSize);
    unlock(class->lock);

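For this to work, all accesses must be gated by the lock, including reads. That means objc_msgSend would have to acquire the lock, look in the cache, and release the lock. Considering that the cache lookup itself takes only a few nanoseconds, acquiring and releasing a lock on every send would add a lot of overhead. The performance impact is just too high.

We might try to close the window in some other way. For example, what if we allocated and assigned the new cache first, and only then deallocated the old one?

    bucket_t *oldCache = class->cache;
    class->cache = malloc(newSize);
    free(oldCache);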

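This helps, but it doesn't solve the problem. Another thread might fetch the old cache pointer and then get preempted by the system before it can access the contents. The old cache could then be destroyed before that thread runs again, causing the same problems as before.

What if we put in a delay? Something like:

    bucket_t *oldCache = class->cache;
    class->cache = malloc(newSize);
    after(5 /* seconds */, ^{
        free(oldCache);
    });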

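This is almost certain to work. But it's still conceivable that a thread gets preempted at exactly the wrong moment and stays preempted long enough for the five-second delay to fire first. This makes the crash extremely unlikely, but doesn't completely eliminate it.

Rather than an arbitrary delay, how about waiting until the window is surely clear? Let's add a counter to objc_msgSend so that it looks something like:

    gInMsgSend++;
    lookUpCache(class->cache);
    gInMsgSend--;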

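A proper thread-safe version would need to use atomics for the counter, along with appropriate memory barriers to make sure the dependent loads and stores become visible in the right order. For the purposes of this article, just imagine that stuff is there.

With the counter, cache reallocation would look like:

    bucket_t *oldCache = class->cache;
    class->cache = malloc(newSize);
    while(gInMsgSend)
        ; // spin
    free(oldCache);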

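Note that objc_msgSend does not need to be blocked for this to work. Once the cache-freeing code has observed, at any single moment after it replaced the cache pointer, that no thread is inside objc_msgSend, it can go ahead and free the old cache. Another thread might call objc_msgSend while the old cache is being deallocated, but that new call can no longer see the old pointer, so it is safe.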
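For readers who want to see the atomics that the counter version glosses over, here is a minimal sketch using C11 <stdatomic.h>. It is an illustration of the counter idea only, not anything the runtime does: the names g_cache, g_in_msg_send, reader_lookup, and replace_cache are invented here, and every atomic operation uses the default sequentially consistent ordering, which is the simplest choice rather than the cheapest.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { void *sel; void *imp; } bucket_t;

static _Atomic(bucket_t *) g_cache;     /* stands in for class->cache */
static atomic_int g_in_msg_send;        /* readers currently inside a lookup */

/* Reader side: announce entry, load the cache pointer, use it, announce exit.
   The caller must pass an index below the current bucket count. */
static void *reader_lookup(size_t index) {
    atomic_fetch_add(&g_in_msg_send, 1);
    bucket_t *cache = atomic_load(&g_cache);
    void *imp = cache ? cache[index].imp : NULL;
    atomic_fetch_sub(&g_in_msg_send, 1);
    return imp;
}

/* Writer side: publish an empty new cache, wait until no reader is inside
   a lookup, then free the old cache. */
static void replace_cache(size_t new_count) {
    bucket_t *fresh = calloc(new_count, sizeof *fresh);
    assert(fresh != NULL);

    /* After this exchange, new readers can only load `fresh`. */
    bucket_t *old = atomic_exchange(&g_cache, fresh);

    /* A reader that loaded `old` incremented the counter before doing so,
       so observing zero here means no reader can still be using `old`. */
    while (atomic_load(&g_in_msg_send) != 0)
        ;  /* spin */

    free(old);
}
```

The ordering matters on both sides: the reader increments before loading the pointer, and the writer swaps the pointer before checking the counter. Reversing either pair reopens the window.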

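Spinning is inefficient and inelegant. It's not particularly urgent to free these caches. It's nice to deallocate the memory, but it's not terrible if that takes some time. Rather than spinning, let's keep a list of unfreed caches, and each time something is freed, try to clear everything that's pending:

    bucket_t *oldCache = class->cache;
    class->cache = malloc(newSize);

    append(gOldCachesList, oldCache);
    if(!gInMsgSend) {
        for(cache in gOldCachesList) {
            free(cache);
        }
        gOldCachesList.clear();
    }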

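If a message send is in progress, this won't free the old cache immediately, but that's not a problem. It will be cleared the next time through, or the time after that, or at some point in the future. This version is pretty close to how the Objective-C runtime actually does it.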
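The deferred-free list can be sketched in plain C as follows. This is a single-threaded illustration under stated assumptions: the fixed-size g_pending array, the MAX_PENDING limit, and the retire_cache function are hypothetical, and the readers_active argument stands in for the gInMsgSend check. A real implementation would also need to serialize writers that touch the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_PENDING 128

static void  *g_pending[MAX_PENDING];   /* retired caches awaiting free() */
static size_t g_pending_count;

/* Adds `old` to the pending list, then frees everything pending if no
   reader is active. Returns the number of blocks freed by this call. */
static size_t retire_cache(void *old, int readers_active) {
    assert(g_pending_count < MAX_PENDING);
    g_pending[g_pending_count++] = old;

    if (readers_active)
        return 0;                        /* leave the garbage for next time */

    size_t freed = g_pending_count;
    for (size_t i = 0; i < g_pending_count; i++)
        free(g_pending[i]);
    g_pending_count = 0;
    return freed;
}
```

Calling retire_cache twice while a reader is active and once after it has left frees all three blocks on the third call, which is the "next time through, or the time after that" behavior described above.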

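Zero-Cost Flags

There's an extreme asymmetry between the two interacting parts. The objc_msgSend side runs potentially millions of times each second and really needs to be as fast as possible. The best-case running time for a single call is just a few nanoseconds. Resizing the cache, on the other hand, is a rare operation that typically becomes less and less common as an app continues to run. Once the app reaches a steady state, no longer loading new code or modifying method lists and with the caches as big as they need to be, it never happens at all. Before that, it may happen some hundreds or thousands of times as the caches grow to the size they need, but that is extremely rare in comparison to objc_msgSend, and vastly less performance sensitive.

Because of this asymmetry, it's best to put as little as possible on the message send side, even if that makes the cache-freeing part much slower. Shaving off one CPU cycle in objc_msgSend at the cost of a million CPU cycles in each cache free operation is still a net win, by a huge margin.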

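Even a global counter is too costly. It means two additional memory accesses within objc_msgSend, which would still add a great deal of overhead. They would need to be atomic and use memory barriers, which makes it even worse. Fortunately, the Objective-C runtime has a technique for reducing the cost on the objc_msgSend side to zero, at the expense of making the cache-freeing code much slower.

The purpose of the hypothetical global counter is to track whether any thread is within a particular region of code. Threads already have something that tracks what code they're currently running: the program counter. This is the CPU register that holds the memory address of the current instruction. Instead of a global counter, we could check each thread's program counter to see whether it's within objc_msgSend. If all threads are outside, then it's safe to free the old caches. Here's what that implementation would look like:

    BOOL ThreadsInMsgSend(void) {
        for(thread in GetAllThreads()) {
            uintptr_t pc = thread.GetPC();
            if(pc >= objc_msgSend_startAddress && pc <= objc_msgSend_endAddress) {
                return YES;
            }
        }
        return NO;
    }

    bucket_t *oldCache = class->cache;
    class->cache = malloc(newSize);

    append(gOldCachesList, oldCache);
    if(!ThreadsInMsgSend()) {
        for(cache in gOldCachesList) {
            free(cache);
        }
        gOldCachesList.clear();
    }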

Then objc_msgSend doesn't have to do anything special at all. It can access the caches directly without worrying about flagging that access. It just does:

    lookUpCache(class->cache);

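The cache-freeing code is pretty inefficient, because it needs to examine the state of every thread in the process. But objc_msgSend is as efficient as it would be if it were written for a single-threaded environment, and that's a tradeoff well worth making. This is ultimately how Apple's runtime code works. (Translator's note: the thread-safe cache-clearing mechanism around objc_msgSend may have been further optimized on modern systems.)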
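The core of the program-counter check is an inclusive range test, which can be exercised without any threads at all. The sketch below takes already-sampled PC values as plain integers; the pc_range_t type and any_pc_in_ranges function are names invented for this example, and sampling the PCs from live threads is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One critical region of code, with both bounds inclusive. */
typedef struct {
    uintptr_t start;
    uintptr_t end;
} pc_range_t;

/* Returns 1 if any sampled program counter lies inside any critical
   region, 0 otherwise. */
static int any_pc_in_ranges(const uintptr_t *pcs, size_t pc_count,
                            const pc_range_t *ranges, size_t range_count) {
    assert(pcs != NULL || pc_count == 0);
    assert(ranges != NULL || range_count == 0);

    for (size_t i = 0; i < pc_count; i++) {
        for (size_t j = 0; j < range_count; j++) {
            if (pcs[i] >= ranges[j].start && pcs[i] <= ranges[j].end)
                return 1;   /* this thread is inside a critical region */
        }
    }
    return 0;
}
```

The check is conservative in the useful direction: a thread whose PC is merely near the boundary is still reported as inside if it falls within the inclusive bounds, which at worst delays freeing rather than freeing too early.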

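The Real Code

Apple's implementation of this technique can be found in the runtime function _collecting_in_critical, located in objc-cache.mm.

The critical PC locations are stored in global variables:

    OBJC_EXPORT uintptr_t objc_entryPoints[];
    OBJC_EXPORT uintptr_t objc_exitPoints[];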

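There are actually multiple objc_msgSend implementations (for things like struct returns), and the internal cache_getImp function also accesses the cache directly. All of them need to be checked in order to deallocate caches safely.

The function itself takes no parameters and returns int, which is used only as a boolean flag indicating whether any thread is in one of the critical functions:

    static int _collecting_in_critical(void)
    {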

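I'm going to skip over the less interesting bits of this function in order to concentrate on the best parts. If you want to see the whole thing, take a look at opensource.apple.com.

The APIs for getting thread information live at the mach level. task_threads gets a list of all threads in a given task (mach's term for a process), and this code uses it to get the threads in its own process:

        ret = task_threads(mach_task_self(), &threads, &number);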

That returns an array of thread_t values in threads, and the number of threads in number. The code then loops over them:

        for (count = 0; count < number; count++)
        {

Fetching the PC for a thread is done in a separate function, which we'll look at shortly:

            pc = _get_pc_for_thread (threads[count]);

It then loops over the entry and exit points and compares the PC against each pair:

            for (region = 0; objc_entryPoints[region] != 0; region++)
            {
                if ((pc >= objc_entryPoints[region]) &&
                    (pc <= objc_exitPoints[region]))
                {
                    result = TRUE;
                    goto done;
                }
            }
        }

After the loop, it returns the result to the caller:

        return result;
    }

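How does _get_pc_for_thread work? It's a relatively simple bit of code that calls thread_get_state to get the register state of the target thread. The main reason it lives in a separate function is that the register state structures are architecture-specific, since each architecture has different registers. That means this function needs a separate implementation for each supported architecture, although the implementations are almost identical. Here's the implementation for x86-64:

    static uintptr_t _get_pc_for_thread(thread_t thread)
    {
        x86_thread_state64_t            state;
        unsigned int count = x86_THREAD_STATE64_COUNT;
        kern_return_t okay = thread_get_state (thread, x86_THREAD_STATE64, (thread_state_t)&state, &count);
        return (okay == KERN_SUCCESS) ? state.__rip : PC_SENTINEL;
    }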

Note that rip is the register name of the PC on x86-64; the R stands for "register," and IP stands for "instruction pointer."

The entry and exit points themselves are defined in the assembly language file where the functions in question are defined. They look like this:

    .private_extern _objc_entryPoints
    _objc_entryPoints:
        .quad   _cache_getImp
        .quad   _objc_msgSend
        .quad   _objc_msgSend_fpret
        .quad   _objc_msgSend_fp2ret
        .quad   _objc_msgSend_stret
        .quad   _objc_msgSendSuper
        .quad   _objc_msgSendSuper_stret
        .quad   _objc_msgSendSuper2
        .quad   _objc_msgSendSuper2_stret
        .quad   0

    .private_extern _objc_exitPoints
    _objc_exitPoints:
        .quad   LExit_cache_getImp
        .quad   LExit_objc_msgSend
        .quad   LExit_objc_msgSend_fpret
        .quad   LExit_objc_msgSend_fp2ret
        .quad   LExit_objc_msgSend_stret
        .quad   LExit_objc_msgSendSuper
        .quad   LExit_objc_msgSendSuper_stret
        .quad   LExit_objc_msgSendSuper2
        .quad   LExit_objc_msgSendSuper2_stret
        .quad   0

_collecting_in_critical is used much like in the hypothetical examples above. It's called before the code that frees leftover cache garbage. The runtime actually has two separate modes: one that leaves the garbage for next time if other threads are in a critical function, and one that spins in a loop until the coast is clear and always deallocates the garbage:

    // Synchronize collection with objc_msgSend and other cache readers
    if (!collectALot) {
        if (_collecting_in_critical ()) {
            // objc_msgSend (or other cache reader) is currently looking in
            // the cache and might still be using some garbage.
            if (PrintCaches) {
                _objc_inform ("CACHES: not collecting; "
                              "objc_msgSend in progress");
            }
            return;
        }
    }
    else {
        // No excuses.
        while (_collecting_in_critical())
            ;
    }

    // free garbage here

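The first mode, which leaves garbage for next time, is used for normal cache resizes. The spin mode that always frees garbage is used in the runtime method that flushes the caches of all classes, since that would typically generate a large amount of garbage. As best I can tell from examining the code, this only happens when enabling a debug logging facility that logs all message sends to a file. It flushes the caches because the message cache interferes with the logging.

Conclusion

Performance and thread safety are often at odds with each other. There is often an asymmetry in how different parts of the code access shared data, which makes more efficient thread safety possible. A global flag or counter that indicates when a mutating action is unsafe is one way to exploit this. In the Objective-C runtime, Apple takes this a step further and uses the program counter of each thread as an implicit indication of when a thread is taking unsafe action. (Translator's note: modern runtime implementations may have adopted other thread-safety optimizations.) This is a specialized case and it's hard to see where else the technique could be useful, but it's fascinating to take apart.

That's it for today. Check back next time for more exciting Friday Q&A. The column is driven by reader ideas, so if you have a topic you'd like to see covered here, please send it in!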

#Original (English)

Source: https://www.mikeash.com/pyblog/friday-qa-2015-05-29-concurrent-memory-deallocation-in-the-objective-c-runtime.html

The Objective-C runtime is at the heart of much Mac and iOS code. At the heart of the runtime is the objc_msgSend function, and the heart of that is the method cache. Today I’m going to explore how Apple manages resizing and deallocating method cache memory in a thread safe manner without impacting performance, using a technique you probably won’t find in textbooks discussing thread safety.

Message Sending in Concept

objc_msgSend works by looking up the appropriate method implementation for the method being sent, and then jumping to it. Conceptually, looking up the method works like this:

    IMP lookUp(id obj, SEL selector) {
        Class c = object_getClass(obj);

        while(c) {
            for(int i = 0; i < c->numMethods; i++) {
                Method m = c->methods[i];
                if(m.selector == selector) {
                    return m.imp;
                }
            }

            c = c->superclass;
        }

        return _objc_msgForward;
    }

Some names have been changed to protect the innocent. If you’re interested in seeing the real code, check out the Objective-C runtime source code:

http://www.opensource.apple.com/source/objc4/

Method Cache

Most Objective-C code sends messages all over the place. If the full message search was performed for each one, it would be unbelievably slow.

The solution to this is a cache. Each class has a hash table attached to it which maps selectors to method implementations. The hash table is built for maximum read efficiency, and objc_msgSend uses carefully tuned assembly language code to perform the hash table lookup quickly. This gets a message send in the cached case down to single-digit nanoseconds. The first use of any given message is still unbelievably slow, but after that it’s fast.

When we think of a cache, it’s usually something with a limited size that’s intended to speed up multiple accesses to recently used resources. For example, you might cache images that you load from the internet so that two fetches in quick succession don’t hit the network twice. You don’t want to use too much memory, though, so you might cap the number of images you keep in the cache at any given time, and throw away the oldest image when a new one comes in after it fills up.

This is a fine approach for many problems but it can have unfortunate performance implications. For example, if you set your image cache to store 40 images, and you run into a case where your application is constantly cycling through 41 images, your cache suddenly becomes completely useless.

For our own apps we can test and tune the caches to avoid this, but the Objective-C runtime doesn’t have this option. Because the method cache is so critical to performance, and because each entry is relatively small, the runtime doesn’t impose any size limit on the caches, and expands them as necessary to cache all messages that have been sent.

Note that the caches do sometimes get flushed; any time something happens that might cause the cached data to become stale, such as loading new code into the process or modifying a class’s method lists, the appropriate caches are destroyed and allowed to refill.

Resizing, Deallocation, and Threads

Resizing the cache is pretty simple in concept. It looks something like:

    bucket_t *newCache = malloc(newSize);
    copyEntries(newCache, class->cache);
    free(class->cache);
    class->cache = newCache;

The Objective-C runtime actually takes a small shortcut here: it doesn’t even copy the old entries into the new cache! It’s just a cache, after all, and there’s no requirement to preserve the data it contains. Entries refill as messages are sent. So it’s really just:

    free(class->cache);
    class->cache = malloc(newSize);

In a single-threaded environment, this would be all you need, and this article would be short. But of course the Objective-C runtime has to support multithreaded code, and that means that all of this code has to be thread safe. Any given class’s cache can be accessed simultaneously from multiple threads, so this code has to take care to ensure that it tolerates that scenario.

As written here, it won’t. There’s a window of opportunity after freeing the old cache and before assigning the new cache where another thread might access an invalid cache pointer. This could cause it to see garbage data, or even just crash immediately if the underlying memory was unmapped.

How can we solve this problem? The typical approach to protecting shared data like this is to use a lock. The code would then look like:

    lock(class->lock);
    free(class->cache);
    class->cache = malloc(newSize);
    unlock(class->lock);

All accesses must be gated by the lock, including reads, for this to work. That means that objc_msgSend would have to acquire the lock, look in the cache, and release the lock. Acquiring and releasing the lock each time would add a lot of overhead, considering that the cache lookup itself only takes a few nanoseconds. The performance impact is just too high.

We might try to close the window in some other way. For example, what if we allocated and assigned the new cache first, and then deallocated the old cache?

    bucket_t *oldCache = class->cache;
    class->cache = malloc(newSize);
    free(oldCache);

This helps, but it doesn’t solve the problem. Another thread might retrieve the old cache pointer, then get preempted by the system before it can access the contents. The old cache could then be destroyed before the other thread runs again, causing the same problems as before.

What if we put in a delay? Something like:

    bucket_t *oldCache = class->cache;
    class->cache = malloc(newSize);
    after(5 /* seconds */, ^{
        free(oldCache);
    });

This is almost certain to work. But it’s still conceivable that a thread might get preempted at just the right moment and stay preempted for long enough that the five-second delay fires first. This makes the crash extremely unlikely, but doesn’t completely eliminate it.

Rather than an arbitrary delay, how about waiting until the window is surely clear? Let’s add a counter to objc_msgSend so that it looks something like:

    gInMsgSend++;
    lookUpCache(class->cache);
    gInMsgSend--;

A proper thread safe version would need to use atomics for the counter and appropriate memory barriers to make sure the dependent loads/stores show up properly. For the purposes of this article, just imagine that stuff is there.

With the counter, cache reallocation would look like:

    bucket_t *oldCache = class->cache;
    class->cache = malloc(newSize);
    while(gInMsgSend)
        ; // spin
    free(oldCache);

Note that there is no need to block execution of objc_msgSend for this to work properly. Once the cache free code is sure that nothing is in objc_msgSend at any particular moment after it has replaced the cache pointer, it can go ahead and free the old one. Another thread might call out to objc_msgSend while the old cache pointer is being deallocated, but this new call can’t possibly see the old pointer anymore, so it’s safe.

Spinning is inefficient and inelegant. It’s not particularly urgent to free these caches. It’s nice to deallocate the memory, but it’s not terrible if it takes some time. Rather than spinning, let’s keep a list of unfreed caches, and each time something is freed, try to clear everything that’s pending:

    bucket_t *oldCache = class->cache;
    class->cache = malloc(newSize);

    append(gOldCachesList, oldCache);
    if(!gInMsgSend) {
        for(cache in gOldCachesList) {
            free(cache);
        }
        gOldCachesList.clear();
    }

If a message send is in progress then this won’t immediately free the old cache, but that’s not a problem. The next time through it will be cleared, or the time after that, or at some point in the future.

This version is pretty close to how the Objective-C runtime actually does it.

Zero-Cost Flags

There’s an extreme asymmetry here between the two interacting parts. The objc_msgSend side runs potentially millions of times each second and really needs to be as fast as possible. The best case running time for a single call is just a few nanoseconds. On the other hand, resizing the cache is a rare operation that will typically get less and less common as an app continues to run. Once the app reaches a steady state, no longer loading new code or editing method lists and with the caches as big as they need to be, it’ll never happen. Before that, it may happen some hundreds or thousands of times as the caches grow to the size they need, but it’s extremely rare in comparison to objc_msgSend and vastly less performance sensitive.

Because of this asymmetry, it’s best to put as little as possible on the message send side, even if it makes the cache freeing part much slower. Shaving off one CPU cycle in objc_msgSend at the cost of a million CPU cycles in each cache free operation is a net win, by a huge margin.

Even a global counter is too costly. That’s two additional memory accesses within objc_msgSend which would still add a great deal of overhead. They would need to be atomic and use memory barriers which makes it even worse. Fortunately, the Objective-C runtime has a technique for reducing the cost on the objc_msgSend side to zero, at the expense of making the cache free code much slower.

The purpose of the hypothetical global counter is to track when any thread is within a particular region of code. Threads already have something that tracks what code they’re currently running: the program counter. This is the CPU register which tracks the memory address of the current instruction. Instead of a global counter, we could check each thread’s program counter to see if it’s within objc_msgSend. If all threads are outside, then it’s safe to free the old caches. Here’s what that implementation would look like:

    BOOL ThreadsInMsgSend(void) {
        for(thread in GetAllThreads()) {
            uintptr_t pc = thread.GetPC();
            if(pc >= objc_msgSend_startAddress && pc <= objc_msgSend_endAddress) {
                return YES;
            }
        }
        return NO;
    }

    bucket_t *oldCache = class->cache;
    class->cache = malloc(newSize);

    append(gOldCachesList, oldCache);
    if(!ThreadsInMsgSend()) {
        for(cache in gOldCachesList) {
            free(cache);
        }
        gOldCachesList.clear();
    }

Then objc_msgSend doesn’t have to do anything special at all. It can access the caches directly without worrying about flagging that access. It just does:

    lookUpCache(class->cache);

The cache free code is pretty inefficient because it needs to examine the state of every thread in the process. But objc_msgSend is as efficient as it would be if it were written for a single-threaded environment, and that’s a tradeoff well worth making. This is ultimately how Apple’s runtime code works.

The Real Code

Apple’s implementation of this technique can be found in the runtime function _collecting_in_critical located in objc-cache.mm.

The critical PC locations are stored in global variables:

    OBJC_EXPORT uintptr_t objc_entryPoints[];
    OBJC_EXPORT uintptr_t objc_exitPoints[];

There are actually multiple objc_msgSend implementations (for things like struct returns), and the internal cache_getImp function also accesses the cache directly. They all need to be checked in order to safely deallocate caches.

The function itself takes no parameters and returns int, which is just used as a boolean flag to indicate whether any threads are in one of the critical functions or not:

    static int _collecting_in_critical(void)
    {

I’m going to skip over the less interesting bits of code in this function in the interest of concentrating on the best parts. If you want to see the whole thing, take a look at opensource.apple.com.

The APIs for getting thread information lie at the mach level. task_threads gets a list of all threads in a given task (mach’s term for a process), and this code uses it to get the threads in its own process:

        ret = task_threads(mach_task_self(), &threads, &number);

That returns an array of thread_t values in threads, and the number of threads in number. Then it loops over them:

        for (count = 0; count < number; count++)
        {

Fetching the PC for a thread is done in a separate function, which we’ll look at shortly:

            pc = _get_pc_for_thread (threads[count]);

It then loops over the entry and exit points and compares with each one:

            for (region = 0; objc_entryPoints[region] != 0; region++)
            {
                if ((pc >= objc_entryPoints[region]) &&
                    (pc <= objc_exitPoints[region]))
                {
                    result = TRUE;
                    goto done;
                }
            }
        }

After the loop, it returns the result to the caller:

        return result;
    }

How does _get_pc_for_thread work? It’s a relatively simple bit of code that calls thread_get_state to get the register state of the target thread. The main reason it’s in a separate function is because the register state structures are architecture-specific, since each architecture has different registers. That means this function needs a separate implementation for each supported architecture, although the implementations are almost identical. Here’s the implementation for x86-64:

    static uintptr_t _get_pc_for_thread(thread_t thread)
    {
        x86_thread_state64_t            state;
        unsigned int count = x86_THREAD_STATE64_COUNT;
        kern_return_t okay = thread_get_state (thread, x86_THREAD_STATE64, (thread_state_t)&state, &count);
        return (okay == KERN_SUCCESS) ? state.__rip : PC_SENTINEL;
    }

Note that rip is the register name of the PC on x86-64; the R stands for “register,” and the IP stands for “instruction pointer.”

The entry and exit points themselves are defined in the assembly language file where the functions in question are defined. They look like this:

    .private_extern _objc_entryPoints
    _objc_entryPoints:
        .quad   _cache_getImp
        .quad   _objc_msgSend
        .quad   _objc_msgSend_fpret
        .quad   _objc_msgSend_fp2ret
        .quad   _objc_msgSend_stret
        .quad   _objc_msgSendSuper
        .quad   _objc_msgSendSuper_stret
        .quad   _objc_msgSendSuper2
        .quad   _objc_msgSendSuper2_stret
        .quad   0

    .private_extern _objc_exitPoints
    _objc_exitPoints:
        .quad   LExit_cache_getImp
        .quad   LExit_objc_msgSend
        .quad   LExit_objc_msgSend_fpret
        .quad   LExit_objc_msgSend_fp2ret
        .quad   LExit_objc_msgSend_stret
        .quad   LExit_objc_msgSendSuper
        .quad   LExit_objc_msgSendSuper_stret
        .quad   LExit_objc_msgSendSuper2
        .quad   LExit_objc_msgSendSuper2_stret
        .quad   0

_collecting_in_critical is used much like in the hypothetical examples above. It’s called before the code that frees leftover cache garbage. The runtime actually has two separate modes: one which leaves the garbage for the next time if other threads are in a critical function, and one which spins in a loop until the coast is clear, and always deallocates the garbage:

    // Synchronize collection with objc_msgSend and other cache readers
    if (!collectALot) {
        if (_collecting_in_critical ()) {
            // objc_msgSend (or other cache reader) is currently looking in
            // the cache and might still be using some garbage.
            if (PrintCaches) {
                _objc_inform ("CACHES: not collecting; "
                              "objc_msgSend in progress");
            }
            return;
        }
    }
    else {
        // No excuses.
        while (_collecting_in_critical())
            ;
    }

    // free garbage here

The first mode, which leaves garbage for the next time, is used for normal cache resizes. The spin mode that always frees garbage is used in the runtime method that flushes all caches for all classes as this would typically generate a large amount of garbage. As best I can tell from examining the code, this only happens when enabling a debug logging facility that logs all message sends to a file. It flushes caches because the message cache interferes with the logging.

Conclusion

Performance and thread safety are often at odds with each other. Often there is asymmetry in how different parts of code access shared data, which allows more efficient thread safety. A global flag or counter that indicates when a mutating action is unsafe can be one way to exploit this. In the Objective-C runtime, Apple takes this a step further and uses the program counter of each thread as an implicit indication of when a thread is taking unsafe action. This is a specialized case and it’s hard to see where else the technique could be useful, but it’s fascinating to take apart.

That’s it for today. Check back next time for more exciting action. Friday Q&A is driven by reader ideas, so if you have an idea you’d like to see covered here, please send it in!