Disassembling the Assembly, Part 2

Published: December 23, 2011

Author: TommyWu

Translation of "Friday Q&A 2011-12-23: Disassembling the Assembly, Part 2" by Mike Ash

Original: https://www.mikeash.com/pyblog/friday-qa-2011-12-23-disassembling-the-assembly-part-2.html (published 2011-12-23, by Mike Ash). Translator: MiMo (mimo-v2.5-pro); code blocks are kept in English as in the original.

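Today I have the pleasure to present the followup to last week's guest post. Gwynne Raskind returns to complete her in-depth analysis of the assembly code generated by a small sample program.

In last week's article, I discussed the x86_64 architecture and the disassembly of the main function of Mike's example code. This is part 2, in which I look at the differences in optimized code, disassembly of the rest of the sample code, the start runtime function, and some functions that work with floating-point values. If you haven't yet read part 1, I strongly recommend it, since otherwise part 2 won't make much sense.

Optimization

In part 1, I purposely examined the unoptimized version of the assembly language produced by the compiler, under the theory that optimization would obscure the finer details of how the code works at the assembler level. Now it's time to see what optimized code looks like. Here's main in assembly again, this time compiled with -Os: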

    _main:
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r12
        pushq   %rbx
        callq   _objc_autoreleasePoolPush
        movq    %rax, %r14
        movq    L_OBJC_CLASSLIST_REFERENCES_$_(%rip), %rdi
        leaq    l_objc_msgSend_fixup_alloc(%rip), %rsi
        callq   *l_objc_msgSend_fixup_alloc(%rip)
        movq    L_OBJC_SELECTOR_REFERENCES_27(%rip), %rsi
        leaq    L__unnamed_cfstring_26(%rip), %rdx
        movq    _objc_msgSend@GOTPCREL(%rip), %rbx
        movq    %rax, %rdi
        movl    $42, %ecx
        callq   *%rbx
        movq    %rax, %r15
        movq    L_OBJC_SELECTOR_REFERENCES_28(%rip), %rsi
        movq    %r15, %rdi
        callq   *%rbx
        movq    %rax, %rdi
        callq   _objc_retainAutoreleasedReturnValue
        movq    %rax, %rbx
        movq    %rbx, %rdi
        callq   _MyFunction
        movq    %rax, %rdi
        callq   _objc_retainAutoreleasedReturnValue
        movq    %rax, %r12
        movq    %rbx, %rdi
        callq   _objc_release
        leaq    L__unnamed_cfstring_23(%rip), %rdi
        movq    %r12, %rsi
        xorb    %al, %al
        callq   _NSLog
        movq    %r12, %rdi
        callq   _objc_release
        movq    %r15, %rdi
        callq   _objc_release
        movq    %r14, %rdi
        callq   _objc_autoreleasePoolPop
        xorl    %eax, %eax
        popq    %rbx
        popq    %r12
        popq    %r14
        popq    %r15
        popq    %rbp
        ret

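The unoptimized version of main was 60 lines; this optimized code is only 49. The compiler managed to save 11 instructions. Expecting more is unreasonable; optimization, even when done for size savings, tends to be more concerned with making efficient use of the CPU and its abilities than with using the absolute minimum number of instructions. On almost any modern processor, there is far more benefit in using a few extra simple instructions than in using fewer, more complicated ones. Compiling with -O3, which optimizes heavily for speed over size, actually increases the code size to 65 instructions, mostly due to inlining.

Because I've already explained the meaning of all of the individual instructions involved (with one exception), in this breakdown I'll look purely at groups of instructions and how the compiler has optimized each section.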

    pushq %rbp
    movq %rsp, %rbp

Look familiar? It should; this is exactly the same instruction sequence main started with before. Nothing's changed about the code which sets up the stack pointer; the stack frame has to be set up in a particular way and this is it (more on this later).

    pushq   %r15
    pushq   %r14
    pushq   %r12
    pushq   %rbx

Instead of a bunch of values being stored to the stack, the optimizer has chosen to save the values of several registers to the stack so they can be used as scratch space during the function. The x86_64 ABI specifies which registers are preserved across function calls and which can be freely used as scratch, and none of these four are freely usable. Since registers are potentially thousands of times faster than the stack in some cases (in fact, the delay can stretch into seconds if the stack happens to be paged out to disk!), it's certain to be a win to use the stack once at the beginning and once at the end, and to manipulate data in registers during the function's execution.
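The ABI rule referred to here can be written down as a small lookup. This is a minimal sketch in C, assuming the System V x86_64 calling convention that OS X follows; the helper name is_callee_saved is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Registers a function must preserve for its caller under the
 * System V x86_64 ABI. Anything not listed is caller-saved scratch. */
static const char *const kCalleeSaved[] = {
    "rbx", "rbp", "rsp", "r12", "r13", "r14", "r15"
};

/* Hypothetical helper: true if `reg` must be saved before it is reused. */
bool is_callee_saved(const char *reg)
{
    assert(reg != NULL);
    for (size_t i = 0; i < sizeof kCalleeSaved / sizeof kCalleeSaved[0]; i++) {
        if (strcmp(reg, kCalleeSaved[i]) == 0)
            return true;
    }
    return false;
}
```

All four registers pushed by the optimized main (r15, r14, r12, rbx) are in this set, which is why main saves them before using them as scratch space, while rax, rdi, and rsi are not.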

    callq   _objc_autoreleasePoolPush
    movq    %rax, %r14

objc_autoreleasePoolPush takes no arguments and returns a simple integer value in rax. The optimizer saves the return value in r14 instead of spilling it to the stack.

    movq    L_OBJC_CLASSLIST_REFERENCES_$_(%rip), %rdi
    leaq    l_objc_msgSend_fixup_alloc(%rip), %rsi
    callq   *l_objc_msgSend_fixup_alloc(%rip)

Load the MyClass class object into rdi, load the address of l_objc_msgSend_fixup_alloc into rsi, and call the function. It's much the same sequence as the unoptimized code, but without the stack use and all in one place.

    movq    L_OBJC_SELECTOR_REFERENCES_27(%rip), %rsi
    leaq    L__unnamed_cfstring_26(%rip), %rdx
    movq    _objc_msgSend@GOTPCREL(%rip), %rbx
    movq    %rax, %rdi
    movl    $42, %ecx
    callq   *%rbx
    movq    %rax, %r15

Load the [MyClass initWithName:number:] selector into rsi, load @"name" into rdx, load the address of objc_msgSend@GOTPCREL into rbx, load the return value from alloc into rdi, load 42 into ecx, call objc_msgSend@GOTPCREL, and save the return value (i.e. obj) in r15.

objc_msgSend@GOTPCREL? What in the world is that thing? Well, as it turns out, it's more than meets the eye. If you peek at the generated machine code with a disassembler, it turns out to not be a mov instruction at all, but rather a lea! GOTPCREL is a directive which allows the rip-relative address of a function to be inserted at link time so a direct call can be made, if that address can be calculated at link time. objc_msgSend is one of the functions for which this is true, and optimization lets the compiler make the attempt.

In other words, when optimization is on, the compiler generates code that makes a short, fast call to the function instead of making it go through the slower dynamic library call, potentially a "far jump" (a branch over a long distance of code, which, by necessity, is much slower).

Note: I'm not 100% sure of my facts on this one; I'd appreciate any insight anyone has on the specifics of @GOTPCREL. (Translator's note: the details of how @GOTPCREL is handled may have changed in current compilers and linkers.)

    movq    L_OBJC_SELECTOR_REFERENCES_28(%rip), %rsi
    movq    %r15, %rdi
    callq   *%rbx

Load the name selector into rsi, load obj from r15 into rdi, and call objc_msgSend again. This is a case where optimization really begins to show its use. The unoptimized version of this call saved and loaded to and from the stack and other registers, effectively redoing the entire setup for the second message send. The optimizer recognizes that the extra data copying is redundant and just loads everything directly, and even more importantly, avoids loading data that's already in a register into it again.

    movq    %rax, %rdi
    callq   _objc_retainAutoreleasedReturnValue

Take the return value from the last message send and immediately pass it to objc_retainAutoreleasedReturnValue. This is the same sequence as in the unoptimized code. In fact, some things in the Objective-C runtime behave differently based specifically on the presence of these two instructions.

    movq    %rax, %rbx
    movq    %rbx, %rdi
    callq   _MyFunction
    movq    %rax, %rdi
    callq   _objc_retainAutoreleasedReturnValue
    movq    %rax, %r12

Call MyFunction(name), retain its return value, and save the result in r12. The extra store to rbx looks redundant, but it isn't, as we'll see further down.

    movq    %rbx, %rdi
    callq   _objc_release

See? Both rax and rdi have been reused since the return value of [MyClass name] was saved into rbx. It wasn't redundant at all!

"But why didn't the compiler just leave it in rbx in the first place?" Remember, the first parameter to a function has to be in rdi. The value has to be kept somewhere that won't be overwritten by the very next operation.

    leaq    L__unnamed_cfstring_23(%rip), %rdi
    movq    %r12, %rsi
    xorb    %al, %al
    callq   _NSLog

Call NSLog(@"%@", return value of MyFunction) with no vector registers used. Remember that variadic functions require the number of vector registers used for arguments to be passed in al. Nothing special here.

    movq    %r12, %rdi
    callq   _objc_release
    movq    %r15, %rdi
    callq   _objc_release

Release the two objects no longer in use (the return value of MyFunction, and obj). Technically, obj was already unused by the time of the NSLog, but ARC's code flow analysis isn't quite that aggressive; releases happen at the end of the enclosing scope, not the instant a value is no longer used.

Note: The effective enclosing scope of the return from [MyClass name] is the MyFunction call itself; it was never assigned to a variable (specifically, a __strong variable), so it is not considered potentially "live" after the function call.

    movq    %r14, %rdi
    callq   _objc_autoreleasePoolPop
    xorl    %eax, %eax
    popq    %rbx
    popq    %r12
    popq    %r14
    popq    %r15
    popq    %rbp
    ret

Pop the autorelease pool, zero eax for main's return value, restore the saved registers, and return.

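And that's the optimized version of main. The major effect of optimization here is much better use of registers; the stack isn't used at all except to save registers, and there's no redundant or useless data copying.

Think you can do better than the compiler? There may be other opportunities for optimization, but most of the ones that look obvious are actually prohibited by the way the CPU, the ABI, Objective-C, and ARC work.

Hint: The push and pop of rbp, and the copy of rsp to rbp, are unnecessary, because optimization has removed every reference to rbp from the function body! main would still work without those three instructions, but the debugger might not! The debugger depends, in some cases, on the presence of a stack frame, which consists of a properly initialized base pointer register and a saved base pointer value on the stack. Some other system functions may also rely on stack frames, though this rarely comes up in normal use. On OS X, the switch that tells GCC and Clang to omit stack frames is disabled by default even at high optimization levels, which suggests someone decided that saving three instructions per function isn't worth it. It probably isn't. The system frameworks, for example, are built with full stack frames. As a rule, always include stack frames unless you have a good reason not to.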

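The MyFunction Function

Next, let's look at MyFunction:

    NSString *MyFunction(NSString *parameter)
    {
        NSString *string2 = [@"Prefix" stringByAppendingString: parameter];
        NSLog(@"%@", string2);
        return string2;
    }

I'm going to take this function in reverse. Instead of looking at the assembly the compiler generated, let's use what we learned from main about how the compiler works and build it by hand. After all, everything this function does has already appeared in main. As a bonus, we'll even insert the necessary ARC calls.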

Function prologue:

    _MyFunction:
        pushq %rbp
        movq %rsp, %rbp

Every C function has a prologue. See the discussion of stack frames above. This is the stack frame setup for our new function, along with its label for completeness. As a matter of language convention, all C function names get an underscore prepended at the assembly stage. A look at the symbol table of any executable or library shows that almost every symbol has at least one leading underscore.

Save registers:

    pushq %rbx

We only need one scratch register for this function, so let's use rbx.

Call stringByAppendingString:

    movq %rdi, %rdx
    leaq L_prefix_string_reference(%rip), %rdi
    movq L_stringByAppendingString__selector_reference(%rip), %rsi
    callq *_objc_msgSend@GOTPCREL(%rip)
    movq %rax, %rdi
    callq _objc_retainAutoreleasedReturnValue

First, I assume the string @"Prefix" lives at the label L_prefix_string_reference, a name I made up. Label names are arbitrary; the compiler's official-looking names are simply auto-generated. Even the leading L_ is just a convention I chose to follow so the code looks more like the compiler's version. Likewise, I assume L_stringByAppendingString__selector_reference points to the corresponding selector name.

From there, I move rdi to rdx. Since parameter is the first argument to MyFunction, it arrives in rdi, and I've now made it the third argument of the function about to be called. I load @"Prefix" as the first argument and the -stringByAppendingString: selector as the second, then call the rip-relative version of objc_msgSend. Finally, I take the return value and pass it to objc_retainAutoreleasedReturnValue, as ARC requires. ARC only operates at the level of the Objective-C compiler; in assembly, the calls have to be made by hand, just as in ordinary retain-release code, but with stricter rules.

Call NSLog:

    movq %rax, %rsi
    leaq L_format_string_reference(%rip), %rdi
    xorb %al, %al
    callq _NSLog

I'll tell you right now that this code is wrong in one important way: knowing that I'll need the return value of -stringByAppendingString: again later, I've wrongly assumed that rax and rsi won't be changed by the call to NSLog. But the x86_64 ABI explicitly says these two registers are not preserved across function calls. We've clobbered them several times in this very code without saving them, so it's hardly reasonable to expect NSLog to behave any differently. (Not only that, but the code itself zeroes the low byte of rax as part of the call sequence!) The value held in rax and rsi before this code runs has to be saved somewhere, or it will be lost across the call.

Note: Even if NSLog did happen to preserve rsi, the calling code can't safely make that assumption. The only time you can assume a function preserves registers beyond what the ABI specifies is when you wrote every line of that function yourself in assembly, and documented the requirement so you won't violate it later.

The solution is to replace the first movq with these two lines:

    movq %rax, %rbx
    movq %rbx, %rsi

The value (called string2 in the original Objective-C source) is now saved in rbx where we can get at it. This is why I saved rbx at the start of the function. (Translator's note: rbx is a callee-saved register in the x86_64 calling convention, so this usage conforms to the ABI.)

Return from function:

    movq %rbx, %rdi
    popq %rbx
    popq %rbp
    jmp _objc_autoreleaseReturnValue ## TAIL CALL

Whoa, wait, what's all this? What's a tail call?

In ARC, an object returned from a function that isn't annotated cf/ns_returns_retained has to be passed to objc_autoreleaseReturnValue. So that has to be the last thing the function does before returning.

"So," you might ask, "why not do movq %rbx, %rdi, then callq _objc_autoreleaseReturnValue, and let rax hold the return value while you do the popq and ret?" The answer: because it's inefficient. When the last thing a function does is call another function and return its result of exactly the same type, a tail call can be used to save time, space, and effort. At the time of the first movq, the stack looks something like this:

    +----------------+
    | RETURN ADDRESS | 16 <--- next instruction in main, pushed by `callq _MyFunction`
    |   Saved %rbp   | 8  <--- saved value of rbp, pushed by prologue
    |   Saved %rbx   | 0  <--- saved value of rbx, pushed by our code
    +----------------+

If I simply did callq _objc_autoreleaseReturnValue, the stack would look like this:

    +----------------+
    | RETURN ADDRESS | 24 <--- next instruction in main, pushed by `callq _MyFunction`
    |   Saved %rbp   | 16 <--- saved value of rbp, pushed by prologue
    |   Saved %rbx   | 8  <--- saved value of rbx, pushed by our code
    +----------------+
    | RETURN ADDRESS | 0  <---- next instruction in MyFunction, pushed by `callq _objc_autoreleaseReturnValue`
    +----------------+

When objc_autoreleaseReturnValue returns, its ret pops the stack right back to where it was, and then the same thing immediately happens again. Since MyFunction has absolutely nothing left to do, wouldn't it be more efficient if objc_autoreleaseReturnValue could return straight to main?

That's what a tail call does. Instead of using call, which pushes a new return address onto the stack, MyFunction restores the stack to the state where it holds only main's return address, then jumps directly to objc_autoreleaseReturnValue. The stack ends up looking like this:

    +----------------+
    | RETURN ADDRESS | 0  <---- next instruction in main, pushed by `callq _MyFunction`!
    +----------------+

Now, when the ret in objc_autoreleaseReturnValue pops the return address off the stack into rip, it jumps straight back to main, with rax correctly holding the return value. We've saved a push, a pop, and (less obviously) some extra work by the CPU. The jmp instruction may also be smaller than callq if the target function happens to be nearby in memory.

From the assembly point of view, a tail call may look like a minor optimization, but saving an entire extra stack frame can make or break a recursive algorithm. In addition, objc_msgSend is fundamentally designed around the use of tail calls; without them, Cocoa programs could run something like an order of magnitude slower, and can you imagine loading a program in the debugger and seeing objc_msgSend before every single method call in the backtrace?
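The shape of code that qualifies for a tail call is easy to show in C. This is a minimal sketch; finish, my_function, and sum_to are hypothetical names, with finish standing in for objc_autoreleaseReturnValue. Whether a compiler actually emits jmp instead of call here depends on the compiler and optimization level, so treat the comments as describing what is permitted, not what is guaranteed.

```c
#include <assert.h>

/* Hypothetical stand-in for objc_autoreleaseReturnValue:
 * it receives a value and hands it back to the caller. */
long finish(long value)
{
    return value;
}

/* The call to finish() is the last thing this function does, and its
 * result is returned unchanged, so the call is in tail position. An
 * optimizer may tear down this frame and jump to finish() directly. */
long my_function(long x)
{
    long result = x * 2;
    return finish(result);
}

/* Tail-recursive sum of 1..n. With the running total carried in `acc`,
 * nothing remains to be done after the recursive call returns, so each
 * level can reuse the caller's stack frame when tail calls are applied. */
long sum_to(long n, long acc)
{
    assert(n >= 0);
    if (n == 0)
        return acc;
    return sum_to(n - 1, acc + n);
}
```

Without tail-call optimization, sum_to uses one stack frame per level of recursion; with it, the recursion runs in constant stack space, which is the "make or break" difference for recursive algorithms.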

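If you look at Clang's version of the assembly, it's almost identical to ours! There are three exceptions:

Clang, of course, names its string and selector references differently.
Clang moves the parameters around in a slightly different order; this has no effect on how the code executes.
For no obvious reason, Clang saves rax to the stack and then completely ignores the value in the function epilogue. What Clang is actually doing is aligning the stack to a 16-byte boundary, as required by SSE instructions and by Cocoa in general. This gives the function a total of 32 bytes on the stack (an even multiple of 16): main's return address, the saved rbp, the saved rbx, and the saved rax. The alignment requirement is strong enough to override the desire to save instructions; without it, the code would be incorrect and would most likely crash at the next call to objc_msgSend.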
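The alignment arithmetic is simple enough to check by hand. This is a minimal sketch in C; the helper names are hypothetical, and it assumes the stack was 16-byte aligned just before main executed callq _MyFunction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { kSlotSize = 8, kStackAlignment = 16 };

/* Bytes this function has placed on the stack: the return address
 * pushed by the caller's callq, plus `pushes` 8-byte register pushes. */
uint64_t frame_bytes(unsigned pushes)
{
    return (uint64_t)(1 + pushes) * kSlotSize;
}

/* True if the stack pointer is back on a 16-byte boundary, given that
 * it was aligned before the call. */
bool is_aligned(uint64_t bytes)
{
    return bytes % kStackAlignment == 0;
}
```

With only rbp and rbx pushed, frame_bytes(2) is 24, which is not a multiple of 16. Pushing rax as well brings the total to frame_bytes(3), or 32, which is. The value of rax is irrelevant; the push exists only to move the stack pointer by another 8 bytes.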

So, here's the final version of our hand-written function as a single listing, including an aligned stack:

    _MyFunction:
        pushq %rbp
        movq %rsp, %rbp
        pushq %rbx
        pushq %rax
        movq %rdi, %rdx
        leaq L_prefix_string_reference(%rip), %rdi
        movq L_stringByAppendingString__selector_reference(%rip), %rsi
        callq *_objc_msgSend@GOTPCREL(%rip)
        movq %rax, %rdi
        callq _objc_retainAutoreleasedReturnValue
        movq %rax, %rbx
        movq %rbx, %rsi
        leaq L_format_string_reference(%rip), %rdi
        xorb %al, %al
        callq _NSLog
        movq %rbx, %rdi
        addq $8, %rsp # ignore the saved rax
        popq %rbx
        popq %rbp
        jmp _objc_autoreleaseReturnValue ## TAIL CALL

Simple Floating-Point Handling

Next, I'll use a new function as a simple example of working with non-integer values. Here's the Objective-C version:

    float MyFPFunction(float parameter)
    {
        float x = parameter + 0.5;

        x -= 0.3f;
        return x;
    }

The line I call it with:

    NSLog(@"%f", MyFPFunction(1.0));

Here's the assembly Clang generates:

    LCPI7_0:
        .long   1056964608              ## float 5.000000e-01
    LCPI7_1:
        .long   3197737370              ## float -3.000000e-01
    _MyFPFunction:                          ## @MyFPFunction
        pushq   %rbp
        movq    %rsp, %rbp
        addss   LCPI7_0(%rip), %xmm0
        addss   LCPI7_1(%rip), %xmm0
        popq    %rbp
        ret

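(I've omitted the assembly for the actual function call, because in an optimized build it's extremely difficult to get Clang to emit the call without inlining the function, and the unoptimized version is different. In any case, the only notable detail is that al is set to 1 for the NSLog call, since it uses a vector register.)

The function is extremely simple:

First, a standard prologue.
Next, since the ABI says the first floating-point value is passed in the first vector register, xmm0, the function operates directly on that register. The addss instruction, simply enough, adds two floating-point values ("add scalar single-precision"). The constants 0.5 and -0.3 (subtracting 0.3 is the same as adding -0.3) are stored as data in the executable, because neither assembly language nor the actual machine code has a way to express floating-point immediates. The values themselves are stored as IEEE-754 single-precision numbers. It so happens that floating-point return values also go in the first vector register, so by operating directly on xmm0, the function has already done everything it needs to.
Finally, a standard function epilogue.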
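The two .long constants in the listing can be checked against their float values directly. This is a minimal sketch in C, assuming a platform where float is a 32-bit IEEE-754 value (true on x86_64); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float. memcpy is used instead of a
 * pointer cast so the conversion is well defined in C. */
float float_from_bits(uint32_t bits)
{
    float value;
    assert(sizeof value == sizeof bits);
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* The reverse: the raw 32-bit pattern that represents a float. */
uint32_t bits_from_float(float value)
{
    uint32_t bits;
    assert(sizeof bits == sizeof value);
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* Mirrors the two addss instructions: add the constant stored at
 * LCPI7_0, then the constant stored at LCPI7_1. */
float my_fp_function(float parameter)
{
    float x = parameter + float_from_bits(1056964608u); /* 0.5f  */
    x += float_from_bits(3197737370u);                  /* -0.3f */
    return x;
}
```

In hexadecimal the constants are 0x3F000000 and 0xBE99999A. The top bit of the second one is the sign bit, which is how a subtraction in the source becomes a second addss of a negative constant.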

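Wasn't that easy? As it turns out, working with floating-point values just means switching over to the 128-bit vector registers and the SSE1 instruction set. The old mmx and st(n) registers, and the x87 instruction set, are obsolete and inefficient compared to the SSE1 operations. (Translator's note: these registers and instruction sets have been superseded on modern systems.)

The C Runtime

When you launch a program, there's some behind-the-scenes work going on. Did you know that main isn't the first function the system calls?

That's right! Once dyld has finished setting up the process's memory space, it jumps to a standard entry point function called start. This function is copied verbatim out of the C runtime library (libcrt) into your executable. It's written entirely in assembly, and it doesn't show up in Clang's assembly output because it doesn't exist in your program until after linking. Let's take a look at its code. I've borrowed the source from Apple's website. As required by the terms of the APSL under which the code is licensed, I've included the APSL license header in the listing.

dyld reads the LC_UNIXTHREAD load command from your binary and uses it to set up the CPU state for the new process. A quick look at the output of otool -l shows that rip is initialized to the load address of the start symbol in the binary image! Neat, huh?

The start function consists of this code: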

    /*
     * Copyright (c) 1999-2008 Apple Inc. All rights reserved.
     *
     * @APPLE_LICENSE_HEADER_START@
     *
     * Portions Copyright (c) 1999 Apple Computer, Inc.  All Rights
     * Reserved.  This file contains Original Code and/or Modifications of
     * Original Code as defined in and that are subject to the Apple Public
     * Source License Version 1.1 (the "License").  You may not use this file
     * except in compliance with the License.  Please obtain a copy of the
     * License at http://www.apple.com/publicsource and read it before using
     * this file.
     *
     * The Original Code and all software distributed under the License are
     * distributed on an "AS IS" basis, WITHOUT WARRANTY OF ANY KIND, EITHER
     * EXPRESS OR IMPLIED, AND APPLE HEREBY DISCLAIMS ALL SUCH WARRANTIES,
     * INCLUDING WITHOUT LIMITATION, ANY WARRANTIES OF MERCHANTABILITY,
     * FITNESS FOR A PARTICULAR PURPOSE OR NON- INFRINGEMENT.  Please see the
     * License for the specific language governing rights and limitations
     * under the License.
     *
     * @APPLE_LICENSE_HEADER_END@
     */
    start:  pushq   $0          # push a zero for debugger end of frames marker
            movq    %rsp,%rbp       # pointer to base of kernel frame
            andq    $-16,%rsp       # force SSE alignment
            movq    8(%rbp),%rdi        # put argc in %rdi
            leaq    16(%rbp),%rsi       # addr of arg[0], argv, into %rsi
            movl    %edi,%edx       # copy argc into %rdx
            addl    $1,%edx         # argc + 1 for zero word
            sall    $3,%edx         # * sizeof(char *)
            addq    %rsi,%rdx       # addr of env[0], envp, into %rdx
            movq    %rdx,%rcx
            jmp Lapple2
    Lapple: add $8,%rcx
    Lapple2:cmpq    $0,(%rcx)       # look for NULL ending env[] array
            jne Lapple
            add $8,%rcx         # once found, next pointer is "apple" parameter now in %rcx
            call    _main
            movl    %eax,%edi       # pass result from main() to exit()
            call    _exit           # need to use call to keep stack aligned
            hlt

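start doesn't work like a C function, because it isn't one. Its job is to get from the bare executable state to a state in which C (and Objective-C) can function. Even the function prologue is nonstandard.

pushq $0 - Push a zero onto the stack. The debugger uses this as an "end of stack frames" marker, in place of the pushq %rbp of a normal prologue.
movq %rsp,%rbp - Grab the stack pointer, since the stack is actually used in this function.
andq $-16,%rsp - Mask off the lowest four bits of the stack pointer. This aligns the initial stack to a 16-byte boundary, as required by SSE instructions and by Cocoa in general. It's most likely a no-op in practice, since the system usually provides a correctly aligned stack, but the C runtime doesn't and can't assume that.
movq 8(%rbp),%rdi - The "kernel frame" mentioned in the comments is the data on the stack when dyld calls start. The first (topmost) value is the familiar argc parameter to main. Putting it in rdi sets up the first argument for the function call.
leaq 16(%rbp),%rsi - The second value on the stack is argv, so it becomes the second function argument.
movl %edi,%edx - Copy the low 4 bytes of argc into rdx.
addl $1,%edx - Add 1 to the copy of argc.
sall $3,%edx - Multiply that value by 8 (a left shift by 3 bits is a multiplication by 8). edx now holds the total size in bytes of the argv array, including its terminating NULL.
addq %rsi,%rdx - Add the computed size to the address of argv to get a pointer to the end of argv. Why? On OS X, the rarely used envp array, passed as the third argument to main, sits in memory immediately after argv. The third function argument is now envp.
movq %rdx,%rcx - Then copy envp into the fourth function argument.

            jmp Lapple2
    Lapple: add $8,%rcx
    Lapple2:cmpq    $0,(%rcx)       # look for NULL ending env[] array
            jne Lapple

These four lines form a simple loop that increments rcx by 8 until the memory it points to contains zero (NULL). In C terms, treating rcx as a char **, this is while (*rcx != NULL) rcx++;. jne means "jump if not equal", or equivalently "jump if ZF (the zero flag) is zero". ZF is set by the preceding cmp instruction, which means "set rflags according to the result of subtracting the two operands, but discard the result itself". The loop finds the end of the NULL-terminated envp array.

add $8,%rcx - Step to the next pointer after the end of envp, which is exec_path, the fourth parameter to main, though a little-known and even less-used one.
call _main - Finally, call main itself.
movl %eax,%edi - Load main's 4-byte return value as the first argument for a function call.
call _exit - Call exit(2), passing it the value returned by main. exit(2) never returns, so no instruction after this one should ever execute.
hlt - Just in case execution somehow gets here, "halt" the CPU. When executed by non-kernel code, hlt raises a privilege violation exception, which makes it a suitable cap for "you shouldn't get here". It's effectively a marker for "unreachable". On very old x86 processors, applications would call hlt to stop the CPU, but modern computers have other hardware that needs to be shut down properly, and a single instruction obviously isn't up to that. It can't turn off the power, for example.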
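The pointer arithmetic that start performs on the kernel frame can be mirrored in C. This is a minimal sketch that models the frame as an array of pointer-sized slots beginning at argc; the LaunchArgs struct and both function names are hypothetical, and the frame layout is taken from the walkthrough above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The four values start hands to main. Each pointer refers to a slot in
 * the frame; every slot holds a char * value or a terminating zero. */
typedef struct {
    uintptr_t        argc;
    const uintptr_t *argv;
    const uintptr_t *envp;
    const uintptr_t *apple;
} LaunchArgs;

/* `frame` points at the argc slot, i.e. 8(%rbp) in the listing. */
LaunchArgs parse_kernel_frame(const uintptr_t *frame)
{
    assert(frame != NULL);

    LaunchArgs args;
    args.argc = frame[0];                    /* movq 8(%rbp),%rdi       */
    args.argv = frame + 1;                   /* leaq 16(%rbp),%rsi      */

    /* addl $1 / sall $3 / addq: skip argc entries plus the NULL slot.
     * C pointer arithmetic scales by the slot size on its own; the
     * assembly has to do the multiplication by 8 explicitly. */
    args.envp = args.argv + (args.argc + 1);

    const uintptr_t *cursor = args.envp;     /* movq %rdx,%rcx          */
    while (*cursor != 0)                     /* cmpq $0,(%rcx) / jne    */
        cursor++;                            /* Lapple: add $8,%rcx     */
    args.apple = cursor + 1;                 /* add $8,%rcx (past NULL) */
    return args;
}

/* What andq $-16,%rsp computes: clear the low four bits. */
uintptr_t align_down_16(uintptr_t sp)
{
    return sp & ~(uintptr_t)15;
}
```

For a frame holding two arguments and one environment string, envp lands four slots past argc (one for argc itself, two for the arguments, one for the NULL), and the apple pointer lands two slots past envp.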

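Conclusion

There's no need to look at the rest of the sample code's disassembly; it contains nothing I haven't already covered elsewhere. If you can't work it out for yourself by now, I probably didn't explain it well enough! And so I hereby declare part 2 finished.

Since the first article was published, I've received several requests to explain these concepts in terms of the ARM architecture used by the iPhone and other iDevices. I haven't looked at ARM in this depth before, but I'm always happy to learn something new. So I've started studying the ARM architecture, and I'll write a third article in this series based on what I learn, using the same sample code. Until then, good luck, and I hope you've enjoyed my work!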

#Original (English)

Source: https://www.mikeash.com/pyblog/friday-qa-2011-12-23-disassembling-the-assembly-part-2.html

Today I have the pleasure to present the followup to last week’s guest post. Gwynne Raskind returns to complete her in-depth analysis of the assembly code generated by a small sample program.

In last week’s article, I discussed the x86_64 architecture and the disassembly of the main function of Mike’s example code. This is part 2, in which I look at the differences in optimized code, disassembly of the rest of the sample code, the start runtime function, and some functions that work with floating-point values. If you haven’t yet read part 1, I strongly recommend it, since otherwise part 2 won’t make much sense.

Optimization

In part 1, I purposely examined the unoptimized version of the assembly language produced by the compiler, under the theory that optimization would obscure the finer details of how the code works at the assembler level. Now it’s time to see what optimized code looks like. Here’s main in assembly again, this time compiled with -Os:

    _main:
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r12
        pushq   %rbx
        callq   _objc_autoreleasePoolPush
        movq    %rax, %r14
        movq    L_OBJC_CLASSLIST_REFERENCES_$_(%rip), %rdi
        leaq    l_objc_msgSend_fixup_alloc(%rip), %rsi
        callq   *l_objc_msgSend_fixup_alloc(%rip)
        movq    L_OBJC_SELECTOR_REFERENCES_27(%rip), %rsi
        leaq    L__unnamed_cfstring_26(%rip), %rdx
        movq    _objc_msgSend@GOTPCREL(%rip), %rbx
        movq    %rax, %rdi
        movl    $42, %ecx
        callq   *%rbx
        movq    %rax, %r15
        movq    L_OBJC_SELECTOR_REFERENCES_28(%rip), %rsi
        movq    %r15, %rdi
        callq   *%rbx
        movq    %rax, %rdi
        callq   _objc_retainAutoreleasedReturnValue
        movq    %rax, %rbx
        movq    %rbx, %rdi
        callq   _MyFunction
        movq    %rax, %rdi
        callq   _objc_retainAutoreleasedReturnValue
        movq    %rax, %r12
        movq    %rbx, %rdi
        callq   _objc_release
        leaq    L__unnamed_cfstring_23(%rip), %rdi
        movq    %r12, %rsi
        xorb    %al, %al
        callq   _NSLog
        movq    %r12, %rdi
        callq   _objc_release
        movq    %r15, %rdi
        callq   _objc_release
        movq    %r14, %rdi
        callq   _objc_autoreleasePoolPop
        xorl    %eax, %eax
        popq    %rbx
        popq    %r12
        popq    %r14
        popq    %r15
        popq    %rbp
        ret

The unoptimized version of main was 60 lines; this optimized code is only 49. The compiler managed to save 11 instructions. Expecting more is unreasonable; optimization, even when done for size savings, tends to be more concerned with making efficient use of the CPU and its abilities than using the absolute minimum number of instructions. On almost any modern processor, there is hugely more benefit in using a few extra simple instructions versus fewer instructions that are more complicated. Compiling with -O3, which optimizes heavily for speed over size, actually increases the code size to 65 instructions, mostly due to inlining.

Because I’ve already explained the meaning of all of the individual instructions involved (with one exception), in this breakdown I’ll look purely at groups of instructions and how the compiler has optimized each section.

    pushq %rbp
    movq %rsp, %rbp

Look familiar? It should; this is exactly the same instruction sequence main started with before. Nothing’s changed about the code which sets up the stack pointer; the stack frame has to be set up in a particular way and this is it (more on this later).

    pushq   %r15
    pushq   %r14
    pushq   %r12
    pushq   %rbx

Instead of a bunch of values being stored to the stack, the optimizer has chosen to save the values of several registers to the stack so they can be used as scratch space during the function. The x86_64 ABI specifies which registers are preserved across function calls and which can be freely used as scratch, and none of these are freely usable. Since registers are potentially thousands of times faster than the stack in some cases - in fact, the delay can stretch into the space of seconds if the stack happened to be paged out to disk! - it’s certain to be a win to use the stack once at the beginning and once at the end, and manipulate data in registers during the function’s execution.

    callq   _objc_autoreleasePoolPush
    movq    %rax, %r14

objc_autoreleasePoolPush takes no arguments and returns a simple integer value in rax. The optimizer saves the return value in r14 instead of spilling it to the stack.

    movq    L_OBJC_CLASSLIST_REFERENCES_$_(%rip), %rdi
    leaq    l_objc_msgSend_fixup_alloc(%rip), %rsi
    callq   *l_objc_msgSend_fixup_alloc(%rip)

Load the MyClass class object into rdi, load the address of l_objc_msgSend_fixup_alloc into rsi, and call the function. It’s much the same sequence as the unoptimized code, but without the stack use and all in one place.

    movq    L_OBJC_SELECTOR_REFERENCES_27(%rip), %rsi
    leaq    L__unnamed_cfstring_26(%rip), %rdx
    movq    _objc_msgSend@GOTPCREL(%rip), %rbx
    movq    %rax, %rdi
    movl    $42, %ecx
    callq   *%rbx
    movq    %rax, %r15

Load the [MyClass initWithName:number:] selector into rsi, load @"name" into rdx, load the address of objc_msgSend (by way of objc_msgSend@GOTPCREL) into rbx, load the return value from alloc into rdi, load 42 into ecx, call objc_msgSend through rbx, and save the return value (i.e. obj) in r15.

objc_msgSend@GOTPCREL? What in the world is that thing? Well, as it turns out, it’s more than meets the eye. @GOTPCREL is a relocation specifier rather than an instruction: it names the symbol’s slot in the global offset table (GOT), addressed relative to rip, and the movq normally loads the function’s address out of that slot. If you peek at the generated machine code with a disassembler, though, it turns out to not be a mov instruction at all, but rather a lea! When the function’s address can be calculated at link time, the linker is allowed to replace the load with an lea that computes the address directly. objc_msgSend appears to be one of the functions for which this is true, and optimization lets the compiler make the attempt.

In other words, when optimization is on, the compiler generates code that fetches the function’s address into a register once and calls through that register, instead of making every call go through the slower dynamic library stub.

Note: I’m not 100% sure of my facts on this one; I’d appreciate any insight anyone has on the specifics of @GOTPCREL.
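A rough way to picture what the optimized code does is sketched below in C. It assumes a hypothetical got_slot variable standing in for the GOT entry and a made-up fake_msg_send standing in for objc_msgSend; neither name comes from the real runtime. The point is only the shape of the code: read the pointer once, keep it in a local (rbx in the listing), and make every later call through it.

```c
/* Toy model of calling through a cached GOT entry.
   All names here are hypothetical stand-ins, not runtime APIs. */

typedef long (*send_fn)(long receiver, long selector);

/* Stand-in for objc_msgSend: combines its two arguments so the
   result shows which calls were made and in what order. */
static long fake_msg_send(long receiver, long selector)
{
    return receiver * 100 + selector;
}

/* Stand-in for the GOT slot that holds the function's address. */
static send_fn got_slot = fake_msg_send;

long send_two_messages(long receiver)
{
    send_fn cached = got_slot;            /* movq sym@GOTPCREL(%rip), %rbx */
    long first = cached(receiver, 27);    /* callq *%rbx */
    return cached(first, 28);             /* callq *%rbx again, no reload */
}
```

The second call reuses the pointer already held in cached, which is the C-level counterpart of the second callq *%rbx in the optimized main.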

    movq    L_OBJC_SELECTOR_REFERENCES_28(%rip), %rsi
    movq    %r15, %rdi
    callq   *%rbx

Load the name selector into rsi, load obj from r15 into rdi, and call objc_msgSend again. This is a case where optimization really begins to show its use. The unoptimized version of this call saved and loaded to and from the stack and other registers, effectively redoing the entire setup for the second message send. The optimizer recognizes that the extra data copying is redundant and just loads everything directly - and even more importantly, avoids loading data that’s already in a register into it again.

    movq    %rax, %rdi
    callq   _objc_retainAutoreleasedReturnValue

Grab the return value from the last message send and immediately pass it to objc_retainAutoreleasedReturnValue. This is the same sequence as the unoptimized code. In fact, the Objective-C runtime depends on these exact two instructions: when a callee hands back an object through objc_autoreleaseReturnValue, the runtime looks at the code at the return address, and if it finds this sequence it can skip the autorelease and the matching retain altogether.
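The runtime’s dependence on these two instructions comes down to inspecting machine code at a return address. As a sketch of what such a check could look like, the C below tests whether the bytes at an address begin with the encoding of movq %rax, %rdi followed by the opcode of a direct callq. The function name is made up, and as far as I know the real runtime does more than this (for one thing, it also has to confirm that the call targets objc_retainAutoreleasedReturnValue), so treat it as an illustration of the idea rather than the runtime’s implementation.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns true if the code at `return_address`
   starts with  48 89 c7  (movq %rax, %rdi)  followed by  e8
   (the opcode of callq with a 32-bit relative target). */
bool starts_with_retain_sequence(const void *return_address)
{
    static const uint8_t expected[4] = { 0x48, 0x89, 0xc7, 0xe8 };
    return memcmp(return_address, expected, sizeof expected) == 0;
}
```

A compiler that put anything between the two instructions, or moved rax somewhere other than rdi, would make the check fail, which is why the pair shows up unchanged in both the optimized and unoptimized listings.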

    movq    %rax, %rbx
    movq    %rbx, %rdi
    callq   _MyFunction
    movq    %rax, %rdi
    callq   _objc_retainAutoreleasedReturnValue
    movq    %rax, %r12

Call MyFunction(name), retain its return value, and save the result in r12. The extra store to rbx looks redundant, but it isn’t, as we’ll see further down.

    movq    %rbx, %rdi
    callq   _objc_release

See? Both rax and rdi have already been reused since [MyClass name]’s return value was saved off in rbx. Not redundant after all!

“But why didn’t the compiler just leave it in rbx to begin with?” Remember that the first parameter to a function must be in rdi. The value had to be saved somewhere that wasn’t about to be overwritten by the very next thing done, and rbx is a callee-saved register, so MyFunction is obliged to hand it back unchanged.

    leaq    L__unnamed_cfstring_23(%rip), %rdi
    movq    %r12, %rsi
    xorb    %al, %al
    callq   _NSLog

Call NSLog(@"%@", return value of MyFunction) with no vector registers used - remember that variadic functions require the number of vector registers used as parameters to be in al. Nothing special here.
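To make the al rule concrete, here is a small C sketch of a variadic function that does take floating-point arguments. The name sum_doubles is invented for illustration. Under the x86_64 calling convention used here, a call such as sum_doubles(2, 1.0, 2.0) passes its two doubles in xmm0 and xmm1, so the caller would put 2 in al (the ABI only asks for an upper bound) instead of zeroing it the way the NSLog call does.

```c
#include <stdarg.h>

/* Hypothetical example: adds up `count` double arguments. */
double sum_doubles(int count, ...)
{
    va_list args;
    double total = 0.0;

    va_start(args, count);
    for (int i = 0; i < count; i++) {
        /* The first eight doubles of a call arrive in xmm0-xmm7. */
        total += va_arg(args, double);
    }
    va_end(args);
    return total;
}
```

The callee uses the count in al to decide whether it needs to spill the vector registers at all, which is why a caller that passes no floating-point values can get away with the cheap xorb.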

    movq    %r12, %rdi
    callq   _objc_release
    movq    %r15, %rdi
    callq   _objc_release

Release both objects (return value of MyFunction and obj) that are no longer in use. Technically, obj was already unused at the time of NSLog, but ARC’s code flow analysis isn’t that aggressive; releases are done at the end of the enclosing scope, not the instant the value is no longer used.

Note: The return from [MyClass name] had an effective enclosing scope of the MyFunction call itself; it was never assigned to a variable (specifically, to a __strong variable), and therefore was not considered potentially “live” after the function call.

    movq    %r14, %rdi
    callq   _objc_autoreleasePoolPop
    xorl    %eax, %eax
    popq    %rbx
    popq    %r12
    popq    %r14
    popq    %r15
    popq    %rbp
    ret

Pop the autorelease pool, set eax to zero as the return value of main, restore the saved registers, and return.

And that is main in optimized code. The major effect of optimization visible here is much better utilization of registers; there’s not a single use of the stack except for register saving, and there’s not a single redundant or useless data copy anywhere to be found.

Do you think you can do better than the compiler did? It’s possible that other optimization opportunities exist, but most of the ones that seem immediately obvious are actually prohibited by the CPU, the ABI, or the way Objective-C and ARC work.

Hint: The push and pop of rbp, as well as the copy of rsp to rbp, are unnecessary, because the optimization removed all references to rbp in the function body! Without those three instructions, main would still work, but the debugger might not! The debugger relies in some cases upon the presence of stack frames, which include a properly initialized base pointer register and the saved value of the base pointer on the stack. Certain other system functions can potentially rely upon presence of a stack frame, though these rarely come up in normal use. On OS X, the switch which tells GCC and Clang to skip the use of stack frames is disabled by default even at high optimization, suggesting that someone thought it wasn’t worth saving three instructions per function. It probably isn’t. The system frameworks are built with stack frames intact, for example. In general, you should always include stack frames unless you have a good reason not to.
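For a sense of why a debugger cares, here is a simplified C model of the chain those three instructions build. The struct and function names are invented for this sketch, and a real debugger reads the same two words out of the target’s stack memory rather than out of C structs. Each frame stores the caller’s rbp next to the return address, so following the saved rbp values walks the call stack outward; the model assumes the outermost frame stores NULL.

```c
#include <stddef.h>

/* Hypothetical picture of what the standard prologue leaves behind. */
struct frame {
    const struct frame *saved_rbp;  /* what pushq %rbp stored: the caller's frame */
    const void *return_address;     /* what the caller's callq pushed */
};

/* Walks outward from the innermost frame, copying up to `max` return
   addresses into `out`. Returns how many were stored. */
size_t backtrace_frames(const struct frame *rbp, const void **out, size_t max)
{
    size_t depth = 0;
    while (rbp != NULL && depth < max) {
        out[depth++] = rbp->return_address;
        rbp = rbp->saved_rbp;       /* step to the caller's frame */
    }
    return depth;
}
```

A function that skips the prologue never links itself into this chain, so a walk like this one would jump straight past it.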

The MyFunction Function

Next, let’s look at the MyFunction function:

    NSString *MyFunction(NSString *parameter)
    {
        NSString *string2 = [@"Prefix" stringByAppendingString: parameter];
        NSLog(@"%@", string2);
        return string2;
    }

I’m going to take this function backwards. Instead of looking directly at the assembler the compiler produced, I’ll construct it myself using what we’ve already learned from main about how the compiler does its thing. This function doesn’t do anything that main didn’t, after all. For bonus points, we’ll even insert the necessary ARC calls.

Function prologue:

    _MyFunction:
        pushq %rbp
        movq %rsp, %rbp

Every C function has a prologue. See the discussion about stack frames above. This is the stack frame for our new function, along with its label, for completeness’ sake. All C function names are prefixed with an underscore at the assembler stage as a matter of platform convention. A look at the name table of any executable or library will show that almost all of the symbols have at least one leading underscore.

Save registers:

    pushq %rbx

We’ll only need one scratch register for this function, so let’s use rbx.

Call stringByAppendingString:

    movq %rdi, %rdx
    leaq L_prefix_string_reference(%rip), %rdi
    movq L_stringByAppendingString__selector_reference(%rip), %rsi
    callq *_objc_msgSend@GOTPCREL(%rip)
    movq %rax, %rdi
    callq _objc_retainAutoreleasedReturnValue

First, I make the assumption that the string @"Prefix" appears somewhere given the label L_prefix_string_reference, which I just made up. Label names are arbitrary; the compiler’s very official-looking names are just autogenerated. Even having L_ in front of them is just a convention I chose to follow to make it look more like the compiler’s version. Likewise, I assume that L_stringByAppendingString__selector_reference points to the appropriate selector name. From there, I move rdi to rdx. Since parameter, being the first parameter to MyFunction, was in rdi, I’ve now made it the third parameter to whatever I’m about to call. I load @"Prefix" as the first argument and the -stringByAppendingString: selector as the second, then call the rip-relative version of objc_msgSend. Finally, I take the return value and pass it to objc_retainAutoreleasedReturnValue, per ARC’s requirements. ARC functions only at the Objective-C compiler level; in assembler, it has to be invoked manually, like normal retain-release code but with stricter rules.

Call NSLog:

    movq %rax, %rsi
    leaq L_format_string_reference(%rip), %rdi
    xorb %al, %al
    callq _NSLog

I’ll tell you right now that this code is wrong in one important respect: Because I know I’ll need the return value from -stringByAppendingString: later, I’ve made the mistaken assumption that rax and rsi will not be changed by the call to NSLog. However, the x86_64 ABI explicitly specifies that neither register is preserved across function calls. We’ve already clobbered them several times during the course of this code without saving them, so we can hardly expect NSLog not to do the same. (Not only that, but the code itself zeroes out the low byte of rax as part of the call sequence!) The value in rax and rsi before this section of code must be preserved, or it will be lost during the call.

Note: Even if NSLog just so happened to preserve rsi, that’s not an assumption the calling code can make safely. The only time you can assume registers are preserved by a function outside the specification of the ABI is when you have written every line of that function yourself, in assembly language, and have documented the requirement so you don’t violate it later on.

The solution is to replace the first movq with these two lines:

    movq %rax, %rbx
    movq %rbx, %rsi

The value (known as string2 in the original Objective-C source) is now saved in rbx so we can use it. This is why I saved rbx at the beginning of the function.

Return from the function:

    movq %rbx, %rdi
    popq %rbx
    popq %rbp
    jmp _objc_autoreleaseReturnValue ## TAIL CALL

Whoa, whoa, wait, what’s all this? What’s a tail call?

In ARC mode, an object returned from a function not annotated as cf/ns_returns_retained must be passed to objc_autoreleaseReturnValue. Therefore, that has to be the very last thing the function does before returning.

“So,” you ask, “why not movq %rbx, %rdi, then callq _objc_autoreleaseReturnValue, and let rax keep that return value while you popq and ret?” Answer: Because it’s inefficient. When the very last thing a function does is return the identically-typed result of calling another function, a tail call can be used to save time, space, and effort. At the time of the first movq instruction, the stack looks something like this:

    +----------------+
    | RETURN ADDRESS | 16 <--- next instruction in main, pushed by `callq _MyFunction`
    |   Saved %rbp   | 8  <--- saved value of rbp, pushed by prologue
    |   Saved %rbx   | 0  <--- saved value of rbx, pushed by our code
    +----------------+

If I were to simply callq _objc_autoreleaseReturnValue, the stack would then look like this:

    +----------------+
    | RETURN ADDRESS | 24 <--- next instruction in main, pushed by `callq _MyFunction`
    |   Saved %rbp   | 16 <--- saved value of rbp, pushed by prologue
    |   Saved %rbx   | 8  <--- saved value of rbx, pushed by our code
    +----------------+
    | RETURN ADDRESS | 0  <---- next instruction in MyFunction, pushed by `callq _objc_autoreleaseReturnValue`
    +----------------+

When objc_autoreleaseReturnValue returned, the stack would be popped by the ret instruction and go back to exactly where it was, and then the same thing would immediately happen again. Wouldn’t it be more efficient if objc_autoreleaseReturnValue could return directly to main, since MyFunction has absolutely nothing left to do?

This is what a tail call does. Instead of using call, which pushes a new return address to the stack, MyFunction restores the stack to having only main’s return address, and then jumps directly to objc_autoreleaseReturnValue. The stack ends up looking like this:

    +----------------+
    | RETURN ADDRESS | 0  <---- next instruction in main, pushed by `callq _MyFunction`!
    +----------------+

Now, when the ret in objc_autoreleaseReturnValue pops a return address off the stack into rip, it’ll jump directly back to main, with rax containing the return value exactly as it should. We’ve saved a push, a pop, and less visibly, some extra work by the CPU. The jmp instruction is also potentially smaller than callq if it should happen that the target function is located nearby in memory.

Tail calls may look like a minor optimization from the assembly language point of view, but the savings of an entire extra stack frame can make or break recursive algorithms. Also, objc_msgSend is fundamentally designed around the use of a tail call; Cocoa programs would probably be something like an order of magnitude slower without them, and can you imagine loading a program in the debugger and seeing objc_msgSend before every single method call in the backtrace?
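The same idea is visible from C. In the sketch below (the function names are invented, not taken from the sample program), the recursive call is the very last thing sum_to does and its result is returned unchanged, so the compiler is free to emit a jmp instead of a callq. The C standard does not promise this, but Clang and GCC commonly do it when optimizing; without it, every level of recursion would need its own return address on the stack.

```c
/* Adds n + (n-1) + ... + 1 to acc; n must not be negative.
   The recursive call is in tail position: nothing is left to do
   after it returns. */
static long sum_to(long n, long acc)
{
    if (n == 0) {
        return acc;
    }
    return sum_to(n - 1, acc + n);   /* candidate for jmp instead of callq */
}

long triangular(long n)
{
    return sum_to(n, 0);             /* also a tail call */
}
```

Compiling something like this with and without optimization and comparing the assembler output is a quick way to see the callq/ret pair turn into a plain jump.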

If you look at Clang’s version of the assembler code, it’s almost exactly the same as ours! There are three exceptions:

Clang, of course, names the string and selector references differently.
Clang moves the parameters around in a slightly different order; this has no effect on the execution of the code.
For no immediately apparent reason, Clang saves the value of rax on the stack, only to ignore that value entirely in the function epilogue. What’s actually happening is that Clang is aligning the stack to a 16-byte boundary, as required by both SSE instructions in particular and Cocoa in general. This leads to a total of 32 bytes (an even multiple of 16) on the stack for the function: The return address for main, saved rbp, saved rbx, and saved rax. The requirement of stack alignment is sufficient to overcome the desire to save instructions; the code would be incorrect without that alignment, and probably crash the very next time objc_msgSend was called.
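The arithmetic behind that extra push is easy to check. The helpers below are a sketch with made-up names; they simply count 8-byte stack slots and show the usual mask for rounding an address down to a 16-byte boundary.

```c
#include <stdint.h>

/* Bytes occupied on the stack by a function's return address plus
   one 8-byte slot for each pushq it has executed. */
unsigned frame_bytes(unsigned pushes)
{
    return 8u * (1u + pushes);
}

/* Rounds an address down to a 16-byte boundary by clearing its low
   four bits. */
uint64_t align_down_16(uint64_t address)
{
    return address & ~UINT64_C(15);
}
```

With only rbp and rbx pushed, frame_bytes(2) is 24, which leaves the stack 8 bytes off; the third push of rax brings frame_bytes(3) to 32 and restores the alignment.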

Here, then, is the final version of the function as we’ve written it, in one chunk and including an aligned stack:

    _MyFunction:
        pushq %rbp
        movq %rsp, %rbp
        pushq %rbx
        pushq %rax # padding: keeps the stack 16-byte aligned
        movq %rdi, %rdx
        leaq L_prefix_string_reference(%rip), %rdi
        movq L_stringByAppendingString__selector_reference(%rip), %rsi
        callq *_objc_msgSend@GOTPCREL(%rip)
        movq %rax, %rdi
        callq _objc_retainAutoreleasedReturnValue
        movq %rax, %rbx # keep string2 in a callee-saved register
        movq %rbx, %rsi
        leaq L_format_string_reference(%rip), %rdi
        xorb %al, %al
        callq _NSLog
        movq %rbx, %rdi
        addq $8, %rsp # ignore the saved rax
        popq %rbx
        popq %rbp
        jmp _objc_autoreleaseReturnValue ## TAIL CALL

Simple Floating-Point

Next, I’ll look at a new function as a simple example of dealing with non-integer values. Here is the Objective-C version:

    float MyFPFunction(float parameter)
    {
        float x = parameter + 0.5;

        x -= 0.3f;
        return x;
    }

The line in which I call it:

    NSLog(@"%f", MyFPFunction(1.0));

And here is the assembler Clang produces:

    LCPI7_0:
        .long   1056964608              ## float 5.000000e-01
    LCPI7_1:
        .long   3197737370              ## float -3.000000e-01
    _MyFPFunction:                          ## @MyFPFunction
        pushq   %rbp
        movq    %rsp, %rbp
        addss   LCPI7_0(%rip), %xmm0
        addss   LCPI7_1(%rip), %xmm0
        popq    %rbp
        ret

(I’ve omitted the assembler for the actual function call, as it turns out to be extremely difficult to get Clang to actually emit such assembly under optimizing compilation without just inlining the function, and the unoptimized version is different. The only interesting note there in any case is the setting of al to 1 for the NSLog call, as it uses a vector register.)

The function is extremely simple:

A standard prologue comes first.
Then, since the ABI specifies that the first floating-point value is passed in the first vector register, xmm0, the function operates directly on that register. The addss instruction, in simple terms, adds two floating-point values (“add scalar single-precision”). The constants in the code, 0.5 and -0.3 (subtracting 0.3 is the same as adding -0.3), are stored as data in the executable, since neither assembly language nor the actual machine code has a way to express floating-point immediate values. The values themselves are stored as IEEE-754 single-precision numbers. It just so happens that a floating-point return value is also stored in the first vector register, so by operating directly on xmm0, the function has already done everything it needed to do.
Finally, a standard function epilogue.
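The two .long values in the listing can be checked from C. The sketch below (the helper names are invented) copies a 32-bit pattern into a float and back, assuming float is the 32-bit IEEE-754 single-precision type, as it is on x86_64.

```c
#include <stdint.h>
#include <string.h>

/* Reinterprets a 32-bit pattern as a float without converting it. */
float float_from_bits(uint32_t bits)
{
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* The reverse: returns the raw bit pattern of a float. */
uint32_t bits_from_float(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

float_from_bits(1056964608) gives 0.5f and float_from_bits(3197737370) gives -0.3f (more precisely, the float nearest to -0.3), matching the comments Clang wrote next to each constant.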

Wasn’t that simple? It turns out that the only thing you have to do to use floating-point values is switch to the 128-bit vector registers and the SSE instruction set. The old st(n) registers and the x87 instruction set are obsolete for this purpose (the mmx registers that share their storage only ever held integers), and x87 is also inefficient in comparison to SSE operations.

The C runtime

Some things are going on behind the scenes when you launch your program. Did you know that main isn’t the first function the system calls?

That’s right! Once dyld has finished setting up your process’ memory space, it branches to the standard entry point, a function called start which is copied verbatim from the C runtime library (libcrt) into your executable. It’s written in pure assembly and will not appear in Clang’s assembler output, as it doesn’t exist in your program until linking is done. Here’s a look at it. I’ve borrowed the source code from Apple’s website. Per the terms of the APSL under which the code is licensed, I’ve included the APSL license header in the code listing.

dyld sees the LC_UNIXTHREAD load command in your binary and sets up the CPU state accordingly for the new process. A quick glance at the output of otool -l tells us that the rip register is initialized to the load address of the start symbol in the binary image! Clever, no?

The start function consists of the following code:

    /*
     * Copyright (c) 1999-2008 Apple Inc. All rights reserved.
     *
     * @APPLE_LICENSE_HEADER_START@
     *
     * Portions Copyright (c) 1999 Apple Computer, Inc.  All Rights
     * Reserved.  This file contains Original Code and/or Modifications of
     * Original Code as defined in and that are subject to the Apple Public
     * Source License Version 1.1 (the "License").  You may not use this file
     * except in compliance with the License.  Please obtain a copy of the
     * License at http://www.apple.com/publicsource and read it before using
     * this file.
     *
     * The Original Code and all software distributed under the License are
     * distributed on an "AS IS" basis, WITHOUT WARRANTY OF ANY KIND, EITHER
     * EXPRESS OR IMPLIED, AND APPLE HEREBY DISCLAIMS ALL SUCH WARRANTIES,
     * INCLUDING WITHOUT LIMITATION, ANY WARRANTIES OF MERCHANTABILITY,
     * FITNESS FOR A PARTICULAR PURPOSE OR NON- INFRINGEMENT.  Please see the
     * License for the specific language governing rights and limitations
     * under the License.
     *
     * @APPLE_LICENSE_HEADER_END@
     */
    start:  pushq   $0          # push a zero for debugger end of frames marker
            movq    %rsp,%rbp       # pointer to base of kernel frame
            andq    $-16,%rsp       # force SSE alignment
            movq    8(%rbp),%rdi        # put argc in %rdi
            leaq    16(%rbp),%rsi       # addr of arg[0], argv, into %rsi
            movl    %edi,%edx       # copy argc into %rdx
            addl    $1,%edx         # argc + 1 for zero word
            sall    $3,%edx         # * sizeof(char *)
            addq    %rsi,%rdx       # addr of env[0], envp, into %rdx
            movq    %rdx,%rcx
            jmp Lapple2
    Lapple: add $8,%rcx
    Lapple2:cmpq    $0,(%rcx)       # look for NULL ending env[] array
            jne Lapple
            add $8,%rcx         # once found, next pointer is "apple" parameter now in %rcx
            call    _main
            movl    %eax,%edi       # pass result from main() to exit()
            call    _exit           # need to use call to keep stack aligned
            hlt

start doesn’t work like a C function, since it isn’t one. It’s intended specifically to transition from a bare-bones executable state to one that C (and Objective-C) can work in. Even the function prologue is unusual.

pushq $0 - Push a zero on the stack. This is used by the debugger as a marker for ‘end of stack frames’, replacing the pushq %rbp in a normal function’s prologue.
movq %rsp,%rbp - Grab hold of the stack pointer, since the stack is actually used in this function.
andq $-16,%rsp - Mask off the last four bits of the stack pointer. This aligns the initial stack to a 16-byte boundary, as SSE instructions and Cocoa in general require. It’s probably an effective no-op, as the system will tend to give a properly aligned stack already, but the C runtime doesn’t and can’t make that assumption.
movq 8(%rbp),%rdi - The ‘kernel frame’ the comment mentions above is what exists on the stack when dyld calls start. The first (topmost) value is the familiar argc parameter to main. Putting it in rdi sets it up as the first argument for a function call.
leaq 16(%rbp),%rsi - The second value on the stack is argv, so it’s now a second function parameter.
movl %edi,%edx - Grab the low 4 bytes of argc into rdx.
addl $1,%edx - Add 1 to the copy of argc.
sall $3,%edx - Multiply the value by 8 (shifting left by 3 is equivalent). edx now contains the entire size in bytes of the argv array, including its terminating NULL pointer.
addq %rsi,%rdx - Add the address of argv to the calculated size, yielding a pointer to the end of argv. Why is this happening? On OS X, the little-used envp array passed as a third parameter to main occupies the space in memory immediately following argv. The third function parameter is now envp.
movq %rdx,%rcx - Now copy envp to the fourth function parameter.
            jmp Lapple2
    Lapple: add $8,%rcx
    Lapple2:cmpq    $0,(%rcx)       # look for NULL ending env[] array
            jne Lapple

These four lines constitute a simple loop which increases the value of rcx by 8 until the memory location it points to contains zero. In C terms, treating rcx as a uint64_t *, this would be while (*rcx != 0) rcx++;. The initial jmp skips straight to the comparison, so the first pointer is tested before rcx is ever advanced. The jne instruction means “jump if not equal”, or equivalently, “jump if ZF is zero”. ZF was set by the previous instruction, cmp, which says “set rflags based on the result of subtracting the two operands, discarding the result itself”. This loop finds the end of the NULL-terminated envp array, leaving rcx pointing at the terminating NULL.
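To make the jump structure concrete, here is a rough C rendering of the same loop that uses goto so each statement lines up with one instruction. The function name skip_to_null and the label names are hypothetical, chosen only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Advance p until it points at a NULL entry, mirroring the four-line loop. */
static char **skip_to_null(char **p) {
    assert(p != NULL);
    goto check;           /*          jmp  Lapple2    */
next:
    p++;                  /* Lapple:  add  $8,%rcx    */
check:
    if (*p != NULL)       /* Lapple2: cmpq $0,(%rcx)  */
        goto next;        /*          jne  Lapple     */
    return p;             /* p now points at the NULL terminator */
}
```

Jumping to the test first is a common way to lay out a while loop in assembly: the body needs only one conditional branch per iteration, and an empty array falls straight through without advancing the pointer.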

addq $8,%rcx - Skip past the NULL that ends envp to the next pointer, which is exec_path, the fourth argument to main, though it is little known and even less often used.
callq _main - Finally, call main itself.
movl %eax,%edi - Load main’s 4-byte return value as the first parameter to a function call.
callq _exit - Call the exit(3) function, passing it the value returned from main. exit(3) never returns, so no instructions following this one should ever be executed.
hlt - Just in case somehow execution gets here anyway, “halt” the CPU. hlt will cause a privilege violation exception if executed by non-kernel code, so it makes a fitting “you should not be here” epilogue. It’s effectively the equivalent of “unreachable”. On very old x86 processors, an application would call hlt to stop the CPU, but with all the other hardware in a modern computer that needs to be shut down properly, a single instruction is simply inadequate to the purpose. It wouldn’t turn off the power, for example.
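Putting the whole walkthrough together, the argument setup in start amounts to something like the following C sketch. It operates on a simulated kernel frame (an array whose first slot holds argc) rather than the real stack, and the names struct main_args and parse_kernel_frame are hypothetical. It illustrates the layout described above and is not the C runtime's actual code.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical container for the four values start passes to main. */
struct main_args {
    int    argc;
    char **argv;
    char **envp;
    char **exec_path;   /* address of the slot after envp's NULL */
};

/* frame points at the argc slot, i.e. what 8(%rbp) addresses in start. */
static struct main_args parse_kernel_frame(char **frame) {
    struct main_args a;
    assert(frame != NULL);
    a.argc = (int)(intptr_t)frame[0];     /* movq 8(%rbp),%rdi   */
    a.argv = &frame[1];                   /* leaq 16(%rbp),%rsi  */
    a.envp = a.argv + a.argc + 1;         /* movl/addl/sall/addq */
    a.exec_path = a.envp;                 /* movq %rdx,%rcx      */
    while (*a.exec_path != NULL)          /* scan to envp's NULL */
        a.exec_path++;
    a.exec_path++;                        /* addq $8,%rcx        */
    return a;
}
```

With these four values in rdi, rsi, rdx and rcx, start calls main and hands its return value to exit.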

Conclusion

There’s no need to look at the rest of the sample code’s disassembly; there’s nothing in it that I haven’t already explored elsewhere. If you can’t make sense of it on your own by now, I’ve probably done a poor job of explaining! Therefore, I hereby mark the end of part 2.

I’ve gotten several requests since part 1 to explain these concepts in terms of the ARM architecture used by the iPhone and other iDevices. I haven’t worked with ARM at this level before now, but I’m always willing to learn new things. So I’ve started studying the ARM architecture, and I’ll be writing a part 3 to this series of articles based on what I learn and using the same sample code. Until then, good luck, and I hope you’ve enjoyed my work so far!