June 10th, 2022 @ justine's web page

Size Optimization Tricks

This blog post will cover some of the tricks I've used in the past to make can be used in c / c++ / python binaries smaller using x86 assembly. Much of it will revolve around the Cosmopolitan codebase, since I recently received feedback from the ELKS project that they love the code and want to hear more about how the tricks cosmo uses can potentially improve projects as intriguing as a i8086 Linux port. In many ways I feel a kinship with the ELKS project, since the first thing I had to do, to build Cosmopolitan, was write an i8086 bootloader called Actually Portable Executable. Plus it pleased me to hear that people who've been focusing on the problem a lot longer than I have are pleased with what they've read in Cosmopolitan so far. So I figured it'd be nice to share with a broader audience.

[Shinmyoumaru Sukuna]

Table of Contents

  1. Why It Matters
  2. Look at the Binary
  3. Field Arrangement
  4. Run Length Encoding
  5. Decentralized Sections
  6. Dead Code Elimination
  7. δzd Encoding
  8. Overlapping Functions
  9. Optimizing Printf
  10. Tiny Portable ELF
  11. Funding
  12. See Also

Why It Matters

I like the UNIX philosphy of having lots of small programs. I like it so much, that the Cosmopolitan repository builds hundreds of them.
$ git clone https://github.com/jart/cosmopolitan
$ cd cosmopolitan
$ make -j16 MODE=tiny
$ find o/tiny -name \*.com | wc -l
741

The whole repository takes about fifty seconds to build from scratch on my PC. Out of those 741 executables, 403 are test executables. It's nice to have each set of tests be in a separate executable, to reduce the chance of a bad test messing up the global state, and leading to hard to diagnose unrelated failures somewhere else. It also means htop can serve as a test status free dashboard. Each dot com file is a static binary which behaves sort of like a Docker container, since they're usually zip files too which vendor assets (e.g. tzinfo data). Did I mention each executable also runs on seven operating systems?

# testing process
for f in $(find o/tiny/test -name \*_test.com); do
  for os in freebsd openbsd netbsd rhel7 rhel5 win7 win10 xnu; do
    scp $f $os:
    ssh $os ./${f##*/}
  done
done

So all I have to do to test all 400 programs on all 8 supported systems (3,200 tests total!) is simply scp them onto each host and run them. That takes about 15 seconds thanks to runit and runitd. It's been a big productivity boost and enables test-driven development (also known as TDD) with rapid feedback.

What makes this workflow manageable in that binaries are small. The tinier something is, the more instances we can have of it at scale. So it really isn't just about running on old computers. It's not necessarily about code golfing. It's about having infinitely more of a good thing on the cheap with a pleasant coding lifestyle. Here's some concrete examples of how small actually portable executables built with Cosmopolitan Libc can be:

 16K o/tiny/examples/hello2.com             # calls write  (links syscall)
 20K o/tiny/examples/hello.com              # calls fputs  (links stdio)
 24K o/tiny/examples/hello3.com             # calls printf (links fmt+stdio)
100K o/tiny/test/libc/calls/access_test.com # links test library
216K o/tiny/tool/net/redbean-original.com   # http/1.1 server
1.8M o/tiny/examples/pyapp/pyapp.com        # statically linked python app

Would my isolated hermetic testing workflow have been possible if I was using a language that needs at minimum 3mb to build a binary? My LAN only has gigabit ethernet. I don't need to work at a big company with industrial fabric switches to transfer these ~100kb test binaries over the wire. Let's do some back of the envelope calculations:

# theoretic minimum seconds transferring cosmo test binaries w/ 1gbps lan uncompressed
>>: (400*8 * 16*1000) / (1024*1024*1024/8)
0.3814697265625

# theoretic minimum seconds transferring go test binaries w/ 1gbps lan uncompressed
>>: (400*8 * 3*1000*1000) / (1024*1024*1024/8)
71.52557373046875 

That's a lot of time to be spending transfering files. Tiny isn't just about scale, but surviving long enough to scale. Cosmopolitan is a scrappy operation. The repo right now only has about 1.5 million lines of code. What I'm thinking about is how we're going to host the next billion lines of code, while everyone else is drowning in technical debt. Tiny is about the efficient use of resources. Big things must have small beginnings, because no one wants to build an empire of code on a tooling foundation that ravages their budget with cloud provider fees. Altavista ran an entire search engine on a single computer, and now you're telling me we need to plug our Full Story Elecron IDEs into a Kubernetes cluster to compile a C++ app that does math. How does that add up?

[ABC - Always Be Coding]

Losers always whine about how their bloat is a calculated tradeoff. The important thing to consider is that there is no tradeoff. Just engineers who've been victimized by the accidental complexity of modern software, chose to stop caring, and yearn for another break while their code compiles—like Dennis from Jurrasic Park. Bloat is like the fake jobs version of scalability, in the sense that bloat offers hungry devs the thrill of complexity without the advantages. I used to operate a storage system with exabytes of data, and it honestly didn't feel that different from what I'm doing now, working on the tiniest software in the world. It's the stuff in the middle I just don't find as appealing.

In any case, as an avid reader of codebases, bloat is unpleasant. I think codebases have been lacking in a woman's touch for decades; and that especially applies to open source, which has never really had a woman's touch at all. We can fix that, but overall, the best way to improve software development is to make it more fun. There's few things more fun than size coding.

Look at the Binary

I think the biggest disadvantage of anyone looking to size optimize systemically is the difficulty of intuitive thinking. Especially for guys who use traditional hex editors. Here's what run-length encoded data looks like:

0002d0b0  01 00 01 03 01 06 01 09  01 0c 01 0f 01 12 14 1e  |................|
0002d0c0  01 15 04 1e 01 18 17 1e  01 1e 1c 1e 01 1b 00 00  |................|
0002d0d0  21 01 01 00 02 01 01 00  01 01 09 00 01 01 0a 00  |!...............|
0002d0e0  01 01 01 00 01 01 01 00  03 01 1a 00 04 01 01 00  |................|
0002d0f0  01 01 1a 00 03 01 01 00  81 01 00 00 00 00 00 00  |................|
0002d100  21 01 01 00 02 01 01 00  01 01 16 00 01 01 01 00  |!...............|
0002d110  01 01 1c 00 04 01 01 00  01 01 1a 00 03 01 01 00  |................|
0002d120  81 01 00 00 00 00 00 00  21 01 01 00 02 01 01 00  |........!.......|
0002d130  01 01 09 00 01 01 0c 00  01 01 01 00 03 01 1a 00  |................|

Hex editors are a really good tool when you're looking for a specific thing and need a specific answer, but they reveal little visually about the shape and form of a binary. You're not actually looking. You're just studying empirical data with a razor sharp focus that can easily become tunnel-vision. Blinkenlights solves this problem by using IBM Code Page 437. If we use it to visualize the same run-length encoded data above, and it's able to give us a better glance.

[run-length encoded data in blinkenlights]

Please keep in mind, this is a purely intuitive display. It's going to look like gobbledygook to anyone who's unfamiliar with CP437, but surely anyone can agree it's better than a wall of period marks, in terms of the richness and complexity of the information it succinctly conveys. CP437 can be thought of as an alphabet with 256 letters. You can scroll through pages and pages of this stuff and your mind will magically spot and identify patterns. For example, you'll gain an intuition for what data does. For example, here's an indirect jump table:

[jumptab.png]

Here's what a struct looks like when it's compiled with __attribute__((__packed__)):

[packed.png]

Here's what x86-64 code looks like:

[x86-64-code.png]

Here's what /dev/urandom looks like:

[urandom.png]

And here's what binaries built with UBSAN (Undefined Behavior Sanitizer) look like. As we can see, it's got plenty of room for improvement, and obviously explains why UBSAN binaries are so enormous. It's poor struct packing.

[ubsan-bing.png]

Cosmopolitan is an ex nihilo codebase. I started this project with an empty file and an assembler. Pretty much every byte that's gone into APE binaries since then, I've had some role in either creating and/or monitoring. As such, the two very first tools I wrote with Cosmopolitan Libc were intended to help me do just that.

-rwxr-xr-x 1 501 jart 52K Jun  9 09:08 bing.com (see bing.c)
-rwxr-xr-x 1 501 jart 32K Jun  9 09:08 fold.com (see fold.c)

I use them with the following shell script wrapper named ~/bin/bf.

#!/bin/sh
bing.com <"$1" |
  fold.com |
  exec less

So whenever I want to take a peek inside a file, I just type a command like:

bf o//examples/hello.com

I've written a few other scripts that come in handy. For instance, here's ~/bin/bloat which shows which symbols in a .com.dbg file are the biggest.

#!/bin/sh
nm -C --size "$@" |
  sort -r |
  grep ' [bBtTRr] ' |
  exec less

Field Arrangement

Struct packing is a nice wholesome size optimization that's not the least bit controversial. One good example is something I once noticed about Python 3.6 when viewing it through bing | fold. A core Python parser struct has suboptimal field arrangement.

typedef struct {
    int		 s_narcs;
    arc		*s_arc;
    int		 s_lower;
    int		 s_upper;
    int		*s_accel;
    int		 s_accept;
} state;

I wouldn't have noticed this if I was reading the code, since sometimes the stuff underneath the iceberg isn't always clear. That struct above, is actually equivalent to the following at the binary level:

typedef struct {
    int		 s_narcs;
    int		 __pad1;   // four wasted bytes
    arc		*s_arc;
    int		 s_lower;
    int		 s_upper;
    int		*s_accel;
    int		 s_accept;
    int		 __pad2;   // four wasted bytes
} state;

So I moved the s_accept field to a different line, and it shaved 4kb off every Python binary.

typedef struct {
    int		 s_narcs;
    int		 s_accept;
    arc		*s_arc;
    int		 s_lower;
    int		 s_upper;
    int		*s_accel;
} state;

Everything counts in small amounts. I also imagine there's some corporate styleguides at various companies where the latter code would be preferred, strictly for policy reasons. Since it's not great from a readability standpoint to have people scratching their heads wondering where the gaps are in your data structures, if they need assurances at the application binary interface level. So we're not only saving space but leading to better code hygeine too.

Run Length Encoding

There are so many outstanding compression algorithms in the news, based on machine learning and even classical algorithms, that I imagine many self-taught programmers might not be familiar with the really simple primitive ones that, in certain cases, get the job done. One such example is run-length encoding. The way it works, is if you have a sequence such as:

1,1,1,1,1,1,1,1,2,2,2,3

Then you'd encode it as sequence of {count, thing} tuples with a {0,0} terminator.

8,1, 3,2, 1,3, 0,0

What I love about run-length encoding is (1) it's so simple that the compressed data stream can be edited by hand; and (2) its decompressor only requires 14 bytes of code.

/	Fourteen byte decompressor.
/
/	@param	di points to output buffer
/	@param	si points to uint8_t {len₁,byte₁}, ..., {0,0}
/	@arch	x86-64,i386,i8086
/	@see	libc/nexgen32e/rldecode.S
rldecode:
31 c9	xor	%ecx,%ecx
ac0:	lodsb
86 c1	xchg	%al,%cl
ac	lodsb
e3 05	jrcxz	2f
aa1:	stosb
e2 fd	loop	1b
eb f5	jmp	0b
c32:	ret
	.type	rldecode,@function
	.size	rldecode,.-rldecode
	.globl	rldecode

That particular example is beautiful, since it's binary compatible with i8086, i386, and x86-64. However assembly doesn't always mean fast. For better explainability and performance, the following C code should do the trick:

struct RlDecode {
  uint8_t repititions;
  uint8_t byte;
};

void rldecode2(uint8_t *p, const struct RlDecode *r) {
  for (; r->repititions; ++r) {
    memset(p, r->byte, r->repititions);
    p += r->repititions;
  }
}

Run-length encoding is only appropriate for compressing sparse data. For denser data, RLE, will double its size! However it just so happens that, when we're making binaries smaller, there's usually many sparse data structures that can be efficiently run-length encoded.

One such example are the character translation tables used by the redbean web server, when parsing URIs and HTTP messages. One thing most C/C++ developers figure out eventually is that it's really efficient to perform a character lookup in a table with 256 entries. Why? Because your C compiler will turn C code like rdi[al & 255] into assembly that looks like this:

	movzbl	%al,%eax
	mov	(%rdi,%rax),%al

That's super fast, and memory safe too. Hardware engineers would call it a LookUp Table or LUT for short. It's quite integral to the workings of microprocessors. The x86 architecture even has a deprecated instruction called XLAT that's dedicated to this very thing. With HTTP, the most important use case for LUTs is to check if characters are legal. For example, many of the components of an HTTP message are required to be "tokens" quoth RFC2616:

CHAR           = <any US-ASCII character (octets 0 - 127)>
SP             = <US-ASCII SP, space (32)>
HT             = <US-ASCII HT, horizontal-tab (9)>
CTL            = <any US-ASCII control character
                 (octets 0 - 31) and DEL (127)>
token          = 1*<any CHAR except CTLs or separators>
separators     = "(" | ")" | "<" | ">" | "@"
               | "," | ";" | ":" | "\" | <">
               | "/" | "[" | "]" | "?" | "="
               | "{" | "}" | SP | HT

As plain C code, we could write our token LUT as follows:

const char kHttpToken[256] = {
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x00
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x10
   0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, // 0x20  ! #$%&‘  *+ -. 
   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, // 0x30 0123456789      
   0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x40  ABCDEFGHIJKLMNO
   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, // 0x50 PQRSTUVWXYZ   ^_
   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x60 `abcdefghijklmno
   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, // 0x70 pqrstuvwxyz | ~ 
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x80
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x90
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0xa0
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0xb0
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0xc0
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0xd0
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0xe0
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0xf0
};

That way we can just say kHttpToken[c & 255] to check if a character is valid in a token. But if we wanted to save 200 bytes of binary size, we could use run-length encoding.

	.globl	kHttpToken
	.section .bss.sort.100.kHttpToken,"aw",@nobits
/	@see	net/http/khttptoken.S
kHttpToken:
	.zero	256
	.section .rodata.sort.100.kHttpToken,"a",@progbits
	.byte	 33,0                   # 00-20 ∅-␠
	.byte	  1,1                   # 21-21 !-!
	.byte	  1,0                   # 22-22 “-“
	.byte	  5,1                   # 23-27 #-‘
	.byte	  2,0                   # 28-29 (-)
	.byte	  2,1                   # 2a-2b *-+
	.byte	  1,0                   # 2c-2c ,-,
	.byte	  2,1                   # 2d-2e --.
	.byte	  1,0                   # 2f-2f /-/
	.byte	 10,1                   # 30-39 0-9
	.byte	  7,0                   # 3a-40 :-@
	.byte	 26,1                   # 41-5a A-Z
	.byte	  3,0                   # 5b-5d [-]
	.byte	 29,1                   # 5e-7a ^-z
	.byte	  1,0                   # 7b-7b {-{
	.byte	  1,1                   # 7c-7c |-|
	.byte	  1,0                   # 7d-7d }-}
	.byte	  1,1                   # 7e-7e ~-~
	.byte	129,0                   # 7f-ff ⌂-λ
	.byte	0,0                     # terminator
	.section .init.sort.100.kHttpToken,"a",@progbits
	call	rldecode

There's a tool in the Cosmopolitan codebase for automatically generating these assembly files.

o//tool/build/xlat.com -TiC ' ()<>@,;:\"/[]?={}' -iskHttpToken

Decentralized Sections

The assembly code in the previous section that's generated by xlat.com might seem a bit strange, even for experienced x86 assembly programmers. It's a technique somewhat uniquely resurrected for Cosmopolitan, based on a reading of old code from the 70's and 80's. We'll call it "decentralized sections".

Decentralized sections are a way by which we can write a single function whose code spans multiple files. The canonical example of this, is the classic UNIX _init() function, which you'll find empty in nearly every modern codebase, because _init() is somewhat of a lost art intended to be assembled by the linker script.

The way this works is your libc defines the stubs for the prologue and epilogue of the _init() function. Here's how Cosmopolitan Libc defines it (with some slight edits to reduce macro usage in order to improve copy/pastability):

	.section .init_prologue,"ax",@progbits
	.globl	_init
/	@param	r12 is argc (still callee saved)
/	@param	r13 is argv (still callee saved)
/	@param	r14 is envp (still callee saved)
/	@param	r15 is envp (still callee saved)
/	@note	rdi is __init_bss_start (callee monotonic lockstep)
/	@note	rsi is __init_rodata_start (callee monotonic lockstep)
/	@see	libc/runtime/init.S
_init:	push	%rbp
	mov	%rsp,%rbp
	lea	__init_bss_start(%rip),%rdi
	lea	__init_rodata_start(%rip),%rsi
	.previous/*
	...
	decentralized content
	...
	*/.section .init_epilogue,"ax",@progbits
_woot:	leave
	ret
	.previous

Your libc also needs to declare stubs for the data sections. These technically could be declared in the linker script, but it's nicer to write them in a .S file so the symbols don't get defined unless they're being linked.

	.section .init_rodata_prologue,"a",@progbits
	.globl	__init_rodata_start,__init_rodata_end
	.align	__SIZEOF_POINTER__
__init_rodata_start:
	.previous/*
	...
	decentralized content
	...
	*/.section .init_rodata_epilogue,"a",@progbits
__init_rodata_end:
	.byte	0x90
	.previous

	.section .init_bss_prologue,"aw",@nobits
	.globl	__init_bss_start,__init_bss_end
	.align	__SIZEOF_POINTER__
__init_bss_start:
	.previous/*
	...
	decentralized content
	...
	*/.section .init_bss_epilogue,"aw",@nobits
__init_bss_end:
	.byte	0
	.previous

We then configure our linker script as follows:

/* see ape/ape.lds */
SECTIONS {
  .text . : {
    KEEP(*(.init_prologue))
    KEEP(*(SORT_BY_NAME(.init.sort.*)))
    KEEP(*(.init))
    KEEP(*(.init_epilogue))
    *(.text .text.*)
    KEEP(*(SORT_BY_NAME(.rodata.sort.*)))
    *(.rodata .rodata.*)
  }
  .bss . : {
    KEEP(*(SORT_BY_NAME(.bss.sort.*)))
    *(SORT_BY_ALIGNMENT(.bss))
    *(SORT_BY_ALIGNMENT(.bss.*))
    *(COMMON)
  }
}

It's now possible to write assembly files that insert content into the _init() function, along with the concomitant __init_rodata_start and __init_bss_start data structures. Both the read-only data and the uninitialized data (BSS) sections are optional. The only thing that's mandatory is that, unlike the normal System V Application Binary Interface, our _init() code isn't allowed to clobber RDI and RSI. Because they store lockstep pointers to the rodata and bss.

Here's an example than of how it can be used, that's simpler than the kHttpToken example above. Let's say I need to support the AMD K8 aand Barcelona microarchitectures, but I want to take advantage of SSSE3 (which is the best SSE). The industry practice here is to perform a CPUID check behind a double-checked pthread_once() guard, because the CPUID instruction takes 50 nanoseconds each time, whereas the once guard averages out over many calls to cost half a nanosecond. However the _init() function provides a better solution, since it obfuscates the need for synchronization libraries, due to how _init() is called before all other constructors, thereby making it highly unlikely that any threads to have been created.

	.section .bss.sort.100.kCpuid1,"aw",@nobits
/	uint32_t kCpuid1[4];
/	@see	libc/nexgen32e/kcpuids.S
kCpuid1:.long	0,0,0,0
	.globl	kCpuid1
	.type	kCpuid1,@object
	.size	kCpuid1,.-kCpuid1

	.section .init.sort.100.kCpuid1,"a",@progbits
53	push	%rbx
6a 01	push	$1
58	pop	%rax
0f a2	cpuid
ab	stosl
93	xchg	%eax,%ebx
ab	stosl
91	xchg	%eax,%ecx
ab	stosl
92	xchg	%eax,%edx
ab	stosl
5b	pop	%rbx

The code above only adds 14 bytes of content to the binary. It scales in terms of development model; any number of libraries throughout your codebase can follow this process without stepping on each others toes. As a model, decentralized sections really gives us a sweet spot that caters to the gigantic codebases big companies respect, while creating the tiny binaries indie developers adore. It's also much lighter weight than needing to pull in a dependency on -lpthread. Now, when I want to check if SSSE3 is available, all I have to do is check if the 9th bit of the ECX we stored earlier is set:

if (kCpuid1[2][1<<9]) {
  // we have ssse3!
} else {
  // it's k8 or barcelona
}

If you're using a different libc from cosmo (e.g. musl, glibc) then you can still use this same technique, today, with some slight alteration.

	.bss
kCpuid1:.long	0,0,0,0
	.globl	kCpuid1
	.type	kCpuid1,@object
	.size	kCpuid1,.-kCpuid1

	.section .init,"a",@progbits
53	push	%rbx
48 8d 3d 00 00 00 00	lea	kCpuid1(%rip),%rdi
6a 01	push	$1
58	pop	%rax
0f a2	cpuid
ab	stosl
93	xchg	%eax,%ebx
ab	stosl
91	xchg	%eax,%ecx
ab	stosl
92	xchg	%eax,%edx
ab	stosl
5b	pop	%rbx

Here we need to pay for 7 additional bytes (21 bytes total), because the lockstep cooperation between _init() / rodata / bss is unique to Cosmo and not implemented by other C libraries. But in either case, both assembly solutions are a clear win over the C/C++ equivalent, which needs 42 bytes.

uint32_t kCpuid1[4];

__attribute__((__constructor__)) static void kCpuid1Init() {
  uint32_t ax, bx, cx, dx;
  asm("cpuid" : "=a"(ax), "=b"(bx), "=c"(cx), "=d"(dx) : "0"(1));
  kCpuid1[0] = ax;
  kCpuid1[1] = bx;
  kCpuid1[2] = cx;
  kCpuid1[3] = dx;
}

If there's one thing I've learned working on Cosmopolitan, it's that with the right finesse, the C compiler can produce really tiny code if we're willing to write assembly inside string literals. The following code shows how to get GCC and Clang to do it in 30 bytes. Whether or not it's worth it versus just writing assembly is up to the reader.

uint32_t kCpuid1[4];

__attribute__((__constructor__)) static void kCpuid1Init() {
  uint32_t ax, *di;
  asm volatile("cpuid\r\n"
               "stosl\r\n"
               "xchg\t%%ebx,%%eax\r\n"
               "stosl\r\n"
               "xchg\t%%ecx,%%eax\r\n"
               "stosl\r\n"
               "xchg\t%%edx,%%eax\r\n"
               "stosl"
               : "=a"(ax), "=D"(di)
               : "0"(1), "1"(kCpuid1)
               : "rbx", "rcx", "rdx", "memory");
}

The main thing that makes the C code clunkier is it needs to generate an 8 byte function pointer in a .ctors section which needs to be added to the linker script separately. Your C library will normally call ctors after _init() is finished. There really aren't any acceptable solutions for getting a modern C compiler to generate "decentralized section" style code. The closest we can get is the following, using GCC (but certainly not Clang, which claims dominion over all registers).

register long di asm("rdi");
register long si asm("rsi");

__attribute__((__section__(".bss.sort.100.kCpuid1,\"aw\",@nobits #")))
uint32_t kCpuid1[4];

__attribute__((__used__,
               __section__(".init.sort.100.kCpuid1,\"a\",@progbits #")))
static void kCpuid1Init(void) {
  uint32_t ax;
  asm volatile("push\t%%rbx\r\n"
               "cpuid\r\n"
               "stosl\r\n"
               "xchg\t%%ebx,%%eax\r\n"
               "stosl\r\n"
               "xchg\t%%ecx,%%eax\r\n"
               "stosl\r\n"
               "xchg\t%%edx,%%eax\r\n"
               "stosl\r\n"
               "push\t%%rbx"
               : "=a"(ax), "+D"(di)
               : "0"(1)
               : "rcx", "rdx", "memory");
  __builtin_unreachable();  // prevent generation of RET instruction
}

The problem with the above code is it's brittle. You need to pass compiler flags like -ffixed-rdi -ffixed-rsi -fomit-frame-pointer -O -fno-sanitize=all for it to work properly. The value of a C compiler is being able to bloat our code with debugging tools (like ASAN) when we want them. So with code like this, where that can't happen, it's usually best to just write assembly.

Dead Code Elimination

Naturally, if you're a Cosmopolitan Libc user, all the stuff in the CPUID examples above would would be taken care of for you, and you need to say is this:

// see libc/nexgen32e/x86feature.h
if (X86_HAVE(SSSE3)) {
  // we have ssse3!
} else {
  // it's k8 or barcelona
}

It's worth taking a look at its implementation in x86feature.h since it's quite possibly the most beautiful abstraction created with the C preprocessor you're likely to find in the entire codebase. It also embodies one of the most important patterns of Cosmopolitan macros, which is the hybridization of compile-time and runtime checking. It's needed by Actually Portable Executable (which relies heavily on runtime dispatch) while still granting all the traditional compile-time define flag tuning that developers love. For example, the #ifdef model could be implemented here as follows:

#if X86_REQUIRE(SSSE3)
  // we have ssse3!
#else
  // it's k8 or barcelona
#endif

In that case, your program can only support SSSE3 if the user passes the -mssse3 flag to the compiler. It's better to use the X86_HAVE() macro because it does the same thing as X86_REQUIRE(SSSE3) except it has a runtime check too. By default, if you pass no microarchitecture flags to GCC or Clang, both branches get included in the binary.

if (X86_HAVE(SSSE3)) {
  // do some wild ssse3 assembly hacks
} else {
  // fallback to slow ansi c code
}

But if the user chooses to pass the -mssse3 flag, then the fallback bloat for supporting old CPU models gets removed by a compiler optimization pass, along with the runtime check itself.

if (X86_HAVE(SSSE3)) {
  // do some wild ssse3 assembly hacks
} else {
  // fallback to slow ansi c code
}

This is what MODE=opt does with the Makefile build config by default. It passes the -march=native flag to the compiler, which asks it to figure out what features the CPU in your host machine has, and then Cosmopolitan and GCC work together to produce a binary that's just right and optimally fast your machine. The tradeoff is you won't be able to distribute any such binaries, since they might not run on other x86 CPU models.

The basic idea behind how the hybrid model to dead code elimination and runtime branching works, is most simply and elegantly expressed in ape/loader.c and libc/dce.c.

#define LINUX   1
#define METAL   2
#define WINDOWS 4
#define XNU     8
#define OPENBSD 16
#define FREEBSD 32
#define NETBSD  64

#define SupportsLinux()   (SUPPORT_VECTOR & LINUX)
#define SupportsXnu()     (SUPPORT_VECTOR & XNU)
#define SupportsFreebsd() (SUPPORT_VECTOR & FREEBSD)
#define SupportsOpenbsd() (SUPPORT_VECTOR & OPENBSD)
#define SupportsNetbsd()  (SUPPORT_VECTOR & NETBSD)

#define IsLinux()   (SupportsLinux() && os == LINUX)
#define IsXnu()     (SupportsXnu() && os == XNU)
#define IsFreebsd() (SupportsFreebsd() && os == FREEBSD)
#define IsOpenbsd() (SupportsOpenbsd() && os == OPENBSD)
#define IsNetbsd()  (SupportsNetbsd() && os == NETBSD)

So one of the flexibilities of APE is the build can be tuned like this:

make MODE=tiny CPPFLAGS+=-DSUPPORT_VECTOR=0b00000001

If you want a 4kb hello world binary that only runs on Linux, and leaves out that hefty 12kb of bloat that's normally included to support Windows, MacOS, FreeBSD, NetBSD, OpenBSD, and booting from BIOS on bare metal. One sweet spot is to only target the ELF operating systems. Simply set the appropriate bits for Linux + BSD.

make MODE=tiny CPPFLAGS+=-DSUPPORT_VECTOR=0b01110001

δzd Encoding

One of the most size optimized pieces of code in the Cosmopolitan codebase is Python. This is where we really had to go beyond code golfing and pull out the artillery when it comes to size coding. One of the greatest weapons in our arsenal (which helped us get statically linked Python binaries down to 2mb in size) is what we call the Delta Zig-Zag Deflate compression scheme, or δzd for short.

This isn't a generalized compression algorithm like DEFLATE, which employs Huffman coding (along with an RLE very similar to LZ4). It's only profitable on certain kinds of executable content and while programming we need to use our best judgement to figure out which technique is most appropriate. But δzd in particular has produced some considerable gains.

Consider the following dilemma. The Chinese, Korean, and Japanese languages are enormous. They don't even all agree on UTF-8 yet, and have a variety of one-off character sets, all of which are supported by Python. The UNICODE standard and its information tables also need to be enormous too, in order to support CJK. Most people outside those three countries don't need this information in their binaries, but we like to keep it there anyway, in the interest of inclusiveness and enabling languages like Python to erect a big tent that welcomes everyone.

By using δzd packing, we can make these lookup tables significantly tinier, thereby ensuring that the user only needs to pay for the features they actually use. One of the many symbols in the Python codebase that needed this treatment is _PyUnicode_PhrasebookOffset2 which was originally 178kb in size, but apply applying delta encoding, zig-zag encoding, and then finally DEFLATE, we got it down to 12kb. That's a nearly 10x advantage over DEFLATE's Huffman encoding alone, and BZIP2's Burrows-Wheeler transform!

_PyUnicode_PhrasebookOffset2: size         is      178,176 bytes
_PyUnicode_PhrasebookOffset2: packed size  is      100,224 bytes
_PyUnicode_PhrasebookOffset2: rle size     is      282,216 bytes
_PyUnicode_PhrasebookOffset2: deflate size is       52,200 bytes
_PyUnicode_PhrasebookOffset2: bz2 size     is       76,876 bytes
_PyUnicode_PhrasebookOffset2: δleb size    is       47,198 bytes
_PyUnicode_PhrasebookOffset2: δzd size     is       12,748 bytes

In third_party/python/Tools/unicode/makeunicodedata.py you can read more about how we regenerated these tables. The Python implementation for Delta-ZigZag-DEFLATE encoding is also as follows:

def uleb(a, x):
    while True:
        b = x & 127
        x >>= 7
        if x:
            a.append(b | 128)
        else:
            a.append(b)
            break

def zig(x):
    m = (2 << x.bit_length()) - 1
    return ((x & (m >> 1)) << 1) ^ (m if x < 0 else 0)

def zleb(a, x):
    return uleb(a, zig(x))

def δzd(data):
    n = 0;
    i = 0
    p = 0
    a = bytearray()
    for x in data:
        zleb(a, x - p)
        p = x
    return deflate(a), len(a)

You can read the C code that unpacks the Python compressed data in libc/x/xloadzd.c, libc/fmt/unzleb64.c, libc/runtime/inflate.c, and third_party/zlib/puff.c. Puff is particularly nice. It's a size optimized (but slower) version of zlib that Mark Adler wrote that only does DEFLATE decompression, and it has a binary footprint of 2kb. One thing Cosmopolitan Libc core libraries do quite frequently (since they can't depend on things like malloc) whenever the need arises for proper DEFLATE decompression, is strongly link Puff, and weakly link zlib (which is more on the order of 60kb in size) so that Puff is used by default, unless the faster zlib library is strongly linked by something else.

Overlapping Functions

If high-level programming languages like C are the Ice Hotel and assembly is the tip of the iceberg, then the hidden dimension of complexity lurking beneath would be Intel's variable length encoding. This is where boot sectors get esoteric real fast, since tools can't easily visualize it. for example, consider the following:

/	%ip is 0
	mov	$0xf4,%al
	ret
/	%ip is 1
	.byte	0xb0
wut:	hlt # and catch fire
	ret

Similar to how a Chess game may unfold very differently if a piece is moved to an unintended adjacent square, an x86 program can take on an entirely different meaning if the instruction pointer becomes off by one. We were able to use this to our advantage, since that lets us code functions in such a way that they overlap with one another.

/	SectorLISP code.
89 D6 Assoc:	mov	%dx,%si
8B 3C 1:	mov	(%si),%di
8B 30 	mov	(%bx,%si),%si
AF    	scasw
75 F9 	jne	1b
F6    	.byte	0xF6
8B 39 Cadr:	mov	(%bx,%di),%di
3C    	.byte	0x3C
AF    Cdr:	scasw
8B 05 Car:	mov	(%di),%ax
C3    	ret
89 D6          Assoc:	mov	%dx,%si
8B 3C          1:	mov	(%si),%di
8B 30          	mov	(%bx,%si),%si
AF             	scasw
75 F9          	jne	1b
F6 8B 39 3C AF 	testw	$0xaf,0x3c39(%bp,%di)
8B 05          	mov	(%di),%ax
C3             	ret

8B 39          Cadr:	mov	(%bx,%di),%di
3C AF          	cmp	$0xaf,%al
8B 05          	mov	(%di),%ax
C3             	ret

AF             Cdr:	scasw
8B 05          	mov	(%di),%ax
C3             	ret

8B 05          Car:	mov	(%di),%ax
C3             	ret

Optimizing Printf

One of the issues with printf() from a size coding perspective, is it pulls in a lot of heavy-weight dependencies:

Yet somehow, despite all the bloat in printf(), binaries that casually link it (e.g. examples/hello3.c) can be as small as 24kb in size. This is thanks to the PFLINK() macro.

/* see libc/fmt/pflink.h */
/* see libc/stdio/stdio.h */
#define printf(FMT, ...) (printf)(PFLINK(FMT), ##__VA_ARGS__)
#define PFLINK(...) _PFLINK(__VA_ARGS__)
#define _PFLINK(FMT, ...)                                             \
  ({                                                                  \
    if (___PFLINK(FMT, strpbrk, "faAeg")) STATIC_YOINK("__fmt_dtoa"); \
    if (___PFLINK(FMT, strpbrk, "cmrqs")) {                           \
      if (___PFLINK(FMT, strstr, "%m")) STATIC_YOINK("strerror");     \
      if (!IsTiny() && (___PFLINK(FMT, strstr, "%*") ||               \
                        ___PFLINK(FMT, strpbrk, "0123456789"))) {     \
        STATIC_YOINK("strnwidth");                                    \
        STATIC_YOINK("strnwidth16");                                  \
        STATIC_YOINK("wcsnwidth");                                    \
      }                                                               \
    }                                                                 \
    FMT;                                                              \
  })
#define ___PFLINK(FMT, FN, C) \
  !__builtin_constant_p(FMT) || ((FMT) && __builtin_##FN(FMT, C) != NULL)

This only works with GCC and not Clang. It works because the first argument to printf() is almost always a string literal, and GCC is smart enough to run functions like __builtin_strstr() at compile time, sort of like a C++ constexpr. Once GCC has identified that we need a heavyweight feature, it then does what we call "yoinking" the appropriate symbol.

Here's how yoinking works. There's a trick with ELF binaries called weak linking. The printf() implementation has code like this:

      case 'm':
        p = weaken(strerror) ? weaken(strerror)(lasterr) : "?";
        signbit = 0;
        goto FormatString;

When a function is weakened, then it won't get pulled into the final binary unless some other module references it the strong or normal way. Yoinking is one way to do that. However yoinking is intended for the special situation where don't have any code that actually uses the symbol, but we still want to pull it into the binary.

	.section .yoink
	nopl	symbol
	.previous

So what we do with the yoink macro is is generate a reference inside a section that's explicitly discarded by the linker script.

  /DISCARD/ : {
    *(.yoink)
  }

So even though that reference is ultimately discarded, it's still enough to cause ld.bfd to pull the referenced module into the final binary.

This isn't a perfect technique, even though it's a worthwhile one. Large programs will implicitly yoink everything on a long enough timeline. So for big programs, this isn't a problem. But it's sometimes necessary for tiny programs to help the linker out explicitly. For example, you may be committing the security sin of using non-literal format strings. In that case the macro magic will cause everything to be yoinked, since it has no visibility into the string at compile time. If you still want to stay on the tinier side, then you can disable the magic by calling printf() as such:

(printf)("last error is %m\n");
Then, any features you need which are listed in the PFLINK() macro can be yoinked explicitly by your main() module.
STATIC_YOINK("strerror");

Tiny Portable ELF

Please take a look at the How Fat Does a Fat Binary Need To Be? page which lets you interactively build online Cosmopolitan Actually Portable Executable binaries. As we can see, features like SUPPORT_VECTOR provide all the power and configurability any user should need, to make their programs as tiny as they want them to be.

But let's say for a moment you only cared about ELF targets like Linux, FreeBSD, NetBSD, and OpenBSD. Let's furthermore imagine that you wanted to build a binary that runs on all four of these operating systems, from scratch, without the assistance of APE and Cosmo. It turns out that's relatively easy to do, and this section will show you how it works. What's also cool is that it's only 386 bytes in size, using an idiomatic best practices approach.

You need three files: reallytiny-elf.S, reallytiny.c, and reallytiny.lds. Here's your assembly glue:

/* reallytiny-elf.S */
#define ELFCLASS64 2
#define ELFDATA2LSB 1
#define ELFOSABI_FREEBSD 9
#define ET_EXEC 2
#define EM_NEXGEN32E 62
#define PT_LOAD 1
#define PT_NOTE 4
#define PF_X 1
#define PF_W 2
#define PF_R 4

	.align	8
ehdr:	.ascii	"\177ELF"
	.byte	ELFCLASS64
	.byte	ELFDATA2LSB
	.byte	1
	.byte	ELFOSABI_FREEBSD
	.quad	0
	.word	ET_EXEC			# e_type
	.word	EM_NEXGEN32E		# e_machine
	.long	1			# e_version
	.quad	_start			# e_entry
	.quad	phdrs - ehdr		# e_phoff
	.quad	0			# e_shoff
	.long	0			# e_flags
	.word	64			# e_ehsize
	.word	56			# e_phentsize
	.word	2			# e_phnum
	.word	0			# e_shentsize
	.word	0			# e_shnum
	.word	0			# e_shstrndx
	.globl	ehdr

	.align	8
phdrs:	.long	PT_LOAD			# p_type
	.long	PF_R|PF_X		# p_flags
	.quad	0			# p_offset
	.quad	ehdr			# p_vaddr
	.quad	ehdr			# p_paddr
	.quad	filesz			# p_filesz
	.quad	filesz			# p_memsz
	.quad	64			# p_align

#if 0  // will break openbsd unless we actually use bss
	.long	PT_LOAD			# p_type
	.long	PF_R|PF_W		# p_flags
	.quad	0			# p_offset
	.quad	bss			# p_vaddr
	.quad	bss			# p_paddr
	.quad	0			# p_filesz
	.quad	bsssize			# p_memsz
	.quad	64			# p_align
#endif

	.long	PT_NOTE			# p_type
	.long	PF_R			# p_flags
	.quad	note - ehdr		# p_offset
	.quad	note			# p_vaddr
	.quad	note			# p_paddr
	.quad	notesize		# p_filesz
	.quad	notesize		# p_memsz
	.quad	8			# p_align

note:	.long	2f-1f
	.long	4f-3f
	.long	1
1:	.asciz	"OpenBSD"
2:	.align	4
3:	.long	0
4:	.long	2f-1f
	.long	4f-3f
	.long	1
1:	.asciz	"NetBSD"
2:	.align	4
3:	.long	901000000
4:	notesize = . - note

_start:	mov	%rsp,%rsi
	jmp	Start
	.globl	_start

Here's your C application and C library:

/* reallytiny.c */
#define LINUX   1
#define OPENBSD 16
#define FREEBSD 32
#define NETBSD  64

#ifndef SUPPORT_VECTOR
#define SUPPORT_VECTOR (LINUX | FREEBSD | NETBSD | OPENBSD)
#endif

#define SupportsLinux()   (SUPPORT_VECTOR & LINUX)
#define SupportsFreebsd() (SUPPORT_VECTOR & FREEBSD)
#define SupportsOpenbsd() (SUPPORT_VECTOR & OPENBSD)
#define SupportsNetbsd()  (SUPPORT_VECTOR & NETBSD)

#define IsLinux()   (SupportsLinux() && os == LINUX)
#define IsFreebsd() (SupportsFreebsd() && os == FREEBSD)
#define IsOpenbsd() (SupportsOpenbsd() && os == OPENBSD)
#define IsNetbsd()  (SupportsNetbsd() && os == NETBSD)

__attribute__((__noreturn__)) static void Exit(int rc, int os) {
  asm volatile("syscall"
               : /* no outputs */
               : "a"(IsLinux() ? 60 : 1), "D"(rc)
               : "memory");
  __builtin_unreachable();
}

static int Write(int fd, const void *data, int size, int os) {
  char cf;
  int ax, dx;
  asm volatile("clc\n\t"
               "syscall"
               : "=a"(ax), "=d"(dx), "=@ccc"(cf)
               : "0"(IsLinux() ? 1 : 4), "D"(fd), "S"(data), "1"(size)
               : "rcx", "r11", "memory", "cc");
  if (cf) ax = -ax;
  return ax;
}

static int Main(int argc, char **argv, char **envp, long *auxv, int os) {
  Write(1, "hello world\n", 12, os);
  return 0;
}

__attribute__((__noreturn__)) void Start(long di, long *sp) {
  long *auxv;
  int i, os, argc;
  char **argv, **envp, *page;

  // detect freebsd
  if (SupportsFreebsd() && di) {
    os = FREEBSD;
    sp = (long *)di;
  } else {
    os = 0;
  }

  // extract arguments
  argc = *sp;
  argv = (char **)(sp + 1);
  envp = (char **)(sp + 1 + argc + 1);
  auxv = (long *)(sp + 1 + argc + 1);
  for (;;) {
    if (!*auxv++) {
      break;
    }
  }

  // detect openbsd
  if (SupportsOpenbsd() && !os && !auxv[0]) {
    os = OPENBSD;
  }

  // detect netbsd
  if (SupportsNetbsd() && !os) {
    for (; auxv[0]; auxv += 2) {
      if (auxv[0] == 2014 /* AT_EXECFN */) {
        os = NETBSD;
        break;
      }
    }
  }

  // default operating system
  if (!os) {
    os = LINUX;
  }

  Exit(Main(argc, argv, envp, auxv, os), os);
}

Here's your linker script:

/* reallytiny.lds */
ENTRY(_start)

SECTIONS {
  . = 0x400000;
  .text : {
    *(.text)
    *(.rodata .rodata.*)
  }
  filesz = . - ehdr;
  textsz = . - _start;
  .bss ALIGN(4096) : {
    bss = .;
    *(.bss)
    . = ALIGN(4096);
  }
  memsz = . - ehdr;
  /DISCARD/ : {
    *(.*)
  }
}

bsssize = SIZEOF(.bss);
textoff = _start - ehdr;

You can then build your program as such:

gcc -static -no-pie -g -Os \
  -o reallytiny.elf.dbg \
  -Wl,-T,reallytiny.lds \
  reallytiny-elf.S \
  reallytiny.c
objcopy -S -O binary \
  reallytiny.elf.dbg \
  reallytiny.elf

Here's your binary visualized:

[reallytinyelf.png]

Here's what happens when you run it:

linux$ ./reallytiny.elf
hello world
freebsd$ ./reallytiny.elf
hello world
netbsd$ ./reallytiny.elf
hello world
openbsd$ ./reallytiny.elf
hello world

Four operating systems is under 400 bytes is pretty good!

One thing you may have noticed is that BSDs don't strictly follow the System V Application Binary Interface w.r.t. the SYSCALL instruction. For example, NetBSD always clobbers RDX. FreeBSD might clobber R8 and R9 sometimes. But most importantly, the carry flag is used to indicate an error occurred, rather than doing the thing Linux does (as specified by System V) where an error returns a negative errno. The simplest way to tell the two apart, is to use the clc instruction before invoking syscall. This is needed because Linux actually saves the processor flags, so we need to make sure it isn't set beforehand. Then, if the carry flag is set after issuing the system call, then it must be due to a BSD error. In that case, we simply flip the sign of the RAX register.

Detecting the operating system from the C code is a little ugly, but it's nice that we're able to do it without issuing any system calls. For instance, OpenBSD doesn't have any auxiliary values (and hopefully never will!) so we've got that base covered. FreeBSD likes to set RDI to RSP so that the assembly thunk isn't needed. NetBSD likes to use very large ordinals for its auxiliary values when compared to Linux, which is the default. One other interesting fact about the _start() function is that all operating systems will clear the other registers, however NetBSD only does it for callee-used registers. So R12, R13, etc. might be some random value on NetBSD.

Another unfortunate thing about the system call ABI is that, with its x86-64 support, Linux decided to arbitrarily diverge from the pack. The ordinals 1 and 4 for exit and write respectively, actually haven't changed in the entire history of the functions. Those very same ordinals were used in the original Bell System V codebase.

Funding

[United States of Lemuria - two dollar bill - all debts public and primate]

Funding for this blog post was crowdsourced from Justine Tunney's GitHub sponsors and Patreon subscribers. Your support is what makes projects like Cosmopolitan Libc possible. Thank you.

See Also