Traditional Unix Toolchains

Older Unix systems tend to be fairly uniform in how they handle the so-called 'toolchain' for creating binaries. This blog will give a quick overview of the toolchain pipeline for Unix systems that follow the V7 tradition (which evolved along with Unix, a topic for a separate blog maybe).

Unix is a pipeline based system, either physically or logically. One program takes input, process the data and produces output. The input and output have some interface they obey, usually text-based. The Unix toolchain is no different.

Overview

Here's a simplified view of what's going on. We'll add more detail later.

In this view, the C compiler takes .c code and turns it into assembler. How it does that, and how it optimizes, etc is for another blog post. Once the assembler is created, it's passed to as(1) which translates the assembler into .o files. The .o files contain the binary representation fo the assembler, plus a lot of metadata about it: what addresses correspond to what symbols, how to relocate the raw assembler when connected together, various debugging information (sometimes) and what section each bit of data resides in. You cannot directly execute a .o file. ld takes all the .o files and produces an executable (the default name of which is a.out). a.out files are executable. They happen to be in the same format as the .o files, except they have a different magic number which tells the kernel how to load them into memory and initialize the CPU's registers for that program.

Program Layout

In traditional unix, there were only three sections to a program. There were no shared libraries or other fancy things done by the linker (such as linker sets). The world view was much simpler. There were three sections, each one had a size. There was the 'text' section. This was the executable code. There was the 'data' section, which contained initialized data. And there was the 'bss' section which also contained data that was initialized to 0.

The heap resides above the bss and is managed by the unix sbrk(2) system call. The so-called "break" is set to the end of the bss segment (often referred to by the symbol ebss). Malloc(3) is built on top of sbrk(2) and will manage returning bits to the OS when it can.

For the PDP-11 and other segmented architectures, there can be complications. There can be separate I&D space (instruction and data) so that each one resides in it's own address space. This helps PDP-11 programs break the 64k limit. In addition, there can be overlays. Overlays are 8k segments that are mapped into the address space as needed to increase the text size of the program. The linker handles much of this, but the programmer must specify the overlap groups. Each group can be no more than 8k in size, and the main program can be no more than 56k in size (and the overlay manager uses 8k of the data segment as well). Programs in unix tend to not use overlays, but the kernel makes heavy use of them.

Compiler (cc, f77, etc)

Compilers create assembler output. Compilers, like the C compiler, may invoke other programs to do this. The C compiler runs the source through cpp to create an intermediate file (.i files) that it runs through the first pass of the C compiler. There are a number of other passes of the compiler that take the initial output and optimize it in various ways (usually by parsing and rewriting assembler). The final output of the compilers, at least in this era, is always textual assembler.

Assembler (as)

The assembler is the only thing in the system that creates .o files. Well, not strictly true since ld can also take .o files and produce a .o file from it, but true enough. The assemble takes the textual assembler and creates a .o file from that. The .o file includes information about how to relocate it (ld uses this info), what symbols go where, what bits are in the text section, what is in the data, etc.

Archive Files (ar)

It's fairly expensive to open a lot of files on Unix, especially if they are small. It's also inconvenient to carry around a number of different files to implement something. So unix ld also supports reading files in from an archive. An archive adds some headers to describe the file and then places a copy of the file into the archive. Unix has had a number of different archive formats (I could do a mini-blog entry on all of the ones through 4BSD), but conceptually they are all the same. The ar(1) program is used to create and manage .a files for this purpose.

One motivation for the library is to save space. If it's efficient to create a lot of tiny .o files, then the loader can only bring in what's needed saving space. When the address space is only 64k (or 64k+64k for separate I&D machines), every little bit helps. The archiver removes the overhead of having to do directory lookups on hundreds of files by creating a container for those objects that ld doesn't have to process by name.

When processing through a library, ld looks at each .o that's in the archive for symbols and includes it if it finds any that are needed for the current image so far. It just does one pass through an archive, though. This means if you have foo.o and bar.o in the archive and foo.o depends on bar.o somehow, foo.o needs to come first so that ld can find bar.o later. If they are in the other order, then ld may pass over bar.o entirely and then when it is processing foo.o, it will not go back and look for it.

lorder(1) and tsort(1) are used to try to optimize the order of the .o's in the library to make it possible to just do one pass through the library (though when you have circular dependencies, you're back to this same issue). lorder uses nm(1) to read all the .o's on the command line and produce dependencies. tsort takes these dependencies and sorts them into a list in the proper dependency order where possible. When cycles exist, it produces an order that minimizes passes required to resolve them all.

That sounds quite inconvenient, and it is. libc.a, especially in newer versions, has a number of circular dependencies that a single pass fails to resolve. One can work around this issue by specifying libc multiple times (which is unsatisfying, even if it doesn't produce a binary with two copies of everything), or do something else. The something else involves adding a table of contents to the library. A program called ranlib will read through an archive creating an index of defined symbols that points to the offset in the file that he .o with that symbol is present at. It does this by creating a first member of the archive named __.SYMDEF and placing its table in there. The format of the table is something that ranlib and ld agree on. ld uses this table to include files that are needed and to be able to seek backwards easily. In effect, this is the same as ld doing two passes over an archive (one to build this index, and one to process it), but in effect precomputes the first pass to keep ld simpler.

Loader (ld)

The linker can operate in two different modes. The first mode most people are not familiar with. In this mode, it will take .o files and partially link them together to produce a new .o file. libc uses this mode of operation to create all the assembler glue to call system calls and optimize out some of the local labels that the assembler produces and expects ld to optimize away. Since these files are consulted so often, the build process of libc does it at build time so that every invocation of ld later can be faster.

The second mode is the mode people are more familiar with. In this mode, ld combines a number of .o and .a files to create an executable (a.out by default, so that's what people call executables in general). a.out binaries lacked shared libraries, so ld produced the final output. This was both good and bad. It wasted space with all those copies of libc, but it also produced self-contained binaries that didn't need any external libraries to work. The PDP-11 didn't really have good demand paging hardware, like the later VAX machines, so was a poor fit to shared libraries. Shared libraries generally rely on mmap(2) working and larger address spaces to map the libraries in at. mmap(2) requires a page-grained MMU, which didn't usefully exist on the PDP-11 (it had 8k segments, which was far to large a percent of the whole address space to be useful). One good thing about binaries being self contained means that if the kernel can run the system calls that are in it, the binary will work. This has allowed PWB systems to be able to execute both v6 and v7 binaries, despite the two having different system call interfaces.

The Kernel

Speaking of the kernel, the kernel is the last step in the toolchain, or can be thought of as such. The kernel reads in the headers from the a.out files, and sets up the address space for the process when a new a.out binary is exec'd. It uses the layout I showed above, or some variant of it, to set things up, to populate memory and makes whatever arrangements with the MMU to protect the pages from other processes (if possible, some systems like an 8086 don't have MMUs but do have segments so can fake all this except the memory protection benefits). For separate I&D space binaries, it also sets up the segment registers for that to work.

The stack is also initialized. The detail of exactly where it goes varies somewhat. It's usually located with the data segment since it holds data almost exclusively. Stacks in this era were usually quite small. This tended to drive programs that had shallow call graphs and that made use of more global variables than a more modern style would suggest. All these things conserved stack space, though it's generally agreed today that it required more effort to read and understand because the context is spread out over more parts of the program than more modern coding practices tend to produce.

Conclusion

Without the complications of shared libraries, or link time optimizations, the tools of this area tended to be rather simple. They had simple interfaces between them. There were good boundaries between the different components. This limited the number of programs with knowledge of the formats for the different layers. Due to this limited spread of knowledge, switching out different parts for other parts often could be done without changing components that didn't directly know about the object format. This also produced simpler programs that used different engineering tricks to get the most performance out of the limited hardware of the day. The PDP-11s were approximately 0.1-0.5MIPS machines in this time frame with super slow I/O paths. This is about 100,000 times slower than most computers people interact with today. One advantage of the thoughtful engineering trade offs is that all the pieces are relatively easy to understand.

The modern ecosystems that we have today are more complex. ELF came along in the 90s and obsoleted the text, data, bss world view. shared libraries made huge programs, like X11, feasible. Today, clang bypasses the separate assembler stage and generates .o files directly. The llvm linker, lld, can optimize binaries between modules to produce better code. All these new features added complication to a simple model. While I morn for the loss of simplicity, I've become too used to the rich features they provide to want to go back. Understanding the roots of this complexity, though, helps to understand some of the weird quirks that persist, even to this day.

And speaking of weird quirks, I'd like to end with 'bss'. It's a 1950s IBM assembler mnemonic for 'block started by symbol' and was used to create storage that was associated with a symbol, but had no initial value. Today, 'bss' is no longer that, exactly. Its origin has been lost, for most people, in the sands of time and now it just means 'zeroed storage area'. So this very Unix centric term actually predates Unix by 10 or 15 years for a machine that Unix wouldn't run on until it was 10 or 15 years old... Here's a snapshot from the IBM assembler manual, available from the UA-SAP wikipedia page

showing the original source...

[[ This blog edited to include snapshot of the BSS manual entry ]]

Warner's Random Hacking Blog

20200716