20200727

When Unix learned to reboot(2).

History of Reboot(2)

Recently, a friend asked me the history of halt, and when did we have to stop with the sync / sync / sync dance before running halt or reboot. The two are related, it turns out.

That sync; sync; sync Thing...

If you go looking around the net, you'll find some people giving advice like "when shutting down, type 'sync; sync; sync; halt' to be safe." There's good reasons behind this advice which aren't immediately clear and are interesting to explore. Before exploring, I'd been told that the reasons for the sync dance were a driver bug in v6 that's been fixed 45 years ago... But it turns out whoever told me that must have been mistaken because the code tells a different story...

The sync program called the sync system call and exited (and still does). The sync system call in research editions of unix was implemented as approximately:
foreach mountpoint
    write superblock with bwrite
for each dirty inode
    write the inode with iupdat
bflush
 
It would step through the fixed list of buffers in the system, writing the dirty ones out. It used bwrite() to do this, which was synchronous.  Each write had to complete before the next one started. iupdat will read the inode off the disk, update that inode, and write out, again synchronously. bflush writes everything with bwrite, but marks the buffers as B_ASYNC which means in that case it won't wait. And nothing else waits either. So, recommendation for typing sync 3 times, one line at a time, was to give time for the buffers to drain (the subsequent syncs would schedule no new I/O on a quiet system). Typing all three on one line with semicolons, didn't give this time...

If you look at the recommendation, it's actually quite smart. Typed one line at a time, waiting for the prompt each time, would schedule a lot of I/O the first time, then give the operator a harmless task to do for a few seconds that would allow the I/O to complete before they did anything. The kernel avoided all kinds of nasty deadlocks that later systems would face when they implemented waiting for the I/O to complete.

Edit: One bit of lore that was passed on to me was the first sync returned right away, but the second one blocked.... I've found no evidence of that in BSD or System V based systems... although there is an increasing amount of protection against multiple threads being in the sync code as concurrency in Unix increased.

Why not do this in reboot(2)?

None of the versions of Research Unix had a system call to reboot. To restart things, one killed init with SIGHUP, which would in turn kill everything else and fork a new shell in single user. There was no other way to restart the system, and bad things happened if init actually died. But there was no clean reboot option, nor any way to stop the kernel cleanly (apart from the power switch).
Looking at the sources, there was one small hint that something was planned, but never executed. All the system calls were defined in /usr/include/sys.s. A close examination shows the following:
lock    = 53.
ioctl   = 54.
reboot  = 55.
mpx     = 56.
setinf  = 59.
which proves me wrong, right? Well, maybe not. Looking at sysent.c, we see the following:
1, 0, syslock,          /* 53 = lock user in core */
3, 0, ioctl,            /* 54 = ioctl */
0, 0, nosys,            /* 55 = readwrite (in abeyance) */
4, 0, mpxchan,          /* 56 = creat mpx comm channel */
0, 0, nosys,            /* 57 = reserved for USG */
0, 0, nosys,            /* 58 = reserved for USG */
3, 0, exece,            /* 59 = exece */
which lists 'nosys' as the handler, so there's no implementation. There's no reboot system call. I'll also note that there's another disconnect: system call 59 is listed as setinf (whatever that is), but is implemented as exece.

Enter 4BSD

The first reference to reboot(2) I can find is in 4.0BSD in sysent.c, we see the following;
3, 0, ioctl,            /* 54 = ioctl */
1, 0, reboot,           /* 55 = reboot */
4, 0, mpxchan,          /* 56 = creat mpx comm channel */
where reboot landed in the slot allocated for it. There's a new command in /etc/ that calls it (by the syscall number, not the normal wrapper). It wasn't in 2.8BSD, but is also present in a similar form in 2.9BSD and later. 3BSD still has the placeholder pointing at nosys. Since the 2BSD evolution tracked 4BSD, I'll not call it out further.

In 4.0BSD (1980), reboot() just call a machine dependent boot() routine. It called update(), which scheduled the writes as described above. It printed that it was waiting for the IO to finish. However, the 'wait for it' code was basically 'sleep(5)' and so if all the data didn't get out in 5 seconds, bad things would happen. So the "sync sync sync halt" dance was still useful advice. It would get thee ball rolling and easily double the amount of time that the data had to make it to the disk, depending on the typist... 4.1c (1982) bumped this to 10s and had ifdef'd out code to try to wait for all the dirty buffers to clear.

In 4.2BSD (1983), the wait for the writes code is engaged. It tries up to 20 times to walk the list of bufs in the system to let the buffers drain. So progress is made here. 4.3 adds a delay that's 40ms * iter (total of 8.4s)... which remains through at least 4.4BSD (1993)...  So while things got better, so did systems, and more and more I/O could be piled up. Successor BSD systems improved on this as well, including various ways to solve it (mostly when systems got big enough so update(8) couldn't flush all the I/O in 30s before the next sync call started).

AT&T Unix

Meanwhile, on the AT&T side of things...  System III (1980) didn't have anything. System Vr1 doesn't have anything. None of the Programmer Work Bench releases (PWB) have a reboot(2).

System Vr2 (1984) defined a new uadmin(2) system call. It acted like an indirect system call (you passed it what you wanted it to do as the first arg). One of these, A_REBOOT, is called from fsck, but the kernel doesn't implement it.

Fast forward to System Vr3 (1987) and we find an implementation. It's basically a call to umount for the / filesystem followed by a call to reset the CPU. The other filesystems are unmounted as part of shutdown process before uadmin(A_REBOOT, ..) gets called, so only / remained mounted by the time it's called. This call to umount flushes the dirty buffers, and cleanly unwinds everything else so nothing is pending when the call to the CPU reset happens. So finally the 'sync sync sync halt' problem had been solved... Well, maybe... There's no timeout, nor any way to avoid deadlock. Still, a fairly clean solution to the issue, especially relative to what BSD was doing at the time.

Commercial Unix

The availability of even leaked source from the early days. By the time SunOS gets to 4.1, however, there was a vfs_syncall() which the MD boot() function called to synchronize everything (it only returned when all the scheduled I/O was done). I've not checked to see if there was a halting problem here or not, but empirical evidence of rebooting a lot of suns suggests that it was rarely a problem in practice if any of the I/Os somehow got wedged...  Or that I was lucky enough not to have a large enough fleet of machines where flaky disk problems start to show up... I can't find any earlier versions of the Sun sources, so it's hard to know when this solution entered into the tree (the code I have looked at dates from 1994, 11 years after the initial release).  I suspect Sun solved this problem early, but have no proof of this beyond a hunch.

Other Unixes, that aren't just System V ports, are hard to find in source form, so I can't say from original sources whether or not they solved the issue or not prior to System Vr3. There were also a lot of 4.2BSD and 4.3BSD ports that didn't survive either to be examined.

The one exception to this general rule was a copy of the Unisoft 1.0 kernel I found on bitsavers. It dates from 1986 (so relatively late). It has a reboot system call (number 64, not 55). That system call calls update(), like 4BSD's, but then does a big for() loop (1 to 1,000,000) before calling a routine that resets the CPU. This kernel is a V7 port, and most likely got the idea (and maybe the code) from one of the 4BSD releases. This kernel appears to be the basis Sony's SUNIX (which appears to be a 7th Edition-based Unix that predated SONY's NEWS-OS based on 4.2BSD). NEWS-OS likely behaved like 4.2BSD, but I can't confirm that due to lack of sources. If you know more info about SUNIX or NEWS-OS, please leave a comment.

Linux

Linux's sync call is synchronous. You get the same guarantees as you do from fsync. This behavior was introduced in 1.3.20, released in 1995. Prior to that the same sync dance advice was useful since early versions of Linux were more aggressively asynchronous in their handling of disk writes than other contemporary systems. While this helped it compete in benchmarks, it caused data integrity problems when Linux machines started to be put into production (which was one of the reasons motivating the change). Modern Linux systems flush out all the dirty buffers as part of the shutdown sequence and wait for the flush to complete before proceeding to reboot, turning the system off or halting.

Conclusion

For years, I'd been told the reasons for the 3 sync dance was due to a driver bug, long since fixed, in a DEC disk driver in v6. However, digging into it shows that there were decent reasons for doing this dance, even after Unix learned to reboot() itself.

2 comments:

  1. David Holland8:45 PM

    My recollection is that the three syncs are a linux thing for some other reason, and prior to that the standard was two ("sync; sync; halt"). But I can't remember for sure.

    Anyway, I too had heard the explanation that the second sync would block, because it would have to cycle the busy bit on each buffer and that would stop on whichever buffer was currently in transit. But I guess that's wrong, because after waiting for one block nothing except scheduler luck would stop it from then going on past, so you'd need more than just two (or three) syncs.

    More seriously, though, it looks like in v7 at least sync doesn't wait for the busy bit -- after it flushes all superblocks and all inodes, it goes into bflush(), which goes through and sets B_BUSY | B_ASYNC willy-nilly on every dirty buffer... hijacking it and causing it to be brelse()'d arbitrarily if some other process is in the middle of using it. I'm surprised this didn't lead to serious problems. (Or, maybe it did...)

    ReplyDelete
  2. I've heard the second sync would block over the years as well for various systems. I've found no systems where that I can examine the source where the system blocks the start of the second sync waiting for the first to complete. But looking at the code, what actually happens is that all the B_LOCKED bios are simply skipped.

    B_BUSY means that there's I/o in flight (so setting that is right) and B_ASYNC means that there's no wait called after submission.

    V7 differs from modern unix in that it has no mmap() system call. Only write will dirty things, and the filesystem does things in a way such that a buffer can be brelse() once it relinquishes control. V7 was also single threaded, run to completion inside the kernel, so all the places that could sleep protect against this possibility.

    The two things that complicate today's kernels (mmap and multi-processing) weren't a concern at all in the V7 kernel.

    ReplyDelete