20160109

Details on coming automatic module loading in FreeBSD

Automatic Module Loading

For a long time, I've wanted to add better, automatic module loading to FreeBSD. This past year, I started implementing that feature. Time and work pressures prevented me from completing it.

Some background

Every device in our systems is enumerated in one of two ways. Either the bus gives us a list of all the devices, with certain per-device attributes (called plug and play data), or the device is attached through some other means. These latter devices are beyond the scope of this work: there are generally very few of them in the system, and they aren't optional. The other devices, sometimes called 'self-enumerating' devices, have enough plug and play data for each driver in the system to decide whether it can drive them. Most operating systems assign devices to drivers using this data. Some have the data encoded into tables in the filesystem (Windows, OS X and Solaris), while others encode the data into the drivers (FreeBSD, NetBSD, OpenBSD, Linux, DragonFly BSD), though Linux offers some hybridization when it comes to certain devices. All of these OSes other than FreeBSD are beyond the scope of this work. There's certainly room for debate over which approach is best, but we'll leave that aside as well.

Some busses on FreeBSD, like USB and PC Card, have very stylized probe routines. Drivers for devices on these busses generally call some bus-provided routine to match the device against a table that's basically the same for all drivers (though there's usually some stylized way to attach extra data). These drivers are easy to adapt to this new scheme because while some custom code needs to be written for each bus, each driver of that bus can generally use a macro to implement marking the PNP data (more on what marking means in a bit).

Other busses, like PCI, leave it entirely to the driver. So most of the drivers in the system have written their own matching routines that essentially loop through a table matching some attribute the bus provides to decide if the device is for them. These busses are harder to adapt.
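To make the pattern concrete, here is a minimal sketch of the kind of hand-written matching routine such a driver tends to carry. The struct, the table contents and the function name are all made up for illustration; real drivers each define their own layout, which is exactly what makes these busses harder to adapt.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-driver table entry; not taken from any real driver. */
struct foo_pci_id {
	uint16_t vendor;
	uint16_t device;
	const char *desc;
};

/* Made-up IDs, for illustration only. */
static const struct foo_pci_id foo_pci_ids[] = {
	{ 0x1234, 0x0001, "Example adapter" },
	{ 0x1234, 0x0002, "Example adapter, rev 2" },
	{ 0x5678, 0x0001, "Another example adapter" },
};

/*
 * Loop through the table and compare the attributes the bus provides.
 * Returns the matching entry, or NULL if the device is not for us.
 */
static const struct foo_pci_id *
foo_pci_match(uint16_t vendor, uint16_t device)
{
	size_t i;

	for (i = 0; i < sizeof(foo_pci_ids) / sizeof(foo_pci_ids[0]); i++) {
		if (foo_pci_ids[i].vendor == vendor &&
		    foo_pci_ids[i].device == device)
			return (&foo_pci_ids[i]);
	}
	return (NULL);
}
```

A probe routine would call something like foo_pci_match() with the values read from the bus and decline the device when it gets NULL back.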

But either way, almost all drivers have some table of plug and play data that they use to filter their choice of devices. Since drivers are compiled into modules, this means most modules on the system have this data too. However, given the chaotic nature of the different busses, it's impossible for a program to find this data unless it has very specific knowledge of each and every driver on the system (though for some classes of drivers, like USB and PC Card, it needn't know everything).

Finally, switching gears a bit, we have modules. Modules in the system record dependencies on other modules in the system using special macros. When the modules are installed, kldxref(8) runs through all of them, extracting these dependencies into a file called linker.hints that lives in the directory where the kernel and modules reside.

General Theory

So, with that background, it's time to explore the design. What if we could take that chaotic state of affairs and somehow tame it? What if we could create macros to mark the plug and play data, and to associate the various binary fields with the plug and play attributes provided by the bus driver? What if we could create records in the modules similar to those used to mark module dependencies? Then kldxref(8) would be able to comb through this data and record it in linker.hints, and we'd need few other modifications to the system to make this data readily accessible.

That's exactly what my changes do. New types of records that describe the plug and play table in the driver are inserted into the special section of modules. These records contain a header that holds the length of each table entry, a pointer to the first entry, and a tiny little "script" or "description" of the table that ties this binary data to the bus-provided plug and play data. Each bus in the system that is of the stylized type described above defines its own macros to help its client drivers mark the data. Since the data has the same layout for all the drivers, client drivers don't need to reinvent the wheel. Since we also pass the length of each table entry, drivers can use the common pattern of having the common data first, followed by whatever other data they need for each device in the table.
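The reason the entry length matters can be shown with a small sketch. Everything named here (struct pnp_record, struct bus_pnp_common, struct foo_entry, pnp_record_find) is a hypothetical stand-in, not the actual in-kernel record layout. The point is only that a generic reader can step through a driver's table using the recorded entry length and look at the common leading fields, without knowing what private data follows them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Common leading part a hypothetical stylized bus defines for its drivers. */
struct bus_pnp_common {
	uint16_t vendor;
	uint16_t product;
};

/* Hypothetical driver entry: common data first, private data after it. */
struct foo_entry {
	struct bus_pnp_common pnp;
	int quirks;
	const char *name;
};

/* Illustrative stand-in for the record header described above. */
struct pnp_record {
	const char *descr;	/* covers only the leading, common fields */
	const void *table;	/* pointer to the first entry */
	size_t entry_len;	/* length of each entry, private data included */
	size_t nentries;	/* number of entries */
};

/* Made-up table contents. */
static const struct foo_entry foo_table[] = {
	{ { 0x1234, 0x0001 }, 0, "first" },
	{ { 0x1234, 0x0002 }, 1, "second" },
};

static const struct pnp_record foo_record = {
	"U16:vendor;U16:product;",
	foo_table,
	sizeof(foo_table[0]),
	sizeof(foo_table) / sizeof(foo_table[0]),
};

/*
 * Walk the table by entry_len, reading only the common header of each
 * entry. Returns the index of the matching entry, or -1 if none matches.
 */
static int
pnp_record_find(const struct pnp_record *rec, uint16_t vendor,
    uint16_t product)
{
	const unsigned char *p = rec->table;
	struct bus_pnp_common c;
	size_t i;

	for (i = 0; i < rec->nentries; i++, p += rec->entry_len) {
		memcpy(&c, p, sizeof(c));
		if (c.vendor == vendor && c.product == product)
			return ((int)i);
	}
	return (-1);
}
```

Because the walker never touches quirks or name, a driver is free to put anything it likes after the common fields.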

However, that leaves more work for the non-stylized busses. While a few of these drivers have been converted, many remain.

Details about the marking

The first place to look for the details is sys/module.h. The first user-visible bit is the MODULE_PNP_INFO macro:
#define MODULE_PNP_INFO(d, b, unique, t, l, n)
"d" is the description of the table (more on that below). "b" is the name of the bus. Unique is a unique string (typically the driver name). "t" is a pointer to the plug and play table. "l" is the length of each entry in the table. "n" is the number of entries.

The description is of the general form  (TYPE:pnp_name[/pnp_name];)* where TYPE is one of the following:
  • U8      uint8_t element
  • V8      like U8 and 0xff means match any
  • G16    uint16_t element, any value >= matches
  • L16     uint16_t element, any value <= matches
  • M16    uint16_t element, mask of which of the following fields to use.
  • U16     uint16_t element
  • V16     like U16 and 0xffff means match any
  • U32     uint32_t element
  • V32     like U32 and 0xffffffff means match any
  • W32     Two 16-bit values with first pnp_name in LSW and second in MSW
  • Z       pointer to a string to match exactly
  • D       like Z, but is the string passed to device_set_descr()
  • P       A pointer that should be ignored
  • E       EISA PNP Identifier (in binary, but bus publishes string)
  • K       Key for whole table. pnp_name=value. must be last, if present.
The pnp_name "#" is reserved for other fields that should be ignored.
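As a rough illustration of how little machinery the description needs, here is a sketch of a splitter for the (TYPE:pnp_name[/pnp_name];)* form. It is not the kldxref(8) parser; struct pnp_field, pnp_descr_parse and the buffer sizes are assumptions made for this example, and the splitter does not check that TYPE is one of the types listed above.

```c
#include <stddef.h>
#include <string.h>

#define PNP_MAXTOK 16	/* arbitrary limit chosen for this sketch */

struct pnp_field {
	char type[PNP_MAXTOK];		/* e.g. "U16", "W32", "D" */
	char name[2 * PNP_MAXTOK];	/* e.g. "vendor", "vendor/device", "#" */
};

/*
 * Split a description into TYPE/pnp_name pairs.
 * Returns the number of fields found, or -1 if the string is malformed,
 * a token is too long, or there are more than max fields.
 */
static int
pnp_descr_parse(const char *descr, struct pnp_field *out, int max)
{
	int n = 0;

	while (*descr != '\0') {
		const char *colon = strchr(descr, ':');
		const char *semi = strchr(descr, ';');
		size_t tlen, nlen;

		if (colon == NULL || semi == NULL || colon > semi || n >= max)
			return (-1);
		tlen = (size_t)(colon - descr);
		nlen = (size_t)(semi - colon - 1);
		if (tlen == 0 || tlen >= sizeof(out[n].type) ||
		    nlen == 0 || nlen >= sizeof(out[n].name))
			return (-1);
		memcpy(out[n].type, descr, tlen);
		out[n].type[tlen] = '\0';
		memcpy(out[n].name, colon + 1, nlen);
		out[n].name[nlen] = '\0';
		n++;
		descr = semi + 1;	/* continue after the ';' */
	}
	return (n);
}
```

Fed a description such as "U16:vendor;U16:device;D:#;" (the pnp_names here are illustrative), it yields three fields: two 16-bit values to compare and a description string whose "#" name marks it as not matched against anything the bus publishes.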

All this is a bit complicated. However, it can be simplified greatly for the buses that are stylized. For PC Card, for example, the PCCARD_PNP_INFO macro just takes a pointer to the first element and figures the rest out from there (you can see how in sys/dev/pccard/pccardvar.h).

kldxref(8)

kldxref(8) has been modified to find all these entries. The above description is fairly complex, but covers all known tables in the current system. kldxref takes the above and reduces it to a much smaller set of types by expanding the different fields into a format more suited to quick parsing.
The output uses the same basic format as the pnp string documented in sys/module.h (and above), in simplified form. First a string describing the format is output, then a count of the number of records, then each record. The format string also describes the length of each entry (though it isn't a fixed length when strings are present). The simplified types are:
Type    Output      Meaning
I       uint32_t    Integer equality comparison
J       uint32_t    Pair of uint16_t fields converted to native byte order. The two fields both must match.
G       uint32_t    Greater than or equal to
L       uint32_t    Less than or equal to
M       uint32_t    Mask of which fields to test. Fields that take up space increment the count. This field must be first, and resets the count.
D       string      Description of the device this pnp info is for
Z       string      pnp string must match this
T       nothing     T fields set pnp values that must be true for the entire table.

Values are packed the same way that other values are packed in this file. Strings and int32_t values start on a 32-bit boundary and are padded with 0 bytes. Objects that are smaller than uint32_t are converted, without sign extension, to uint32_t to simplify parsing downstream.
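The two packing rules can be sketched in a few lines. This is not the kldxref(8) code: the helper names are invented, and the string representation used here (a NUL-terminated C string) is an assumption for illustration only; the hints file may well encode strings differently. What the sketch does show is the alignment to a 32-bit boundary with zero padding, and the widening of a small value without sign extension.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round an offset up to the next 32-bit boundary. */
static size_t
pack_align32(size_t off)
{
	return ((off + 3) & ~(size_t)3);
}

/* Widen a 16-bit value to 32 bits; 0xffff becomes 0x0000ffff. */
static uint32_t
pack_widen16(uint16_t v)
{
	return ((uint32_t)v);
}

/* Append a uint32_t at the next 32-bit boundary; returns the new offset. */
static size_t
pack_u32(unsigned char *buf, size_t off, uint32_t v)
{
	size_t start = pack_align32(off);

	memset(buf + off, 0, start - off);	/* pad with 0 bytes */
	memcpy(buf + start, &v, sizeof(v));
	return (start + sizeof(v));
}

/*
 * Append a string at the next 32-bit boundary; returns the new offset.
 * ASSUMPTION: strings are stored NUL-terminated in this sketch.
 */
static size_t
pack_string(unsigned char *buf, size_t off, const char *s)
{
	size_t start = pack_align32(off);
	size_t len = strlen(s) + 1;

	memset(buf + off, 0, start - off);	/* pad with 0 bytes */
	memcpy(buf + start, s, len);
	return (start + len);
}
```

The caller is responsible for providing a buffer large enough for everything it packs.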

Current State

While the recording side is fairly well finished and committed to the tree, the rest of it is still up in the air. This section describes suggested future work for anybody wishing to help.

The easiest thing to do is to convert a few drivers to record this info. This conversion usually goes fairly quickly after you've found a similar driver that's been converted. Some drivers "save" space by matching the vendor code, for example, in code, while the device is matched from a table. When converting these drivers, you need to add the vendor code to each line in the table, and modify the code to get the vendor from the table.
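Here is a small before-and-after sketch of that last kind of conversion. The driver, its IDs and all the names are made up; the only point is that the converted table carries the vendor on every line, so the table by itself fully describes what the driver matches.

```c
#include <stddef.h>
#include <stdint.h>

#define FOO_VENDOR 0x1234	/* made-up vendor ID */

/* Before: the table holds only device IDs; the vendor check is in code. */
static const uint16_t foo_old_devices[] = { 0x0001, 0x0002 };

static int
foo_old_match(uint16_t vendor, uint16_t device)
{
	size_t i;

	if (vendor != FOO_VENDOR)
		return (0);
	for (i = 0; i < sizeof(foo_old_devices) / sizeof(foo_old_devices[0]);
	    i++)
		if (foo_old_devices[i] == device)
			return (1);
	return (0);
}

/* After: every line carries the vendor, and the code reads it from there. */
struct foo_id {
	uint16_t vendor;
	uint16_t device;
};

static const struct foo_id foo_new_ids[] = {
	{ FOO_VENDOR, 0x0001 },
	{ FOO_VENDOR, 0x0002 },
};

static int
foo_new_match(uint16_t vendor, uint16_t device)
{
	size_t i;

	for (i = 0; i < sizeof(foo_new_ids) / sizeof(foo_new_ids[0]); i++)
		if (foo_new_ids[i].vendor == vendor &&
		    foo_new_ids[i].device == device)
			return (1);
	return (0);
}
```

Both routines accept exactly the same devices; the table costs two more bytes per line, and in exchange a tool that only sees the table no longer misses the vendor.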

Some buses have few enough drivers that it would be beneficial to adapt them to a stylized bus, simplebus being an obvious candidate. While most of the SoCs that have simplebus use a monolithic kernel, it's never too early to plan for a more generic and modular future. Some work in this area is in review.

There are three different strategies for using this data. First, the boot loader already reads in the linker.hints file. It could be modified to parse this data (it currently ignores it) and look at the PCI devices in the system. This leaves a number of holes, however, and loading drivers from the boot loader currently has significant performance issues. Second, the kernel could parse this file and load drivers as needed. However, this is far from straightforward in the kernel, since module loading needs to be queued until after / is available, and even after boot, some insertion events may happen in contexts that won't allow modules to be loaded directly. Third, a userland program (perhaps devd(8)?) could parse the linker.hints file and create devd.conf scripts. USB currently has a program that generates its hints based on ELF sections, a design which informed the current implementation. Its generator knows the table format, a dependency that the current design hopes to avoid.
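Whichever consumer ends up using the data, it has to compare the simplified records against the values a device publishes. The following is only a sketch of that comparison for the I, G and L types from the kldxref table. The structure and function names are hypothetical, and it assumes the device's value is the left-hand side of each comparison (so G means the device value is greater than or equal to the recorded one).

```c
#include <stddef.h>
#include <stdint.h>

/* One field of a simplified record; a hypothetical in-memory form. */
struct hint_field {
	char kind;		/* 'I', 'G' or 'L' */
	uint32_t value;		/* value recorded from the driver's table */
};

/* Compare one device value against one recorded field. */
static int
hint_field_matches(const struct hint_field *f, uint32_t devval)
{
	switch (f->kind) {
	case 'I':
		return (devval == f->value);
	case 'G':
		return (devval >= f->value);
	case 'L':
		return (devval <= f->value);
	default:
		return (0);	/* kinds not handled by this sketch */
	}
}

/* A record matches only if every one of its fields matches. */
static int
hint_record_matches(const struct hint_field *fields,
    const uint32_t *devvals, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!hint_field_matches(&fields[i], devvals[i]))
			return (0);
	return (1);
}
```

The M, J, D, Z and T types are left out to keep the example short; a real consumer would need to handle all of them.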

So a good start has been made, but more work is needed before we can ship only a MINIMAL kernel with modules. Watch this space for more info in the future.

3 comments:

Anonymous said...

What an awesome blog post Warner!

-Alex

Unknown said...

Why not skip the part where devd configurations are created and just have a process that reads the hints and listens on /var/run/devd.pipe. It would spare us spamming /etc/devd/ with huge and trivial files.

Ali said...

Thanks for the awesome post.

I think FreeBSD needs in this case some kind of module blacklists. This way broken modules that get loaded automatically can be somehow easily disabled.

-Ali