Please join me at my new location bryankyle.com

Tuesday, May 18, 2010

An Elegant Design: Maildir

One of the most elegant software designs I've seen is that of Dan Bernstein's Maildir. Maildir is a way of storing email messages common to Unix platforms. It's the successor to an earlier format called MBox. These two formats differ in a very big way. To better understand the driving factors behind Maildir it's important to know the flaws that it's trying to address. The flaw in MBox, it turns out, is baked into its design and cannot be fixed without throwing out the design and starting fresh; which is exactly what Dan did with Maildir.

The MBox format keeps all messages in a single text file in the user's home directory, or anywhere for that matter. Each message is delimited by a From line, that is, a line that starts with from:. As you can imagine, this design makes is easy to count the number of messages in the file by counting the number of from lines. Another benefit of this format is that since messagse are appended on to the end of the file all of the messages are in chronological order. These benefits, are not without their drawbacks.

Since MBox stores everything in a single file that file needs to be locked anytime a process wants to write to it. That means that message delivery agents (MDAs) need to lock the file while accepting mail for a user limiting the total number of messages that can be accepted for a user at a given time to one. Locking isn't the only problem that occurs when you store lots of information in a single file. When a message user agent (MUA) wants to remove a message from the middle of the file it has to rewrite the entire file without the message that the user wants to delete. Along the same lines, in order to get a listing of the messages in the mailbox the MUA must scan the entire file from the first byte to the last -- apparently listing and deleting messages were not considered a common use case. Another problem, albeit one that only reveals itself in extreme cases, is that if a mailbox gets extremely large it could potentially run up against the operating system's file size limitations. Lastly, the file format is extremely fragile -- an errant From: line coult corrupt the mailbox and render the mailbox unreadable to MUAs. When designing systems it's important to balance speed, efficiency, robustness, and simplicity; MBox seems to be designed for simplicity at the expense of others.

Clearly the bar is set fairly low for other mailbox formats. Luckily, Maildir wasn't designed as an incremental improvement over MBox, it was designed to maximize speed, efficiency, and simplicity; and it did so in an amazingly elegant and robust way.

Enter Maildir

Maildir, as the name implies stores its data in a directory structure. This is a crutial difference between Maildir and MBox. Maildir, unlike MBox purports to be a lock-free system -- the mailbox can be read and written to at the same time. Under high load this is a really good feature to have. The most interesting piece is that Maildir maintains its lock-free design even when the mailbox is stored on a remote system and mounted using NFS (network file system).

Not only is the system lock-free it also doesn't fall victim to the other problems faced by MBox; receiving and deleting messages are O(1) operations -- meaning that they take a constant amount of time to complete, instead of an amount of time proportional to the amount of messages in the mailbox. Another advantage to storing the mailbox as a directory is that the number of messages can be found by counting the number of files in the mailbox directories -- the MUA doesn't need to open the files at all unless it wants to get information from them.

So how does it work? Maildir is a fairly simple system which has the nice side effect of making it robust. A Maildir mailbox is a directory that contains 3 subdirectories to store messages: tmp, new, and cur. Messages that are being delivered are placed into tmp while they are being received. Once messages have been received they are moved to new before being read. After a user reads a new message it's moved into cur.

So far this seems like a fairly simple system; but what are the files that contain the messages called? Each message is given a unique name. The unique name is made up of a combination of machine and process dependent information such as the host name, current time, PID of the MDA delivering the message, etc. By concatenating a few of these together a globally unique name can be created; in the event that a collision occurs Maildir specifies that the MDA wait at least one second before trying again. This timeout is meant only to ensure that enough time has elapsed that the data used to create another unique name will be different.

That's effectively all there is to the mailbox format. Certainly it's simple, but how is it elegant or robust? To get a handle on that it's important to have an understanding of the properties of the Unix file system.

Unix File Systems 101

When most people think of a file on their computer they percieve a one-to-one mapping between file names and data; that is, a file and its data are synonyms. Under Unix this isn't the case. Unix treats file names and data as two separate things. What you might think of as the data in the file is actually known to the operating system as a unique number called an inode. Files are simply references to an inode. When you create a new file the file system allocates a new inode and associates the data you're saving with it. The name you specified for the file references the inode. When you delete the file you're deleting the reference to the inode. When the file system sees that there are no longer any references to the inode it knows that the space being used by that inode can be reclaimed.

Since multiple names can reference the same inode you can save disk space and headaches by having multiple names for the same inode. The process for creating a new name for an inode is called linking and there are two APIs (application programming interfaces) exposed by POSIX (the portable operating system interface [for unix]) that manage links: link and unlink. As you can gather from the name link creates a new name while unlink removes a name.

Linking and unlinking inodes is an atomic operation -- an operation that occurs instantly without the possibility of being interupted part way through. When a new link for a file is created the file is completely accessible, there's no delay while the data is copied on the disk. Simply put linking amounts to adding some accounting information in the file system but doesn't need to shuffle data around on the disk. This property of atomicity can be used by any application that requires that readers have a consistent view of the world.

Another feature of the Unix file system, and file systems in general is that the metadata -- the data about the files -- is stored separately from the file data. The stat information includes, but is not limited to: creation, modification and access times as well as size and ownership information. Having this information at the file system level means that it's relatively inexpensive to load information about a file or directory. Loading information about a file is called stating a file based on the name of the POSIX function stat. Since this information is standard information stored by the file system and isn't buried within a data file access to this data doesn't require specialized APIs -- the ones exposed by the OS are sufficient.

How Maildir Exploits the Unix File System

The features of the Unix file system mentioned previously are exploited very well by the design of Maildir. In order to ensure currency Maildir stores mail that is being received in a temporary directory, tmp. This ensures that readers of the mailbox, who read messages from the new and cur directories will not see messages that have not been completely received. Once a message is completely received it is linked over to the new directory. At the moment that the file system is updated with the new link to the file in new, all readers of the mailbox can see the message -- and the message they see is the full message, not a partially written one. In a similar vein, the stat information makes is very easy to get the size and received date of each message without having to parse the message file.

These benefits are extremely tangible, especially when considered in concert with the notion that a user's mailbox is generally located on a remote machine mounted using a Networked File System (NFS). Accessing files over a network is an order of magnitude slower than accessing files that are local to a particular machine; however the networked file system is presented to the system as any other file system -- neatly abstracted away behind the standard POSIX calls. From a user's perspective an NFS doesn't look any different than any other file system, it's just slower. But since Maildir is optimized to use the POSIX calls for many common operations, the whole experience is greatly improved.

Consider the case where an MUA wants to get a listing of the number of messages and sort them by date. An MUA that uses MBox would need to first lock the mailbox file, then read and parse its entire contents before it could provide the listing. For large mailboxes this is an expensive operation -- and an order of magnitude more expensive when the mailbox is on a remote machine. The MUA using Maildir would need to simply stat the file system. The data returned doesn't need to be parsed since its already in a standard POSIX structure. To get email specific information from each message such as subject, sender, etc Maildir helps by reducing the amount of I/O that needs to be performed. An MUA can simply read the first few lines of any message it's interested in to parse the headers. Contract this with MBox where it must read the entire mailbox file since the only way to know that you've reached the last message is when you hit the end of the file.

An Example -- How MTAs Accept Mail for Maildir Mailboxes

As mentioned previously the Maildir directory consists of at least 3 subdirectories: tmp, new, and cur. When writing new mail to a Maildir mailbox the tmp and new directories are used.

When a new message arrives, the system constructs a (possibly) unique identifier for the message. The MTA then checks tmp directory for a file with the same name. If a file with that name exists the MTA waits for 1 second and constructs a new unique identifier and tries again until it finds a unique identifer that doesn't clash with an existing file. The 1 second delay allows enough time for any time-related information used in the unique identifier to change.

The MTA then writes the message data to the uniquely named file in the tmp diectory. When the file has been completely recieved a new link is created to the message file in the new directory. The file is then unlinked from the tmp directory leaving a single link to the file in new.

Conclusion

The design of Maildir leverages the facilities provided by POSIX to produce an elegant solution that is simple, efficient and robust. Its clever use of the Unix file system and associated POSIX APIs demonstrates that it's possible to design a robust and scalable inter-process communication (IPC) mechanism using nothing but these facilites. In general, Maildir is a great example of what can be accomplished by having an understanding of the pieces of technology that a system is built on.