One of the most elegant software designs I've seen is that of Dan Bernstein's
Maildir. Maildir is a way of storing email messages common to Unix platforms.
It's the successor to an earlier format called MBox. These two formats differ
in a very big way. To better understand the driving factors behind Maildir it's
important to know the flaws that it's trying to address. The flaw in MBox, it
turns out, is baked into its design and cannot be fixed without throwing out the
design and starting fresh; which is exactly what Dan did with Maildir.
The MBox format keeps all messages in a single text file in the user's home
directory, or anywhere for that matter. Each message is delimited by a From
line, that is, a line that starts with from:. As you can imagine, this design
makes is easy to count the number of messages in the file by counting the number
of from lines. Another benefit of this format is that since messagse are
appended on to the end of the file all of the messages are in chronological
order. These benefits, are not without their drawbacks.
Since MBox stores everything in a single file that file needs to be locked
anytime a process wants to write to it. That means that message delivery agents
(MDAs) need to lock the file while accepting mail for a user limiting the total
number of messages that can be accepted for a user at a given time to one.
Locking isn't the only problem that occurs when you store lots of information in
a single file. When a message user agent (MUA) wants to remove a message from
the middle of the file it has to rewrite the entire file without the message
that the user wants to delete. Along the same lines, in order to get a listing
of the messages in the mailbox the MUA must scan the entire file from the first
byte to the last -- apparently listing and deleting messages were not considered
a common use case. Another problem, albeit one that only reveals itself in
extreme cases, is that if a mailbox gets extremely large it could potentially
run up against the operating system's file size limitations. Lastly, the file
format is extremely fragile -- an errant From: line coult corrupt the mailbox
and render the mailbox unreadable to MUAs. When designing systems it's
important to balance speed, efficiency, robustness, and simplicity; MBox seems
to be designed for simplicity at the expense of others.
Clearly the bar is set fairly low for other mailbox formats. Luckily, Maildir
wasn't designed as an incremental improvement over MBox, it was designed to
maximize speed, efficiency, and simplicity; and it did so in an amazingly
elegant and robust way.
Enter Maildir
Maildir, as the name implies stores its data in a directory structure. This is a
crutial difference between Maildir and MBox. Maildir, unlike MBox purports to
be a lock-free system -- the mailbox can be read and written to at the same
time. Under high load this is a really good feature to have. The most
interesting piece is that Maildir maintains its lock-free design even when the
mailbox is stored on a remote system and mounted using NFS (network file
system).
Not only is the system lock-free it also doesn't fall victim to the other
problems faced by MBox; receiving and deleting messages are O(1) operations --
meaning that they take a constant amount of time to complete, instead of an
amount of time proportional to the amount of messages in the mailbox. Another
advantage to storing the mailbox as a directory is that the number of messages
can be found by counting the number of files in the mailbox directories -- the
MUA doesn't need to open the files at all unless it wants to get information
from them.
So how does it work? Maildir is a fairly simple system which has the nice side
effect of making it robust. A Maildir mailbox is a directory that contains 3
subdirectories to store messages: tmp, new, and cur. Messages that are
being delivered are placed into tmp while they are being received. Once
messages have been received they are moved to new before being read. After a
user reads a new message it's moved into cur.
So far this seems like a fairly simple system; but what are the files that
contain the messages called? Each message is given a unique name. The unique
name is made up of a combination of machine and process dependent information
such as the host name, current time, PID of the MDA delivering the message, etc.
By concatenating a few of these together a globally unique name can be created;
in the event that a collision occurs Maildir specifies that the MDA wait at
least one second before trying again. This timeout is meant only to ensure that
enough time has elapsed that the data used to create another unique name will be
different.
That's effectively all there is to the mailbox format. Certainly it's simple,
but how is it elegant or robust? To get a handle on that it's important to
have an understanding of the properties of the Unix file system.
Unix File Systems 101
When most people think of a file on their computer they percieve a one-to-one
mapping between file names and data; that is, a file and its data are
synonyms. Under Unix this isn't the case. Unix treats file names and data as
two separate things. What you might think of as the data in the file is
actually known to the operating system as a unique number called an inode.
Files are simply references to an inode. When you create a new file the file
system allocates a new inode and associates the data you're saving with it. The
name you specified for the file references the inode. When you delete the file
you're deleting the reference to the inode. When the file system sees that
there are no longer any references to the inode it knows that the space being
used by that inode can be reclaimed.
Since multiple names can reference the same inode you can save disk space and
headaches by having multiple names for the same inode. The process for creating
a new name for an inode is called linking and there are two APIs (application
programming interfaces) exposed by POSIX (the portable operating system
interface [for unix]) that manage links: link and unlink. As you can gather
from the name link creates a new name while unlink removes a name.
Linking and unlinking inodes is an atomic operation -- an operation that occurs
instantly without the possibility of being interupted part way through. When a
new link for a file is created the file is completely accessible, there's no
delay while the data is copied on the disk. Simply put linking amounts to adding
some accounting information in the file system but doesn't need to shuffle data
around on the disk. This property of atomicity can be used by any application
that requires that readers have a consistent view of the world.
Another feature of the Unix file system, and file systems in general is that the
metadata -- the data about the files -- is stored separately from the file data.
The stat information includes, but is not limited to: creation, modification and
access times as well as size and ownership information. Having this information
at the file system level means that it's relatively inexpensive to load
information about a file or directory. Loading information about a file is
called stating a file based on the name of the POSIX function stat. Since
this information is standard information stored by the file system and isn't
buried within a data file access to this data doesn't require specialized APIs
-- the ones exposed by the OS are sufficient.
How Maildir Exploits the Unix File System
The features of the Unix file system mentioned previously are exploited very
well by the design of Maildir. In order to ensure currency Maildir stores mail
that is being received in a temporary directory, tmp. This ensures that
readers of the mailbox, who read messages from the new and cur directories
will not see messages that have not been completely received. Once a message is
completely received it is linked over to the new directory. At the moment
that the file system is updated with the new link to the file in new, all
readers of the mailbox can see the message -- and the message they see is the
full message, not a partially written one. In a similar vein, the stat
information makes is very easy to get the size and received date of each message
without having to parse the message file.
These benefits are extremely tangible, especially when considered in concert
with the notion that a user's mailbox is generally located on a remote machine
mounted using a Networked File System (NFS). Accessing files over a network is
an order of magnitude slower than accessing files that are local to a particular
machine; however the networked file system is presented to the system as any
other file system -- neatly abstracted away behind the standard POSIX calls.
From a user's perspective an NFS doesn't look any different than any other file
system, it's just slower. But since Maildir is optimized to use the POSIX calls
for many common operations, the whole experience is greatly improved.
Consider the case where an MUA wants to get a listing of the number of messages
and sort them by date. An MUA that uses MBox would need to first lock the
mailbox file, then read and parse its entire contents before it could provide
the listing. For large mailboxes this is an expensive operation -- and an order
of magnitude more expensive when the mailbox is on a remote machine. The MUA
using Maildir would need to simply stat the file system. The data returned
doesn't need to be parsed since its already in a standard POSIX structure. To
get email specific information from each message such as subject, sender, etc
Maildir helps by reducing the amount of I/O that needs to be performed. An MUA
can simply read the first few lines of any message it's interested in to parse
the headers. Contract this with MBox where it must read the entire mailbox file
since the only way to know that you've reached the last message is when you hit
the end of the file.
An Example -- How MTAs Accept Mail for Maildir Mailboxes
As mentioned previously the Maildir directory consists of at least 3
subdirectories: tmp, new, and cur. When writing new mail to a Maildir
mailbox the tmp and new directories are used.
When a new message arrives, the system constructs a (possibly) unique identifier
for the message. The MTA then checks tmp directory for a file with the same
name. If a file with that name exists the MTA waits for 1 second and constructs
a new unique identifier and tries again until it finds a unique identifer that
doesn't clash with an existing file. The 1 second delay allows enough time for
any time-related information used in the unique identifier to change.
The MTA then writes the message data to the uniquely named file in the tmp
diectory. When the file has been completely recieved a new link is created to
the message file in the new directory. The file is then unlinked from the
tmp directory leaving a single link to the file in new.
Conclusion
The design of Maildir leverages the facilities provided by POSIX to produce an
elegant solution that is simple, efficient and robust. Its clever use of the
Unix file system and associated POSIX APIs demonstrates that it's possible to
design a robust and scalable inter-process communication (IPC) mechanism using
nothing but these facilites. In general, Maildir is a great example of what can
be accomplished by having an understanding of the pieces of technology that a
system is built on.