Native File Format for Preservation of Cloud Emails

Definition of Native File Format

Native file format is defined as the format in which a document was created and maintained by the creating software or system. When we talk about local emails, identifying the native format is relatively straightforward. For instance, in a Microsoft Exchange / Outlook environment, the native file format for the emails on the Exchange server would be the Exchange database—“.edb” (MAPI-based Database) and “.stm” (Streaming Database) files.

Locally cached copies of the emails would typically be stored in Offline Storage Table (OST) files and archived emails would be stored in Personal Storage Table (PST) files, which would be the native file format for local emails.

Near-Native File Format

When preserving or producing data in its true native format is not feasible, data is often converted to a near-native format. The conversion from native to near-native format must be performed in a way that preserves the essence of the original file, so a near-native file usually retains most of the original metadata of the native file.

It is important to note that not every conversion results in near-native format. For instance, the MSG version of an email message found in a PST file can be considered a near-native version of the original file as it retains most of the original metadata. I say “most”, because even PST to MSG conversion results in changes to MAPI properties such as PR_LAST_MODIFICATION_TIME and PR_LAST_MODIFIER_NAME (see Stephen Griffin’s post here), which should be taken into consideration especially in the digital forensics context.

On the other hand, creating a PDF version of the same email—even a searchable PDF—does not result in a near-native file.

Cloud Emails

When I discuss cloud email preservation with attorneys and eDiscovery project managers, preservation in native format comes up quite often. “We have a few custodians using Gmail, whose email we need to preserve in native format…” the conversation often starts. We then take a detour and talk about how “native file format” can be a bit ambiguous when it comes to data in the cloud. Instead, it is often best to focus on what would be a reasonable near-native file format in lieu of the native format for cloud data.

Let’s look at a few popular cloud email scenarios:

Gmail

A Gmail end user typically does not have access to the underlying data structures on Google’s servers which represent the native file format of the emails. Presumably, Google stores the emails in a database structure to facilitate efficient search and retrieval. This database structure may contain email data for multiple—perhaps thousands of—customers.

Instead of dealing directly with the back-end data structures, end users access their emails via the web-based Gmail graphical user interface, or download them using an email client, typically using Post Office Protocol (POP), Internet Message Access Protocol (IMAP) or the Gmail REST API.

When emails are downloaded via IMAP, they are transmitted in MIME format. Similarly, the Gmail REST API allows software to retrieve a message in “raw” format, which is the message in MIME format as a base64url-encoded string.
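To make the “raw” format more concrete, here is a minimal sketch of turning such a base64url string back into MIME bytes using only the Python standard library. The helper name decode_gmail_raw and the sample message are hypothetical, and the sample string stands in for the "raw" field of an actual API response, which is not fetched here.

```python
import base64
from email import message_from_bytes, policy


def decode_gmail_raw(raw: str) -> bytes:
    """Turn a base64url string (as in a "raw" message) into MIME bytes."""
    # base64url strings may arrive without '=' padding; restore it first.
    padded = raw + "=" * (-len(raw) % 4)
    return base64.urlsafe_b64decode(padded)


# Hypothetical stand-in for the "raw" field of an API response.
sample_mime = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Quarterly report\r\n"
    b"\r\n"
    b"See attached.\r\n"
)
sample_raw = base64.urlsafe_b64encode(sample_mime).decode("ascii").rstrip("=")

mime_bytes = decode_gmail_raw(sample_raw)

# Parsing is only for inspection; the bytes are what one would preserve.
message = message_from_bytes(mime_bytes, policy=policy.default)
print(message["Subject"])
```

For preservation purposes, I would write mime_bytes to disk unchanged (for example as an .eml file) and parse it only for review, so that the stored copy stays as close as possible to what the server provided.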

Gmail data can also be downloaded using Google Takeout. Google Takeout provides the emails in Mbox format, which is essentially concatenated MIME messages with a few caveats.
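As an illustration of the relationship between Mbox and MIME, the following sketch splits an Mbox file into one file per message using Python's standard mailbox module. The function name split_mbox, the file naming scheme and the two-message sample are hypothetical. One of the caveats applies here as well: lines that were escaped as ">From " inside the Mbox are written out as stored, and the mailbox module may normalize line endings, so the output should not be assumed to be byte-identical to the originally transmitted message.

```python
import mailbox
import os
import tempfile


def split_mbox(mbox_path, out_dir):
    """Write each message in an Mbox file to its own .eml file; return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, key in enumerate(box.keys(), start=1):
            # get_bytes() omits the "From " separator line that Mbox adds.
            data = box.get_bytes(key)
            path = os.path.join(out_dir, f"{index:06d}.eml")
            with open(path, "wb") as handle:
                handle.write(data)
            written.append(path)
    finally:
        box.close()
    return written


# Tiny hypothetical Mbox with two messages, for demonstration only.
sample = (
    b"From sender@example.com Mon Jan  1 00:00:00 2024\n"
    b"From: a@example.com\nSubject: first\n\nbody one\n\n"
    b"From sender@example.com Mon Jan  1 00:01:00 2024\n"
    b"From: b@example.com\nSubject: second\n\nbody two\n"
)
work_dir = tempfile.mkdtemp()
mbox_path = os.path.join(work_dir, "sample.mbox")
with open(mbox_path, "wb") as handle:
    handle.write(sample)

paths = split_mbox(mbox_path, os.path.join(work_dir, "eml"))
print(len(paths), "messages written")
```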

So, when we preserve emails by downloading them from Gmail, the closest we can get to native file format—short of accessing Google’s back-end infrastructure, which is unlikely—is MIME format. It is possible to move to MSG or PST format, which are very commonly used during eDiscovery processing and productions, after performing a conversion from MIME or Mbox—although I would like to avoid this additional conversion if possible.

Hosted Exchange/Office 365

As with most cloud services, the end user in a hosted Exchange environment does not get to deal directly with the back-end data store. Instead, she can access her data in several ways: Outlook Web Access; Outlook, which typically communicates with Exchange via MAPI/RPC, Outlook Anywhere or MAPI/HTTP; ActiveSync (for mobile devices); Exchange Web Services (EWS); and IMAP.

When messages are downloaded via EWS or IMAP, they are typically transmitted in MIME format. For instance, the full email message can be retrieved by using the GetItem operation in EWS and requesting the ItemSchema.MimeContent property.
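EWS is a SOAP-based service, and the MIME content comes back base64-encoded inside a MimeContent element of the GetItem response. The sketch below shows one way to pull the MIME bytes out of such a response with the Python standard library. The function name extract_mime_content is hypothetical, and the sample response is a trimmed, hand-written illustration rather than output captured from a real server, so treat the exact envelope layout as an assumption.

```python
import base64
import xml.etree.ElementTree as ET

# Namespace used for EWS item types such as Message and MimeContent.
TYPES_NS = "http://schemas.microsoft.com/exchange/services/2006/types"


def extract_mime_content(response_xml: str) -> bytes:
    """Return the decoded MIME bytes from a GetItem response body."""
    root = ET.fromstring(response_xml)
    node = root.find(f".//{{{TYPES_NS}}}MimeContent")
    if node is None or node.text is None:
        raise ValueError("response has no MimeContent element")
    return base64.b64decode(node.text)


# Hypothetical message and trimmed response, for demonstration only.
sample_mime = b"From: alice@example.com\r\nSubject: Status update\r\n\r\nAll good.\r\n"
encoded = base64.b64encode(sample_mime).decode("ascii")
sample_response = f"""<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:m="http://schemas.microsoft.com/exchange/services/2006/messages"
    xmlns:t="{TYPES_NS}">
  <s:Body>
    <m:GetItemResponse>
      <m:ResponseMessages>
        <m:GetItemResponseMessage ResponseClass="Success">
          <m:ResponseCode>NoError</m:ResponseCode>
          <m:Items>
            <t:Message>
              <t:MimeContent CharacterSet="UTF-8">{encoded}</t:MimeContent>
            </t:Message>
          </m:Items>
        </m:GetItemResponseMessage>
      </m:ResponseMessages>
    </m:GetItemResponse>
  </s:Body>
</s:Envelope>"""

mime_bytes = extract_mime_content(sample_response)
print(len(mime_bytes), "bytes of MIME content")
```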

IMAP

Many other cloud email providers such as Yahoo Mail, AOL Mail, Zoho and iCloud allow end users to access their mailboxes via IMAP. As I mentioned above, when the IMAP protocol is used, emails are transmitted in MIME format. So, again, MIME format would be a good choice for a reasonably usable near-native format for emails downloaded via IMAP.
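A rough sketch of what an IMAP download could look like with Python's standard imaplib module follows. The function name download_folder and the UID-based file naming are hypothetical choices. The folder is opened read-only and messages are requested with BODY.PEEK[], which asks the server not to set the \Seen flag; I would still verify on a test account that a given provider leaves the mailbox unchanged.

```python
import imaplib  # the connection passed in is expected to be an imaplib.IMAP4_SSL
import os


def download_folder(conn, folder, out_dir):
    """Save every message in an IMAP folder as a MIME (.eml) file.

    conn is an already authenticated imaplib connection.
    Returns the number of messages written.
    """
    os.makedirs(out_dir, exist_ok=True)

    # readonly=True opens the folder with EXAMINE instead of SELECT.
    status, _ = conn.select(folder, readonly=True)
    if status != "OK":
        raise RuntimeError(f"could not open folder {folder!r}")

    status, data = conn.uid("SEARCH", None, "ALL")
    if status != "OK":
        raise RuntimeError("UID SEARCH failed")

    count = 0
    for uid in data[0].split():
        # BODY.PEEK[] returns the full message without marking it as read.
        status, parts = conn.uid("FETCH", uid, "(BODY.PEEK[])")
        if status != "OK":
            continue
        for part in parts:
            # Message data arrives as (envelope, payload) tuples.
            if isinstance(part, tuple):
                path = os.path.join(out_dir, uid.decode("ascii") + ".eml")
                with open(path, "wb") as handle:
                    handle.write(part[1])
                count += 1
    return count
```

A call might look like conn = imaplib.IMAP4_SSL("imap.example.com"), followed by conn.login(user, password) and download_folder(conn, "INBOX", "preserved"), where the host name, credentials and output directory are placeholders.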

Conclusion

When dealing with cloud email, obtaining the email data in its true native format is usually not feasible. Typically, forensic preservation takes place by utilizing the public application programming interface (API) or email transmission protocols the service provider exposes. Before committing to produce in native file format, or expecting a native format ESI production, it is important to understand the various options for retrieval of the email messages so that an informed decision can be made.

For instance, if emails hosted in a Gmail account are of interest, MIME format (the format in which the Gmail servers transmit messages over IMAP or the Gmail REST API) can be a reasonable near-native alternative to the native file format.

References:

Multipurpose Internet Mail Extensions (MIME): https://tools.ietf.org/html/rfc2045
Internet Message Access Protocol (IMAP): https://tools.ietf.org/html/rfc3501
Exchange Storage Architecture: https://technet.microsoft.com/en-us/library/bb124808(v=exchg.65).aspx
Gmail API Documentation (Users.messages: get): https://developers.google.com/gmail/api/v1/reference/users/messages/get
EWS MimeContent Property: https://msdn.microsoft.com/en-us/library/office/microsoft.exchange.webservices.data.item.mimecontent(v=exchg.80).aspx