I have been asked to forensically preserve hosted mailboxes countless times. I didn’t keep a tally, but Gmail (along with G Suite) was by far the most common target provider. Gmail’s servers are generally faster than those of most other free email service providers, which allows data acquisition to move along nicely. However, the label system, as much as I love its flexibility as an end user, can make things a bit more challenging.
Gmail Labels
Gmail labels are a great way to organize messages. They are similar to folders, but multiple labels can be assigned to each message. So, you do not have to make multiple copies of a message in order to assign different categories to it—similar, in a way, to tags in a legal review platform.
Some Gmail labels, such as “inbox” and “sent”, are created by Gmail (i.e., system labels), while others are created by users. System labels are reserved, and over IMAP most of them appear under a “[Gmail]” or “[Google Mail]” prefix. You can search for messages that have a specific label by using the “label:inbox” syntax (“in:inbox” and “l:inbox” also work). Searching for “in:inbox” is also a quick trick to get out of the Gmail priority inbox and see all your inbox contents together.
Gmail Forensic Preservation
Let’s take a look at a few common ways to forensically preserve a Gmail mailbox and how labels affect the outcome.
Gmail Forensic Preservation via IMAP
When an Internet Message Access Protocol (IMAP) client connects to Gmail, Gmail labels are represented as folders. This allows IMAP clients to work with Gmail labels, and even modify them, using standard IMAP commands such as CREATE, RENAME and DELETE that act on IMAP folders. However, because labels can overlap, this can cause massive duplication. For example, let’s look at a Gmail mailbox that contains 42 emails in total. The mailbox is presented to the IMAP client as in the folder structure below:
Figure 1 – Example Gmail Forensic Preservation
If all presented folders were downloaded, 111 messages would be preserved instead of 42. This is because some of the messages have multiple labels applied to them, so they are presented in more than one folder. For example, 14 items in “Sent Mail” also carry the “Important Messages” label, which is displayed as an additional folder.
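The duplication described above can be illustrated with a minimal sketch. It assumes that each downloaded message can be keyed by a stable identifier (the `msg_id` strings here are hypothetical stand-ins for something like a Message-ID header) and that the per-folder listings have already been collected. The folder contents are made-up sample data, not the mailbox shown in Figure 1.

```python
from collections import defaultdict


def collapse_folders(folders):
    """Map each unique message id to the sorted list of folders it appeared in."""
    labels_by_message = defaultdict(set)
    for folder_name, message_ids in folders.items():
        for msg_id in message_ids:
            labels_by_message[msg_id].add(folder_name)
    return {msg_id: sorted(names) for msg_id, names in labels_by_message.items()}


# Hypothetical folder listing as an IMAP client might present it.
sample_folders = {
    "INBOX": ["m1", "m2"],
    "[Gmail]/Sent Mail": ["m3", "m4"],
    "Important Messages": ["m2", "m3"],
    "[Gmail]/All Mail": ["m1", "m2", "m3", "m4", "m5"],  # m5 is archived
    "[Gmail]/Trash": ["m6"],  # trashed items are not under All Mail
}

collapsed = collapse_folders(sample_folders)
downloaded = sum(len(ids) for ids in sample_folders.values())
print(f"{downloaded} downloads cover {len(collapsed)} unique messages")
```

In this sample, every folder would have to be walked to reach all six messages, yet downloading every folder fetches twice as many items as there are unique messages.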
At first sight, it may seem like a good idea to exclude the “All Mail” folder from the preservation, as its contents should also be in the other folders. That is not the case. When a user archives a message, its inbox label is removed. Archived items can be found under “All Mail”. This sample mailbox contains 4 archived messages, which do not have a dedicated label and are presented to the IMAP client only under the “All Mail” folder.
On the other hand, “All Mail” is not all-inclusive, either. For instance, the 3 items listed under “Trash” are not included under “All Mail”. Additionally, if you only preserved “All Mail” via IMAP, most IMAP clients would not provide you with a list of which labels were applied to each message. Retrieving that list is technically possible, thanks to the Gmail IMAP Extensions, by requesting the X-GM-LABELS attribute with the IMAP FETCH command.
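As a rough illustration of what working with X-GM-LABELS involves, here is a minimal sketch that parses the label list out of a single untagged FETCH response line. The sample lines follow the shape shown in Google's IMAP extensions documentation; the function name `parse_gm_labels` is my own, and the sketch assumes one response per line and does not decode non-ASCII label names, which may arrive in an encoded form.

```python
import re

# One token inside the label list: a quoted string, a bare atom, or the closing paren.
TOKEN_RE = re.compile(r'\s*(?:"((?:[^"\\]|\\.)*)"|([^\s()"]+)|(\)))')


def parse_gm_labels(fetch_line):
    """Return the labels from a FETCH response line, or None if X-GM-LABELS is absent."""
    marker = "X-GM-LABELS ("
    start = fetch_line.find(marker)
    if start == -1:
        return None
    pos = start + len(marker)
    labels = []
    while True:
        match = TOKEN_RE.match(fetch_line, pos)
        if match is None:
            raise ValueError("malformed X-GM-LABELS list")
        quoted, atom, close = match.groups()
        if close:
            return labels
        if quoted is not None:
            # Undo backslash escapes inside quoted label names.
            labels.append(re.sub(r"\\(.)", r"\1", quoted))
        else:
            labels.append(atom)
        pos = match.end()


sample_line = r'* 1 FETCH (X-GM-LABELS (\Inbox \Sent Important "Muy Importante"))'
print(parse_gm_labels(sample_line))
```

With Python's standard `imaplib`, lines of this kind would come back from a call such as `conn.fetch("1:*", "(X-GM-LABELS)")` on an authenticated connection. That part is omitted here so the sketch runs without network access.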
Gmail Forensic Preservation via Gmail REST API
Modern email preservation tools typically connect to Gmail via its representational state transfer (REST) application programming interface (API). Since the Gmail API does not carry the burden of conforming to the standards of a legacy protocol such as IMAP, it is much more flexible. The 42 mail items above can easily be retrieved along with a list of labels applied to each message.
In addition to the label names, the Gmail REST API provides more information about each label, such as its id, the visibility of the label in the Gmail interface, the visibility of the messages within the label, total message and thread counts, total unread message and thread counts, and the label type (i.e., system or user label).
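To make the relationship between messages and labels concrete, here is a minimal offline sketch. It assumes label and message resources have already been retrieved and are available as plain dictionaries. The field names (`id`, `name`, `type`, `labelIds`, `messagesTotal`) follow the Gmail API reference, while the sample values and the helper names are hypothetical.

```python
def build_label_index(label_resources):
    """Index label resources by id so a message's labelIds can be resolved."""
    return {label["id"]: label for label in label_resources}


def labels_for_message(message, label_index):
    """Return (name, type) pairs for every label id on a message resource."""
    resolved = []
    for label_id in message.get("labelIds", []):
        label = label_index.get(label_id)
        if label is None:
            # Keep unknown ids visible rather than silently dropping them.
            resolved.append((label_id, "unknown"))
        else:
            resolved.append((label["name"], label["type"]))
    return resolved


# Hypothetical sample data shaped like Gmail API label and message resources.
sample_labels = [
    {"id": "INBOX", "name": "INBOX", "type": "system"},
    {"id": "SENT", "name": "SENT", "type": "system"},
    {"id": "Label_1", "name": "Important Messages", "type": "user", "messagesTotal": 14},
]
sample_message = {
    "id": "hypothetical-message-id",
    "threadId": "hypothetical-thread-id",
    "labelIds": ["SENT", "Label_1"],
}

label_index = build_label_index(sample_labels)
print(labels_for_message(sample_message, label_index))
```

Each message is stored once, and its labels travel with it as a list of ids. With the google-api-python-client library, the label resources would typically come from `service.users().labels().list(userId="me").execute()`, which is left out here so the example runs without credentials.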
Finally, the REST API is significantly faster and more reliable for downloading and preserving messages compared to IMAP.
Gmail Forensic Preservation via Google Takeout
Another option for acquiring the contents of a Gmail mailbox is to use Google Takeout. Google Takeout allows the user to choose a file type (.zip, .tgz and .tbz are currently supported) and a maximum archive size (larger archives are split into multiple files). It then delivers the archived emails either as a download link via email or by uploading the archive to Google Drive, Dropbox or Microsoft OneDrive.
The output from Google Takeout for Gmail is an mbox file, which includes the contents of spam and trash.
Here are my thoughts on the pros and cons of using Google Takeout:
Pros:
- The mbox output format is a reasonable near-native format for Gmail preservation. It is also supported by most digital forensics and eDiscovery tools.
- The emails within the mbox have their X-GM-THRID and X-Gmail-Labels fields populated in the message headers. So, similar to the REST API option, messages are downloaded only once and their labels are captured.
Cons:
- Depending on the mailbox size, creation of the archive may take a long time. Google says “hours or possibly days”. If the preservation must be completed expeditiously, which is usually the case, waiting for an undetermined amount of time without any progress update may be unacceptable.
- Initiating a Google Takeout request results in the creation of additional emails in the target mailbox. At a minimum, the mailbox would receive an email indicating that the archive is ready for download (in my experience, this email itself was not included in the created Takeout archive). When performing forensic preservation, I like to keep changes to the target system to a minimum. This clearly violates that principle.
- Many eDiscovery and digital forensics tools do not parse the X-GM-THRID and X-Gmail-Labels fields into their own queryable fields. So, although these fields are captured, they may not be immediately usable without some additional parsing work.
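The additional parsing work mentioned in the last point can be fairly light. Here is a minimal sketch using Python's standard `mailbox` module to pull X-GM-THRID and X-Gmail-Labels out of an mbox file. It assumes the labels header is a plain comma-separated list; label names that are encoded, folded, or contain commas would need extra handling. The sample mbox content and the `extract_labels` helper are hypothetical.

```python
import mailbox
import os
import tempfile

# Made-up two-message mbox, shaped like a Takeout export.
SAMPLE_MBOX = (
    "From 1234567890@xxx Mon Jan 01 00:00:00 +0000 2018\n"
    "X-GM-THRID: 1234567890\n"
    "X-Gmail-Labels: Sent,Important Messages\n"
    "From: alice@example.com\n"
    "Subject: Hypothetical message one\n"
    "\n"
    "Body one.\n"
    "\n"
    "From 1234567891@xxx Mon Jan 01 00:05:00 +0000 2018\n"
    "X-GM-THRID: 1234567891\n"
    "X-Gmail-Labels: Archived\n"
    "From: bob@example.com\n"
    "Subject: Hypothetical message two\n"
    "\n"
    "Body two.\n"
)


def extract_labels(mbox_path):
    """Return one row per message with its thread id, subject, and label list."""
    rows = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            raw = message.get("X-Gmail-Labels", "")
            labels = [part.strip() for part in raw.split(",") if part.strip()]
            rows.append({
                "thread_id": message.get("X-GM-THRID"),
                "subject": message.get("Subject"),
                "labels": labels,
            })
    finally:
        box.close()
    return rows


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.mbox")
    with open(sample_path, "w", newline="\n") as handle:
        handle.write(SAMPLE_MBOX)
    rows = extract_labels(sample_path)

for row in rows:
    print(row["thread_id"], row["labels"])
```

The resulting rows could be written to a CSV file or loaded into a review platform as a load file, so that labels become searchable alongside the messages.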
Conclusion
Gmail labels assigned to each message can often be a key piece of electronic evidence, and should be captured during Gmail acquisition. At the same time, downloading—and subsequently running through the processing and review workflow—each message several times over in order to capture the label structure is clearly not a good option.
My personal preference is to use the Gmail REST API for Gmail forensic preservation. This allows for a painless, fast and high-fidelity preservation of the messages along with their corresponding labels. When we designed Forensic Email Collector, we added REST API support and reference files that provide a list of all labels assigned to each message.
Google Takeout is also a reasonable option if modifying the target mailbox is not a concern, if mbox format works well with your workflow and if you can live with the uncertain completion time.
References:
- IMAP Extensions | Gmail IMAP | Google Developers—https://developers.google.com/gmail/imap/imap-extensions
- Users.labels | Gmail API | Google Developers—https://developers.google.com/gmail/api/v1/reference/users/labels