Google Takeout and Google Vault are commonly used to export email evidence for digital forensic investigations and eDiscovery. We often receive questions about Google’s built-in export features and how they compare to dedicated forensic email preservation tools such as Forensic Email Collector. In this post, I will take a close look at the data exported by Google Takeout and Google Vault, discuss their pros and cons, and compare them to third-party tools.
Let’s start with Google Takeout, which is available to a wider audience than Vault, namely all Gmail users:
Export Options and Filtering
One of the major weaknesses of Google Takeout is its lack of customizability. At the time of this writing, Takeout only allows mbox output, and the only way you can narrow the data set down is by using existing Gmail labels.
This leaves no opportunity to perform a pre-acquisition search without modifying the target mailbox. On the other hand, dedicated forensic tools that use the Gmail API can run instant in-place searches to narrow down the data set before the acquisition.
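To make the idea of an in-place search more concrete, here is a minimal sketch of how a search string could be assembled from a few criteria using Gmail's documented search operators (from:, after:, before:, label:). The helper name build_gmail_query and its parameters are hypothetical and are not taken from any particular tool. The resulting string is the kind of value that could be passed as the q parameter of the Gmail API's users.messages.list method; the sketch itself makes no API call.

```python
from datetime import date
from typing import Optional


def build_gmail_query(
    sender: Optional[str] = None,
    after: Optional[date] = None,
    before: Optional[date] = None,
    label: Optional[str] = None,
    terms: Optional[str] = None,
) -> str:
    """Assemble a Gmail search string from optional criteria.

    Hypothetical helper for illustration only. It builds the query text
    and does not contact Gmail.
    """
    parts = []
    if sender:
        parts.append(f"from:{sender}")
    if after:
        # Gmail's date operators accept the YYYY/MM/DD form.
        parts.append(f"after:{after:%Y/%m/%d}")
    if before:
        parts.append(f"before:{before:%Y/%m/%d}")
    if label:
        parts.append(f"label:{label}")
    if terms:
        parts.append(terms)
    return " ".join(parts)


query = build_gmail_query(
    sender="alice@example.com",
    after=date(2018, 1, 1),
    before=date(2019, 1, 1),
)
print(query)  # from:alice@example.com after:2018/01/01 before:2019/01/01
```

Because the search runs on the server side, nothing in the target mailbox (such as a new label) has to be created to scope the collection.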
On the positive side, Takeout allows exporting numerous other data points from the end user’s Google account such as Photos, Fit, Keep, etc. In this post, I will focus on email data.
Once the export is started, Google indicates that the creation of the archive is in progress:
For small mailboxes, this is a non-issue. For a large mailbox, however, the archive may take days to be created, and no progress indicator is offered during this process, so it is hard to tell if and when the archive will be ready. We have received reports that a Google Takeout export sometimes fails to complete on large mailboxes, and that no indication of the failure is provided.
Google introduces two new emails to the target mailbox during the Takeout export:
One email indicates that the export has been requested, and another arrives once the archive is ready. While this may be a useful security measure (after all, nobody wants their mailbox exported without their knowledge), it is not ideal from a forensics standpoint. One of the goals of a forensic examiner is to minimize changes to the target evidence. Unfortunately, Google has started sending notification emails to mailboxes even when a new app is authorized to access them. So, acquiring a mailbox using a forensic tool via the Gmail API also results in one email being sent to the target, unless domain-wide delegation is used.
Let’s now take a look at the data Takeout provides:
takeout-20191212T041533Z-001.zip
    Takeout\
        Mail\
            All mail Including Spam and Trash.mbox
        archive_browser.html
The Takeout export contains two files: an mbox file containing all of the emails, and an HTML file with a basic description of the data. If multiple Gmail labels are selected individually, a separate mbox file is generated for each label.
Two things are notably missing:
- Detailed logs which can be used for quality control as well as to import metadata into digital forensics and eDiscovery tools
- Cryptographic hashes of the exported items
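Since Takeout does not supply item-level hashes, an examiner may want to compute them after the export. The following is a minimal sketch using Python's standard mailbox and hashlib modules. The function name hash_mbox_messages and the record layout are my own assumptions, and the hash covers each message as it is stored in the mbox file (without the "From " separator line), which is not guaranteed to be byte-identical to the copy held on Google's servers.

```python
import hashlib
import mailbox


def hash_mbox_messages(mbox_path: str) -> list[dict]:
    """Return one record per message with a SHA-256 of its stored bytes.

    Hypothetical helper: the field names in each record are illustrative.
    """
    records = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for key in box.iterkeys():
            # get_bytes() omits the mbox "From " separator line by default.
            raw = box.get_bytes(key)
            message = box.get_message(key)
            records.append(
                {
                    "index": key,
                    "message_id": str(message.get("Message-Id", "")),
                    "sha256": hashlib.sha256(raw).hexdigest(),
                    "size": len(raw),
                }
            )
    finally:
        box.close()
    return records
```

The returned records could be written out as a CSV log, which would also go some way toward the detailed logging that Takeout lacks.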
Since all emails are exported in a single mbox file, the export does not have a folder structure that reflects the Gmail labels. This is consistent with how Gmail works—the labels are simply tags that are used to categorize messages, rather than folders. That said, eDiscovery and digital forensics firms often prefer to have the output folder structure reflect the labels in the mailbox. If this is your preference, you have two options:
- Export each label into a separate mbox file and then piece things together
- Utilize the X-Gmail-Labels header field that Takeout inserts into the messages to construct file paths
Neither option is very elegant if you are seeking a folder structure. Alternatively, you can ingest the emails in a flat folder structure and capture the X-Gmail-Labels header field contents (if your tool supports it, or with custom scripting) to populate a GMAIL_LABELS type field to serve as a multi-entry file path field. We faced a similar challenge when designing Forensic Email Collector, and added the ability to optionally create output folder paths based on the Gmail labels applied to each message.
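As an illustration of the custom scripting route, here is a minimal sketch that reads the X-Gmail-Labels header field from a single exported message and turns it into a multi-entry value. The function names, the GMAIL_LABELS delimiter, and the assumption that labels are comma-separated (and possibly RFC 2047 encoded when they contain non-ASCII characters) are mine. A label that itself contains a comma would be split incorrectly by this approach.

```python
from email.header import decode_header, make_header
from email.parser import BytesHeaderParser


def gmail_labels(raw_message: bytes) -> list[str]:
    """Return the labels listed in a message's X-Gmail-Labels header field."""
    headers = BytesHeaderParser().parsebytes(raw_message)
    value = headers.get("X-Gmail-Labels")
    if value is None:
        return []
    # Decode RFC 2047 encoded-words, if any, then collapse folding whitespace.
    decoded = str(make_header(decode_header(value)))
    decoded = " ".join(decoded.split())
    return [label.strip() for label in decoded.split(",") if label.strip()]


def gmail_labels_field(raw_message: bytes, delimiter: str = "; ") -> str:
    """Join the labels into one multi-entry value (delimiter is an assumption)."""
    return delimiter.join(gmail_labels(raw_message))


sample = (
    b"X-GM-THRID: 1610964845928120745\n"
    b"X-Gmail-Labels: Inbox,Opened,Category Promotions,Starred,Sonos,Amazon,Apple\n"
    b"Subject: test\n"
    b"\n"
    b"body\n"
)
print(gmail_labels_field(sample))
```

The same list could be used to build one output path per label instead of a single field, at the cost of duplicating messages that carry several labels.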
One of the first questions that comes to mind when doing forensic work is: How close are we to the native format? Let’s take a look at how Takeout’s output compares to the original message view in RFC 5322 format as displayed by Gmail through the “Show Original” menu item.
Here is a Gmail message that was exported using Google Takeout in mbox format (most of the body trimmed for brevity):
From 1610964845928120745@xxx Fri Sep 07 15:56:38 +0000 2018
X-GM-THRID: 1610964845928120745
X-Gmail-Labels: Inbox,Opened,Category Promotions,Starred,Sonos,Amazon,Apple
Delivered-To: email@example.com
Received: by 2002:a0d:e786:0:0:0:0:0 with SMTP id q128-v6csp2160601ywe; Fri, 7 Sep 2018 08:56:37 -0700 (PDT)
X-Google-Smtp-Source: ANB0VdZ1MynRH1/TDbCBT9jkF/UHC+quD2mGipsdkclVpYKB+wtGHG0rvY7yo7TmDlRydFua5wPh
X-Received: by 2002:a1c:148f:: with SMTP id 137-v6mr5389221wmu.61.1536335797874; Fri, 07 Sep 2018 08:56:37 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1536335797; cv=none; d=google.com; s=arc-20160816; b=YN1q0cI0sBlgoAgWtQEpOHoNkZghdgM+/145lvmlTC+0STMmBc9guhDJUHQxI0m/Yl 2OZ0yiQg4GsvWTubNt25286agSGDpP10KGXg7RYVDADyKu0zQHMAgjmEJu9sgrIsaxql B9F4J/9OkYPg/bvlSPbeWRCyTJiUzxmZN7ZW2nK/9q2/2GGK+MHdwG6ES25HLw7+Q1mD s3oxj2Xf7vOwO6nUbK4VuFYBMEtyCdLZkNn/zsL/Vpy6ENCmS0ZjkGjNmgrXkBpuOwIZ zD9wG8biBiihKCsUjvaZ07ijtRIl2eRPNAqoHnRhtDrX5wS9uBpazrE20IgLREsBUogI pTSg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=feedback-id:list-unsubscribe:date:subject:to:reply-to:from :mime-version:message-id:dkim-signature:dkim-signature; bh=ECR58m+6ReYnmozlTaIKHGekLilnoHbvSkx5N6oLYek=; b=TesXQ7+5GbjUZM+3JwrntEbRzBxfSaUiVj/SQl0Brtuy36nggM+xWrKER5wQk1gYf1 w3jDmSnaLrp5IHg1TlFtNdOpMf6fl+rCRQM+VERr7fx9eEwRJ8ZxU09Ntinzwv86/saO b72IVUkPWrJc4Fw3jukHNcbYg0F6M/E3VBnS+i86/Wp846BL5H+pCZ1F8U3UUu27e7yl XmyRlKQmswNpcQzQSLEi96BuYHm9fTzarRNm3qgGAuLc+EVSRvZgIHI/7jncsewNIaaR IV4LxlZ8Md93e8IwssJpFWQWxTJ0qoeqgpudYi4GO9Guoq/q4eChGm/UQg1B8deg4X97 qcPw==
ARC-Authentication-Results: i=1; mx.google.com; dkim=pass firstname.lastname@example.org header.s=mailjet header.b=YFluYwxF; dkim=pass email@example.com header.s=mailjet header.b=bvvgoaVd; spf=pass (google.com: domain of firstname.lastname@example.org designates 188.8.131.52 as permitted sender) smtp.mailfrom=93540883.AJsADKFA1H0AAAYrUVAAAAd9o_wAAAAIijYAAAAAAAYklQBbkpemail@example.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=producthunt.com
Return-Path: <93540883.AJsADKFA1H0AAAYrUVAAAAd9o_wAAAAIijYAAAAAAAYklQBbkpfirstname.lastname@example.org>
Received: from o12.p10.mailjet.com (o12.p10.mailjet.com. [184.108.40.206]) by mx.google.com with ESMTPS id q15-v6si6869154wmg.20.2018.09.07.08.56.37 for <email@example.com> (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Fri, 07 Sep 2018 08:56:37 -0700 (PDT)
Received-SPF: pass (google.com: domain of firstname.lastname@example.org designates 220.127.116.11 as permitted sender) client-ip=18.104.22.168;
Authentication-Results: mx.google.com; dkim=pass email@example.com header.s=mailjet header.b=YFluYwxF; dkim=pass firstname.lastname@example.org header.s=mailjet header.b=bvvgoaVd; spf=pass (google.com: domain of email@example.com designates 22.214.171.124 as permitted sender) smtp.mailfrom=93540883.AJsADKFA1H0AAAYrUVAAAAd9o_wAAAAIijYAAAAAAAYklQBbkpfirstname.lastname@example.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=producthunt.com
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; q=dns/txt; d=digest.producthunt.com; email@example.com; s=mailjet; h=message-id:mime-version:from:reply-to:to:subject:date:list-unsubscribe: x-csa-complaints:x-mj-mid:x-report-abuse-to:feedback-id:content-type; bh=X4/WO9Tug4IGFlvwWKGKzAN+0xgvo+lyFMtM1IM35iI=; b=YFluYwxF9rj6RVQ8Rof4jDcnJm7dovvHaXDFWST2UqQUyi39eA2exDFA5 v+Tojv7P8gtT0nGxoO+58X9lVbt5dS6smHbbz+Ep04IARA057ZfoFpRYFprq UTrMocemUxK1/pjanUP1lJEGq8Mp5rREq1Hnv643qjDQJR/H8OVAkw=
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; q=dns/txt; d=bnc3.mailjet.com; s=mailjet; h=message-id:mime-version:from:reply-to:to:subject:date:list-unsubscribe: x-csa-complaints:x-mj-mid:x-report-abuse-to:feedback-id:content-type; bh=X4/WO9Tug4IGFlvwWKGKzAN+0xgvo+lyFMtM1IM35iI=; b=bvvgoaVdY2bOicL7Sj9G/z48T5emFpbr8GwT8FCZEq6OeVQWFd72sJXtc YUiPg9dUG9PZWMbyq2H6x7qn7wdhZ2MWStHSnys+h/AcPHH9gg03JMceZs2J C0Wa8nxH9qtvtEze7UNmakxBM6KEuntb+pBjK79sIgs4qYbfhij4nw=
Message-Id: <93540883.AJsADKFA1H0AAAYrUVAAAAd9o_wAAAAIijYAAAAAAAYklQBbkpfirstname.lastname@example.org>
MIME-Version: 1.0
From: Product Hunt Daily <email@example.com>
Reply-To: firstname.lastname@example.org
To: email@example.com
Subject: =?UTF-8?Q?Sonos_is_taking_on_Amazon=2C_Apple=2C_?= =?UTF-8?Q?and_Google_=F0=9F=94=88?=
Date: Fri, 7 Sep 2018 15:56:37 +0000
List-Unsubscribe: <mailto:firstname.lastname@example.org>
X-CSA-Complaints: email@example.com
X-MJ-Mid: AJsADKFA1H0AAAYrUVAAAAd9o_wAAAAIijYAAAAAAAYklQBbkp-1f0Ost4AtRzqXcyOEfUjaAgAF1QU
X-REPORT-ABUSE-TO: Message sent by Mailjet please report to firstname.lastname@example.org with a copy of the message
Feedback-Id: 382213.402581:MJ
Content-Type: multipart/alternative; boundary="=-T0QpPbeTLWTG1kb+KWtR"

--=-T0QpPbeTLWTG1kb+KWtR
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Your Daily Digest from Product Hunt
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
Sonos is taking on Amazon, Apple, and Google... all at once.
<--- rest removed for brevity --->
When compared to the copy obtained using the “Show Original” menu or Forensic Email Collector, the only differences are the first three rows: the “From ” separator line and the X-GM-THRID and X-Gmail-Labels header fields. The first row is a separator line as defined by the mbox format, and its presence is to be expected. Rows 2 and 3 are header fields inserted by Google Takeout to indicate the Gmail thread ID and the labels applied to the message.
While the benefit of altering the original message is questionable, in this case, the two additional header fields would not interfere with DKIM verification or other investigative techniques, so they do not pose a major problem.
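If you want to compare a Takeout copy against a “Show Original” copy, for example by hashing both, the three added rows need to be removed first. Below is a minimal sketch that operates on the raw bytes of a single mbox entry. The function name strip_takeout_additions is hypothetical, and the sketch deliberately ignores other possible differences such as line-ending normalization or mbox “>From ” escaping in the body, so a mismatch after stripping does not by itself prove that the message was altered.

```python
# Header fields that Takeout adds, per the comparison above (lower-cased).
TAKEOUT_FIELDS = (b"x-gm-thrid:", b"x-gmail-labels:")


def strip_takeout_additions(mbox_entry: bytes) -> bytes:
    """Remove the mbox separator line and the Takeout-added header fields."""
    lines = mbox_entry.splitlines(keepends=True)
    # Drop the leading mbox "From " separator line, if present.
    if lines and lines[0].startswith(b"From "):
        lines = lines[1:]

    kept = []
    in_headers = True
    skipping = False  # True while inside a header field being removed
    for line in lines:
        if in_headers:
            if line in (b"\n", b"\r\n"):
                in_headers = False  # blank line ends the header block
            elif line[:1] in (b" ", b"\t"):
                if skipping:
                    continue  # folded continuation of a removed field
            else:
                skipping = line.lower().startswith(TAKEOUT_FIELDS)
                if skipping:
                    continue
        kept.append(line)
    return b"".join(kept)
```

Only the header block is filtered, so a body line that happens to start with one of these field names is left untouched.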
Google Drive Attachments
One of the common challenges when preserving mailboxes is the presence of attachments that were inserted as links to cloud storage services such as Drive rather than as real attachments. At the time of this writing, Google Takeout does not provide an option to acquire such email attachments and preserve them along with their parent emails.
When targeting G Suite, it is possible to use Google Vault to export emails for digital forensics and eDiscovery. Vault is included in Business and Enterprise G Suite plans, but not G Suite Basic.
Google Vault’s output is very similar to that of Google Takeout, but Vault solves some of the problems that Takeout has for legal work. Let’s go over which problems are solved, and which remain:
Problems Solved by Google Vault Compared to Google Takeout
- Vault allows you to search the target mailbox using search terms and preview the results before export.
- Vault does not insert the X-Gmail-Labels and X-GM-THRID header fields into the message. Instead, they are delivered in a supplemental XML file.
- Vault provides MD5 hashes of the exported files.
- Vault provides a status indicator during the export.
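The MD5 hashes that Vault provides are only useful if they are checked after download. Here is a minimal sketch of such a check using Python's hashlib. The function names are hypothetical, and I am assuming the expected value is available as a hexadecimal string; how the checksums are laid out in the Vault export is not covered here.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 of a file in chunks so large exports fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected_md5(path: str, expected_hex: str) -> bool:
    """Compare a file's MD5 with an expected hex value (assumed format)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Running this against each downloaded file and recording the result gives a simple verification log for the case file.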
Problems Remaining with Vault
Even using Vault, there are still a few missing features and remaining problems:
- Vault doesn’t support acquiring Drive attachments of emails, or the revisions of such attachments.
- Downloaded emails are exported together in a container (mbox or PST). While this works in most cases, individual mail export in RFC 5322 format is ideal for forensic authentication.
- Vault provides a CSV file that indicates the number of items exported from each mailbox. While better than nothing, this is no substitute for detailed logs that could be used for quality assurance.
- Similar to Takeout, Vault doesn’t export the messages in a folder structure. I personally do not mind this, but it might be a challenge in some eDiscovery workflows.
Google recently introduced a security feature called Confidential Mode. Messages sent using Confidential Mode are not directly accessible, and are not exposed through the Gmail API as of this writing. Google Vault supports exporting the full contents of Confidential Mode emails, and is a viable option to at least complement the output of forensic email preservation tools.
Confidential messages can be isolated using the search query label:confidentialmode. This query can also be used to apply holds and custom retention rules to confidential messages.
Since Google Vault is a built-in eDiscovery tool, it allows organizations to set up legal holds to retain data indefinitely to meet their legal or preservation obligations. Accessing the items subject to legal hold can be done through Vault, which is a separate process from acquiring emails through the Gmail API using third-party tools.
Google Takeout and Vault are great tools that give individuals and organizations the means to export their own data for legal use. Their output, while not perfect, is closer to “native” format than what can be achieved by collecting mailboxes using general-purpose email clients such as Outlook. That said, depending on the requirements of the case and the workflow to be used, it might be appropriate to use a dedicated forensic email preservation solution which can provide benefits such as flexible output formats, folder structure based on Gmail labels, detailed logs and item-level hashing, fault tolerance, and Drive attachment and revision acquisition.
If a third-party tool is used, I would recommend checking for the presence of confidential messages with an instant in-place search using the label:confidentialmode query and supplementing the acquisition with a partial Vault export of the confidential items as well as any holds where appropriate.
Google Takeout, Google Vault, Gmail, and G Suite are trademarks or registered trademarks of their respective holders.