How to Recover Lost Emails from Google Desktop Search's Cache

Introduction

Occasionally files may accidentally be deleted or corrupted, or a storage medium may fail. When the lost file is an email mailbox/archive & there is no recent backup, the loss can be serious, with thousands of emails lost at once. They could contain much of one's written work which one may want to reuse, vital business contracts and sentimentally preserved personal letters. However, if one has Google Desktop Search (GDS) installed, its cache might contain the text of many of those emails and it may be possible not only to view & extract that text but even to reconstruct the emails.

Please treat this only as an emergency resort and please make proper back-ups! Backing-up can be tedious but it can save an awful lot of frustration and loss later on.

Important Note about GDS Versions

This article was originally based on experiments in September 2006 with GDS version 2 (dating from December 2005) & M$ Outlook 2000 running on M$ Windows 2k/XP. My programs for automating the task were matched to the specifics of that version of GDS.

It is likely that the same principles will apply to other versions of GDS that operate similarly but details might vary. In particular the programs work by recognising specific details of the GDS web site structure and page formats to find their way around in GDS, so they might need adaptation to work with other versions. As the programs are Perl scripts, you can open them in a plain text editor and alter them to match your GDS formats if required.

Updates so far by visitors to this site:

  1. In April 2007 (GDS version 4.5) a user reported that the original programs did not work with that version of GDS.
  2. In August 2008 (GDS version 5.7.0806) another person (who wishes to remain anonymous) reported that it only needed a few small changes to work with that version and supplied me with their updated version of my program.

Even if neither of those two versions of the automation program works when you need it, and you are not minded to try updating the program yourself, at least the principles described here may still work and can be carried out manually, if you have the patience. Good luck!

Explanation

For GDS to find files & emails quickly, it makes a text summary of each one it finds, stores that in its cache and builds up an index based on those summaries. It does not remove those summaries from its cache even when the original files or emails are deleted (it has been suggested to me that this feature is probably just because it is easier to keep everything than to delete individual items from an index). When asked to display an email that no longer exists, it displays its cached summary instead. For emails the summary seems typically to include the complete body text and usually the Subject, Date, To, From & other header fields that are commonly displayed.

Hence one can retrieve some of the lost emails by searching in GDS and saving the resulting cached copies.

Limitations

Some parts of emails seem not to be cached (or at least not displayed) by GDS. These include the attachments, inlined images, headers that are not normally displayed (the route etc.) and styling. Sometimes expected headers are missing. This seems to happen particularly with the From header. It might be that, when it recognises the recipient as the current user, it thinks it not worth bothering the user with a display of their own email address.

A really irritating problem is that there is no facility provided to extract all the emails, or even a search criterion to list them all on a page (which would enable one to download them as emails with the common 'wget' program with '-r -l 1' options). This limitation will be addressed later on in this article by means of additional programs that do provide that facility.

Of course, installing GDS after emails have been lost won't recover them.

Therefore try other methods to recover the original email archive file (e.g. undelete programs) first. Better still, make back-ups and take care not to accidentally permanently delete files.

Automating Extraction of Emails (GDS version 2)

I think all the emails can be found if one follows the links from the GDS home page to 'Browse Timeline' then 'email'. Repeatedly clicking '<Older' from there should progress through the complete history of cached (& still existent) emails. Each link to a cached email on each of those pages could then be clicked on and the resulting display of it saved as an HTML file. It would be very tedious to do manually though.

Here is a program I wrote to do that, extracting all cached emails from Google Desktop Search version 2, automatically: Download GDSCachedEmailExtractor.pl version 3 (5 KiB).

How to use it:

  1. System Requirements: GDS installed. A Perl interpreter.
  2. Open GDS.
  3. Copy the URL of the GDS home page.
  4. Open a text console window (aka "MS-DOS box" or "Command Prompt").
  5. Run the program from the command line with the GDS home page URL as the command line parameter. (I.e. type: perl GDSCachedEmailExtractor.pl "<GDS home page URL>" <Enter>.)
  6. The emails should (hopefully!) be saved as HTML files in a subdirectory of the current working directory called 'RecoveredEmail'.

Automating Extraction of Emails (GDS version 5.7)

An anonymous visitor to this site kindly provided an updated version to work with GDS 5.7. It requires one more step as it starts from the GDS timeline page rather than the home page but otherwise operates the same. As before it works by following the 'email' link from the timeline and repeatedly following the '<Older' links to, hopefully, progress through the complete history of cached (& still existent) emails.

Here is the program I wrote, with the visitor's modifications, to extract all cached emails from Google Desktop Search version 5.7, automatically: Download GDSCachedEmailExtractor.pl version 5 (5 KiB).

How to use it:

  1. System Requirements: GDS installed. A Perl interpreter.
  2. Open GDS.
  3. Click on 'Browse Timeline' to get the GDS timeline page.
  4. Copy the URL of the GDS timeline page.
  5. Open a text console window (aka "MS-DOS box" or "Command Prompt").
  6. Run the program from the command line with the GDS timeline page URL as the command line parameter. (I.e. type: perl GDSCachedEmailExtractor.pl "<GDS timeline page URL>" <Enter>.)
  7. The emails should (hopefully!) be saved as HTML files in a subdirectory of the current working directory called 'RecoveredEmail'.
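
The link-following approach used by both versions of the extractor can be illustrated with a short sketch. It is written in Python rather than Perl and is not the downloadable program; it only shows the general idea of walking the '<Older' chain and collecting the links to cached emails. The names walk_timeline, fetch and is_email_link are assumptions made for illustration, and the test that recognises an email link would have to be written to match the pages your GDS version actually produces.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def walk_timeline(start_url, fetch, is_email_link, max_pages=10000):
    """Follow 'Older' links from start_url, gathering cached-email URLs.

    fetch(url) must return the page's HTML as a string.
    is_email_link(href, text) is a hypothetical test; it has to be
    adapted to whatever markup the installed GDS version emits.
    """
    found, seen, url = [], set(), start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = LinkCollector()
        parser.feed(fetch(url))
        older = None
        for href, text in parser.links:
            target = urljoin(url, href)
            if is_email_link(href, text):
                if target not in found:
                    found.append(target)
            elif "Older" in text and older is None:
                older = target  # first link whose text mentions 'Older'
        url = older  # None on the last page, which ends the loop
    return found
```

In real use, fetch could be a small wrapper around urllib.request.urlopen that returns the decoded page, and each collected URL would then be fetched in turn and saved as an HTML file. The seen set and max_pages limit are there only to stop the loop if the pages ever link back on themselves.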

Automating Conversion of Email Web Pages to Emails

Unfortunately a collection of several thousand web pages containing email text is not as useful as emails in a mail folder in an email client sorted by Subject, Date, From etc., although far more flexible than when they were just stuck in GDS's cache. Fortunately the most common mail folder format, Mbox, is very simple and basically consists of just the emails pasted one after another with a line beginning with 'From ' as the separator. So one could copy the email body & headers from the HTML source code of those web pages, paste them in and add MIME type headers to tell the email client that the email body is in HTML. Unfortunately the dates & addresses in the headers are in the wrong format for emails so those will need conversion too. That would be tedious to do manually for a few emails let alone thousands.

Here is a program I wrote to do that, converting emails saved from Google Desktop Search back to an Mbox email folder, automatically: Download GDSSavedEmailsToMbox.pl (8 KiB).

How to use it:

  1. System Requirements: A Perl interpreter. (GDS need not be installed.)
  2. Save the program to a directory containing the saved email web pages (one per *.html file) or a parent directory thereof.
  3. Run the program, e.g. by opening a console window in that directory and typing perl GDSSavedEmailsToMbox.pl <Enter> or by double clicking its icon.
  4. A mail folder/archive file should (hopefully!) appear called 'RecoveredEmails.mbox' in the current working directory.

I used the Mozilla Thunderbird variant of Mbox as most email clients can import that (for Thunderbird itself, just copy the file into the directory Thunderbird is using for email folder files and restart Thunderbird). Unfortunately lousy M$ Outlook 2k is one that can't (although I've read that it might be possible via M$ Outlook Express).

It might not convert all header formats correctly. I don't know the full range of possibilities from GDS (e.g. addresses were displayed in at least 4 different formats) and so made my program understand those I found in my sample data.
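
To make the Mbox idea concrete, here is a minimal Python sketch of turning one recovered email into an Mbox entry. It is not GDSSavedEmailsToMbox.pl and makes several assumptions: the function name is invented, the displayed date format is made up (the real GDS pages may use something quite different), and the 'From - ' style of separator line is assumed to be acceptable to Thunderbird. Address conversion is left out entirely.

```python
from datetime import datetime
from email.message import EmailMessage
from email.utils import format_datetime

# ASSUMPTION: a made-up display format. Real GDS pages may differ.
ASSUMED_DATE_FORMAT = "%d %b %Y %H:%M"


def to_mbox_entry(subject, sender, recipient, displayed_date, html_body):
    """Return one Mbox entry: a 'From ' separator line, then the email."""
    sent = datetime.strptime(displayed_date, ASSUMED_DATE_FORMAT)

    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg["Date"] = format_datetime(sent)         # email-style date header
    msg.set_content(html_body, subtype="html")  # adds the MIME headers

    separator = "From - " + sent.strftime("%a %b %d %H:%M:%S %Y")
    # Lines in the email that start with 'From ' would be mistaken for
    # a separator, so they are conventionally escaped with '>'.
    lines = [">" + line if line.startswith("From ") else line
             for line in msg.as_string().splitlines()]
    return separator + "\n" + "\n".join(lines) + "\n\n"
```

Writing the entries for all recovered emails one after another into a single file gives the mail folder. Python's standard mailbox module could do the separator and escaping work instead; it is done by hand here only to show how little there is to the format.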

The Converse: Risks of GDS Caching

Of course there are risks to GDS indefinitely keeping copies/summaries of emails (and other documents) that one thought were deleted or password-protected. People whom one does not want to read one's documents might use it if they get hold of one's computer (or even just by obtaining the hard disk secondhand in the future unless it has been thoroughly wiped). The consequences could be anything from family members finding out about their birthday presents to rival businesses finding out commercial secrets. Hence it seems a good idea to periodically flush GDS's cache of deleted files and/or carefully ban it from indexing confidential areas. The only obvious way I know of flushing GDS's cache of deleted files is the drastic one of deleting all of GDS's index & cache files completely, after checking emails are backed-up successfully, and letting GDS time-consumingly rebuild the index in full.

I cannot give more experienced advice anent GDS as I have not used GDS on my own PCs. Instead I used the old fashioned method of taking the time to save files systematically under filenames and directories where I can find them in the future by either name search or by position in the directory tree.

A much better Alternative: Back-up!

I seriously suggest that you back-up your data properly in the first place. It is much easier to recover from proper full back-up copies than scavenging accidental back-ups from GDS etc. There are many ways to do back-ups; the routine I use is:

  1. Don't use M$ Outlook if you can avoid it. It has many deficiencies as an email client. Unfortunately many employers mandate it.
  2. Turn off 'autoarchive'. It causes old emails (by default, those over 2 weeks old) to be moved from where they appear to be to a hidden PST file whilst making it look in the client view as if they were still in the original place. I have known people not realise that their email, which they thought was still safely on the server, was on their local hard disk and lose the lot when upgrading to a new computer!
  3. Make sure you know where the PST files you create are. Preferably they should not be in some hidden place that does not get backed up.
  4. Keep recent stuff on the server (assuming the server is backed up of course) and periodically move the emails from the server to folders in the local PST file(s) instead of moving them immediately. Basically I treat my PST files as manually operated email archive locations not as live working mail folders. After doing that archiving, Outlook 2k will have read-locked the PST files it has written to so I shut down and restart Outlook to release that lock so that the PST files can be backed up during the next back-up run. (I might also make a manual copy-&-paste duplicate of the PST files before restarting Outlook if I am feeling extra paranoid.)
  5. I have a script that runs daily and copies any files it finds changed to a 2nd hard disk so I lose no more than a day's work if there is a hard disk crash and I have a daily roll-back option. The files backed-up include the PST file (provided Outlook is not read-locking it).
  6. I regularly back up all my data files to a DVD-RW, using 2 rotated in a father/grandfather scheme. I store the most recent back-up in a separate building to the PC lest the PC & my cupboards are destroyed together in a fire.
  7. Each year I make a permanent back-up of all my data files on DVD-Rs. This is lest I accidentally damage old files which I later need without noticing until after both my rotating back-up DVD-RWs have been reused. (A less frugal alternative would be to make permanent DVD-R back-ups every time rather than reusing DVD-RWs.)
  8. Each year I duplicate the emails I want to keep into a robust MBOX format lest my computer is replaced by one without Outlook (M$ have been notorious in using many incompatible mailbox formats - MS Mail, MS Exchange Client, MS Outlook and MS Outlook Express use different formats and cannot all read each other's - whilst most other programs have stuck with easily convertible formats or MBOX or something close to MBOX). There was no facility in Outlook to export to another format but I found I could do it by copying the emails back to the Exchange mailserver, downloading them into DBX format using M$ Outlook Express (which I think uses Outlook via OLE - at least it needed Outlook running to work) then importing from that into MBOX using Thunderbird.
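
Item 5 of the routine above (a daily copy of changed files to a second disk) might look something like the following Python sketch. It is not the script referred to in the list; the function name and the test for 'changed' (a different size or a newer modification time) are assumptions, and it leaves out the roll-back copies.

```python
import shutil
from pathlib import Path


def copy_changed_files(source_dir, backup_dir):
    """Copy files that are new or changed since the last run.

    Returns the list of backup paths that were written.
    """
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    copied = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        dst = backup_dir / src.relative_to(source_dir)
        if dst.exists():
            s, d = src.stat(), dst.stat()
            # Treat as unchanged if the size matches and the source is
            # not newer (2 s slack for coarse file system timestamps).
            if s.st_size == d.st_size and s.st_mtime - d.st_mtime <= 2:
                continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        try:
            shutil.copy2(src, dst)  # copy2 keeps the modification time
        except OSError as err:
            # For example, a PST file that Outlook still has locked.
            print(f"skipped {src}: {err}")
            continue
        copied.append(dst)
    return copied
```

Run once a day by the operating system's scheduler, something like this would keep the second disk no more than a day behind. Files that cannot be read, such as a locked PST file, are reported and skipped rather than stopping the whole run.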

Why I created this Article & the Programs

A friend of mine accidentally lost his email archive when a back-up failed (lousy M$ Outlook 2k unnecessarily read-locks, not just write-locks, its PST mail archive files, which prevents the back-up working) and a faulty back-up (which turned out to be 47 MiB of null bytes!) was then accidentally copied over the original. My friend found that GDS had copies and so I created the programs to get them out of GDS back into a mailbox.