Occasionally files are accidentally deleted or corrupted, or a storage medium fails. When the lost file is an email mailbox/archive & there is no recent backup, the loss can be serious, with thousands of emails lost at once. They could contain much of one's written work which one may want to reuse, vital business contracts and sentimentally preserved personal letters. However, if one has Google Desktop Search (GDS) installed, its cache might contain the text of many of those emails and it may be possible not only to view & extract that text but even to reconstruct the emails.
Please treat this only as an emergency resort and please make proper back-ups! Backing-up can be tedious but it can save an awful lot of frustration and loss later on.
This article was originally based on experiments in September 2006 with GDS version 2 (dating from December 2005) & Microsoft Outlook 2000 running on Microsoft Windows 2k/XP. My programs for automating the task were matched to the specifics of that version of GDS.
It is likely that the same principles will apply to other versions of GDS that operate similarly but details might vary. In particular the programs work by recognising specific details of the GDS web site structure and page formats to find their way around in GDS, so they might need adaptation to work with other versions. As the programs are Perl scripts, you can open them in a plain text editor and alter them to match your GDS page formats if required.
Updates so far by visitors to this site:
Even if neither of those two versions of the automation program works when you need it and you are not minded to try updating the program yourself, at least the principles described here may still work and can be carried out manually, if you have the patience. Good luck!
For GDS to find files & emails quickly, it makes a text summary of each one it finds, stores that in its cache and builds up an index based on those summaries. It does not remove those summaries from its cache even when the original files or emails are deleted (it has been suggested to me that this is probably just because it is easier to keep everything than to delete individual items from an index). When asked to display an email that no longer exists, it displays its cached summary instead. For emails the summary typically seems to include the complete body text and usually the Subject, Date, To, From & other header fields that are commonly displayed.
Hence one can retrieve some of the lost emails by searching in GDS and saving the resulting cached copies.
Some parts of emails seem not to be cached (or at least not displayed) by GDS. These include attachments, inlined images, headers that are not normally displayed (the route etc.) and styling. Sometimes expected headers are missing; this seems to happen particularly with the From header. It might be that when GDS recognises the recipient as the current user, it thinks it not worth bothering the user with a display of their own email address.
A really irritating problem is that there is no facility provided to extract all the emails, or even a search criterion to list them all on one page (which would enable one to download them with the common 'wget' program and its '-r -l 1' options). This limitation will be addressed later on in this article by means of additional programs that do provide that facility.
Of course, installing GDS after emails have been lost won't recover them.
Therefore try other methods to recover the original email archive file (e.g. undelete programs) first. Better still, make back-ups and take care not to delete files permanently by accident.
I think all the emails can be found if one follows the links from the GDS home page to 'Browse Timeline' then 'email'. Repeatedly clicking '<Older' from there should progress through the complete history of cached (& still existent) emails. Each link to a cached email on each of those pages could then be clicked on and the resulting display of it saved as an HTML file. It would be very tedious to do manually though.
Here is a program I wrote to do that, extracting all cached emails from Google Desktop Search version 2, automatically: Download GDSCachedEmailExtractor.pl version 3 (5 KiB).
How to use it:
perl GDSCachedEmailExtractor.pl "<GDS home page URL>" <Enter>
An anonymous visitor to this site kindly provided an updated version to work with GDS 5.7. It requires one more step as it starts from the GDS timeline page rather than the home page but otherwise operates the same. As before, it works by following the 'email' link from the timeline and then repeatedly following the '<Older' links to, hopefully, progress through the complete history of cached (& still existent) emails.
Here is the program I wrote, with the visitor's modifications, for extracting all cached emails from Google Desktop Search version 5.7 automatically: Download GDSCachedEmailExtractor.pl version 5 (5 KiB).
How to use it:
perl GDSCachedEmailExtractor.pl "<GDS timeline page URL>" <Enter>
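For anyone needing to adapt the approach to another GDS version, the timeline walk described above can be sketched roughly as follows. This is a minimal illustration in Python, not the Perl extractor itself. It assumes only what the article states, namely that each timeline page carries a link whose text is '<Older'. The `fetch` and `is_email_link` parameters are hypothetical placeholders: the first returns the HTML of a URL and the second decides whether a URL points at a cached email, which depends on the page format of your GDS version.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> element on a page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def walk_timeline(start_url, fetch, is_email_link, max_pages=10000):
    """Follow '<Older' links from start_url; return cached-email URLs in order.

    fetch(url) -> HTML text and is_email_link(url) -> bool are supplied by
    the caller (hypothetical helpers, not part of GDS).
    """
    found, seen, url = [], set(), start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)                      # guards against link loops
        parser = LinkCollector()
        parser.feed(fetch(url))
        parser.close()
        older = None
        for href, text in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if text == "<Older":
                older = absolute
            elif is_email_link(absolute) and absolute not in found:
                found.append(absolute)
        url = older                        # None ends the walk
    return found
```

Each URL returned could then be fetched and saved as an HTML file. In real use `fetch` might wrap `urllib.request.urlopen`, while passing a dictionary lookup instead makes it easy to try the walk on a few saved pages first.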
Unfortunately a collection of several thousand web pages containing email text is not as useful as emails in a mail folder in an email client, sorted by Subject, Date, From etc., although it is far more flexible than when they were just stuck in GDS's cache. Fortunately the most common mail folder format, Mbox, is very simple: it basically consists of just the emails pasted one after another, with a line beginning with 'From ' as the separator. So one could copy the email body & headers from the HTML source code of those web pages, paste them in and add MIME type headers to tell the email client that the email body is in HTML. Unfortunately the dates & addresses in the headers are in the wrong format for emails, so those will need conversion too. That would be tedious to do manually for a few emails, let alone thousands.
Here is a program I wrote to do that, converting emails saved from Google Desktop Search back to an Mbox email folder, automatically: Download GDSSavedEmailsToMbox.pl (8 KiB).
How to use it:
perl GDSSavedEmailsToMbox.pl <Enter> or by double-clicking its icon.
I used the Mozilla Thunderbird variant of Mbox as most email clients can import that (for Thunderbird itself, just copy the file into the directory Thunderbird is using for email folder files and restart Thunderbird). Unfortunately lousy Microsoft Outlook 2k is one that can't (although I've read that it might be possible via Microsoft Outlook Express).
It might not convert all header formats correctly. I don't know the full range of possibilities from GDS (e.g. addresses were displayed in at least 4 different formats) and so made my program understand those I found in my sample data.
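To illustrate the Mbox layout described earlier, here is a minimal sketch in Python (again, not the Perl converter itself) that turns one recovered message into Mbox text: a separator line beginning with 'From ', the headers including the MIME type for an HTML body, a blank line and the body. The function name and the sample addresses are hypothetical. The '-' placeholder in the separator line is an assumption about the Thunderbird variant, and parsing GDS's displayed dates and addresses is left out because those formats vary, so the date is assumed to have been parsed into a `datetime` already.

```python
import re
import time
from datetime import datetime
from email.message import EmailMessage
from email.utils import format_datetime


def to_mbox_entry(sender, recipient, subject, sent, html_body):
    """Return one email as mbox text: separator, headers, blank line, body."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg["Date"] = format_datetime(sent)          # date in email header format
    msg.set_content(html_body, subtype="html")   # adds the MIME type headers
    text = msg.as_string()
    # Quote any body line that would otherwise be mistaken for a separator.
    text = re.sub(r"^(>*From )", r">\1", text, flags=re.MULTILINE)
    # Assumption: '-' stands in for the envelope sender in the separator line.
    separator = "From - " + time.asctime(sent.timetuple())
    return separator + "\n" + text.rstrip("\n") + "\n\n"


# Example with made-up data: one recovered message as mbox text.
example = to_mbox_entry(
    "alice@example.com", "bob@example.com", "Hello",
    datetime(2006, 9, 4, 10, 15), "<p>Hi</p>",
)
```

Concatenating the strings returned for each saved page and writing the result to a single file gives an Mbox folder file. Quoting body lines that start with 'From ' matters because an unquoted one would be read as the start of a new email.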
Of course there are risks in GDS indefinitely keeping copies/summaries of emails (and other documents) that one thought were deleted or password-protected. People whom one does not want to read one's documents might use it if they get hold of one's computer (or even just by obtaining the hard disk secondhand in the future, unless it has been thoroughly wiped). The consequences could be anything from family members finding out about their birthday presents to rival businesses finding out commercial secrets. Hence it seems a good idea to periodically flush deleted files from GDS's cache and/or carefully ban it from indexing confidential areas. The only obvious way I know of flushing deleted files from GDS's cache is the drastic one of deleting all of GDS's index & cache files completely, after checking that emails are backed-up successfully, and letting GDS time-consumingly rebuild the whole index.
I cannot give more experienced advice anent GDS as I have not used GDS on my own PCs. Instead I used the old fashioned method of taking the time to save files systematically under filenames and directories where I can find them in the future by either name search or by position in the directory tree.
I seriously suggest that you back-up your data properly in the first place. It is much easier to recover from proper full back-up copies than to scavenge accidental back-ups from GDS etc. There are many ways to do back-ups; the routine I use is:
A friend of mine accidentally lost his email archive when a back-up failed (lousy Microsoft Outlook 2k unnecessarily read-locks, not just write-locks, its PST mail archive files, which prevents the back-up from working) and a faulty back-up (which turned out to be 47 MiB of null bytes!) was then accidentally copied over the original. My friend found that GDS had copies and so I created the programs to get them out of GDS and back into a mailbox.