I’ve been tasked with backing up and archiving email messages from Google Apps accounts for a client of mine. We sat down and came up with a plan to follow. The goal was to export the email accounts from Google by year, which would produce mbox files. These files would then be loaded onto a computer so they could be opened in Thunderbird.
Along the way I’ve realized I am working with 100 GB of data. I’ve also come across another way to export: I can save the messages as PDFs, export all the attachments to their Google Drive, and from there move them wherever they need to go. Ideally these files would then all be indexed to make them easy to search.
To me this sounds great. I love everything to be indexed. However, the way I search for things can be drastically different from how others LOOK for things. For example, a person could use labels or folders to organize their emails, and when they want to find something they navigate to that folder or label. I search by keywords. If the emails are all converted to PDFs and indexed, they probably won’t be labeled or tagged in any way, which would confuse the person who searches by folder or label.
With my initial idea of exporting by date, I have found out that searching by date in Gmail doesn’t work exactly how they say it does. I’m not able to select ALL messages: the search skips sent messages, and it skips Hangouts conversations. It also sometimes thinks 2014 is the same as 2013! I’m not sure what is going on here, but I have wasted too much valuable time on it. Also, you can’t create a filter on date alone; it has to be a date plus a keyword. I can’t believe this oversight on their part.
I think what I am going to test next is to just download the whole mailbox and break it out manually by year in Thunderbird, using the importing/exporting plug-in. My next problem is that my file server doesn’t have 100 GB free and it sits on a DSL connection, so it would take forever to transfer all that data. What I am going to try is to put the data on S3, mount it, and see if Thunderbird can read it from there. I don’t mind if it is a bit laggy. Sure, I could set up another server on EC2 for it, but that would mean a larger monthly cost, and setting up a script to switch it on and off isn’t something I have done before.
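For what it’s worth, the by-year split doesn’t have to happen by hand inside Thunderbird. Here is a rough sketch of doing it with Python’s standard mailbox module instead. It is not something I have run against the real export: the function names and the output file names are made up for illustration, and it assumes each message’s Date header is a sane stand-in for the year it belongs to. Messages with a missing or unparseable date go into a separate "undated" file.

```python
import mailbox
from email.utils import parsedate_to_datetime
from pathlib import Path


def message_year(msg):
    """Return the year from the Date header, or None if missing or unparseable."""
    raw = msg.get("Date")
    if not raw:
        return None
    try:
        return parsedate_to_datetime(str(raw)).year
    except (TypeError, ValueError):
        return None


def split_mbox_by_year(source, out_dir):
    """Copy each message from `source` into <out_dir>/<year>.mbox.

    Returns a dict mapping each output name ("2012", "undated", ...) to the
    number of messages written there. The source file is not modified.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = {}
    counts = {}
    src = mailbox.mbox(str(source), create=False)
    try:
        for msg in src:
            year = message_year(msg)
            key = str(year) if year else "undated"
            if key not in outputs:
                outputs[key] = mailbox.mbox(str(out_dir / f"{key}.mbox"))
            outputs[key].add(msg)
            counts[key] = counts.get(key, 0) + 1
    finally:
        for box in outputs.values():
            box.flush()
            box.close()
        src.close()
    return counts
```

Calling something like split_mbox_by_year("all-mail.mbox", "by-year") would leave one mbox file per year in the by-year folder. On an export this size the first pass will be slow, because the module scans the whole file to find message boundaries before it starts copying.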
With this in mind I will propose both ways as options and set up tests for each one to see how they work out.
Since I started this post a few months back I have gone through the steps to try to get the archiving to work. I am still failing at this! Thunderbird has a 2 GB folder limit (4 GB on Linux), and I can’t even archive one email address for one year. I was going to do it by year, but now, after filling up my 2012 box, I am about to say screw you, Thunderbird. You SUCK.
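If the folder size limit is the real blocker, one workaround I could try is cutting a year’s mbox into several smaller files before importing them as separate folders. The sketch below is a guess at how that might look with Python’s mailbox module. The function name, the "-partNNN" file naming, and the idea that staying under a byte budget per file is enough to keep Thunderbird happy are all my assumptions, not something I have tested.

```python
import mailbox
from pathlib import Path


def split_mbox_by_size(source, out_dir, stem, max_bytes):
    """Copy messages from `source` into numbered mbox files of roughly max_bytes.

    Sizes are counted from the raw message bytes, so the files on disk end up
    slightly larger (mbox separator lines, escaping). Pick max_bytes with some
    headroom under the real limit. A single message larger than max_bytes still
    gets a file of its own.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    src = mailbox.mbox(str(source), create=False)
    paths = []
    current = None
    current_size = 0
    try:
        for key in src.iterkeys():
            raw = src.get_bytes(key)
            size = len(raw)
            # Start a new part when adding this message would exceed the budget.
            if current is None or (current_size > 0 and current_size + size > max_bytes):
                if current is not None:
                    current.flush()
                    current.close()
                path = out_dir / f"{stem}-part{len(paths) + 1:03d}.mbox"
                current = mailbox.mbox(str(path))
                paths.append(path)
                current_size = 0
            current.add(raw)
            current_size += size
    finally:
        if current is not None:
            current.flush()
            current.close()
        src.close()
    return paths
```

For the 2012 box I would try a budget well under 2 GB, say 1.5 * 1024**3 bytes, and import each part as its own folder.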
My next attempt will be to set up an open source mail archiving program called Mail Piler. Its documentation is light on details, and my first install attempt didn’t go so well. The best install guide out there is for CentOS, and while I feel confident about getting it up and running there, I am not so confident in my ability to fix things if something goes wrong. All my Linux servers are Debian varieties. With this in mind I am going to make another attempt with Piler. It sure seems like it will work for me. They even have a demo site that looked great.