God, this blog is less than 24 hours old and I'm already going to shill for something. But don't leave! Trust me, this is some useful shit I'm about to present, namely why Rsync.net will completely rock your ass. Besides, it's not like I'm getting paid for this: I just get that warm fuzzy feeling that comes from doing good.
So basically I'm a data nut and an automation nut: I'm obsessed with backing stuff up and I never want to have to think about doing it. Like any true DIY'er, I've bought my fair share of somewhat sketchy hard drives over the years (including two from the infamous IBM DeskStar batch a few years ago) and have had them fail at the most inopportune times. My favorite story is when both of the aforementioned DeskStars died when I was only an hour away from beating Baldur's Gate II . . . after I had invested a good 60-70 hours doing as many of the side quests as I could. That vein in my forehead still goes crazy when I think about it . . .
My backup source is my Windows 2000 workstation: in the absence of a dedicated server box, it has both the usual personal junk (documents, music, application settings) and a good deal of development-related data (websites, several Subversion repositories, a plethora of MySQL databases, etc.). I've got Cygwin installed, so I cooked up a Bash script a few years back that aggregates all of the essential data nightly and then writes it somewhere. It's that "somewhere" that's changed a great deal over the years: in my quest for automated backup nirvana, I'd gone through a wide variety of solutions before finally settling on Rsync.net.
First was a simple Zip drive: I got a 250 MB disk which sat in the Zip drive permanently and was overwritten every night by the updated set of data. That worked fine for a while since it was nice and reliable, but eventually 250 MB just wasn't cutting it, so I had to upgrade to a more spacious medium, which meant going with a tape drive. Now, all of you in the network/server admin community know that good tape drives are fucking expensive. I ended up going with a variety of lower-end IDE Seagate models (the Hornet Travan 40, to be more specific) which were a good deal cheaper but also a good deal less reliable. I mean, I'm sorry, but as much as I love my data integrity, I just couldn't see myself spending $500 or more on an enterprise-class tape drive. So I owned three in the space of a year, all of which failed after a few months. At this point, I decided to investigate the possibility of a remote backup service.
The first one I went with was Streamload. Now, Streamload is a great service for uploading, storing, and sharing large media files, but it's a pain in the ass to deal with when you're trying to automate a backup solution. On the surface, it sounds awesome: you pay $10 a month to upload and store as much as you want on their servers. It's only when you download that you start counting against a monthly limit (10 GB/month with the plan that I was on). Sounds great, right? Well . . . not so much. In order to upload via the command line, you have to make use of a third-party Perl module, which is no problem, but once you start actually trying to upload things, the limitations of their protocol become apparent. The most egregious problem was the fact that it can't handle zero-length files, so if you just batch up a bunch of data and try to upload it (like I was), you won't always be successful. For instance, Thunderbird mail folders have a number of zero-length files, so I basically had to tar all of the backup data before uploading it. This is a perfectly acceptable solution, except for the fact that it negates one of the real strengths of the Streamload protocol. You see, before you upload a file, Streamload asks you to generate hashes of random segments of the file, which it then uses to check whether a copy of that file already exists on the system. If it finds a match, then you don't have to upload anything and your account is basically just given a symlink to the already-existing file. So it operates a bit like rsync in that you don't have to upload anything that already exists on the server. But when you generate a tar file to upload, all of those advantages go out the window and you have to send the whole 750 MB file every single night. To top everything off, the command line interface allows you to upload, but that's it: you can't delete existing files, you can't move files around, etc.
This is understandable since you're not actually accessing your account, per se, but rather a public dropbox for your account that lets you upload things and lets other people send stuff to you. There's no command line way to access your private files: you have to do that through their web interface or through a number of GUI tools that they provide. So, since nothing is overwritten in your dropbox when you upload stuff via the command line, you end up with a new backup archive in your dropbox every night and you have to remember to go in every couple of weeks and clean things out. So, not exactly an automated solution.
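To make the segment-hash idea a bit more concrete, here's a rough sketch. This is NOT Streamload's actual protocol: the function name, the fixed offset (rather than a random one), the 512-byte block size, and the use of cksum are all assumptions picked purely for illustration. It just shows that identical content produces matching segment checksums, while a file that changed even slightly does not, which is why a freshly built tar archive never matches last night's copy.

```shell
#!/bin/sh
# Illustration only: not the real Streamload protocol.
# segment_checksum FILE OFFSET COUNT
# Prints a checksum of COUNT 512-byte blocks of FILE, starting at block OFFSET.
segment_checksum() {
    dd if="$1" bs=512 skip="$2" count="$3" 2>/dev/null | cksum | cut -d' ' -f1
}

# Demo files: a and b are identical, c has picked up a small change.
tmpdir=$(mktemp -d)
printf 'same bytes every night\n' > "$tmpdir/a"
cp "$tmpdir/a" "$tmpdir/b"
printf 'same bytes every night, plus one new mail\n' > "$tmpdir/c"

# a and b print the same checksum (upload could be skipped); c prints a different one.
echo "a: $(segment_checksum "$tmpdir/a" 0 1)"
echo "b: $(segment_checksum "$tmpdir/b" 0 1)"
echo "c: $(segment_checksum "$tmpdir/c" 0 1)"

rm -rf "$tmpdir"
```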
The second-to-last destination on this crazy train was Amazon's S3 (Simple Storage Service): it was new, shiny, and exciting, but ultimately proved to be too immature and unreliable. Its cost structure is pretty straightforward: you pay $0.15/month for each GB of storage you use and $0.20 for each GB of data transferred (both uploads and downloads count against this). It operates over HTTP to send and receive the data and uses a custom set of HTTP headers to provide authentication. You also get full control over your data: you can create directories, move files around, and perform basically any other management task you can think of. So, the service sounds pretty sweet so far: relatively low cost, provided by a reputable company, and built on established protocols for data transfer. Unfortunately, as I said earlier, it was just too immature: I was never actually able to upload anything to my account. I wrote my own custom, FTP-like command line client, but even when I just tried to use the stock sample code provided by Amazon, it would just hang whenever I tried to upload anything. I tried for a week to get stuff to work, but to no avail. However, like manna from heaven, I stumbled across a link to Rsync.net while looking for solutions to my S3 problems.
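For a feel of what that pricing means in practice, here's a quick back-of-the-envelope calculation using the two rates quoted above. The workload is an assumption for the sake of the example: the same 750 MB archive pushed in full on each of 30 nights, so 0.75 GB stored and 22.5 GB transferred. The function name is made up.

```shell
#!/bin/sh
# s3_monthly_cost STORED_GB TRANSFERRED_GB
# Prints the monthly bill in dollars at the rates quoted in the post:
# $0.15 per GB stored plus $0.20 per GB transferred.
s3_monthly_cost() {
    awk -v stored="$1" -v moved="$2" \
        'BEGIN { printf "%.2f\n", stored * 0.15 + moved * 0.20 }'
}

# Assumed workload: 0.75 GB stored, 30 nightly uploads of 0.75 GB each.
s3_monthly_cost 0.75 22.5
```

Under those assumptions, nearly all of the bill comes from transfer rather than storage, so how much gets re-sent each night matters far more than how much is sitting on the server.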
Rsync.net is a simple service built specifically for remote backups over a network. These guys basically just run a series of Linux file servers and give you SSH access to your account, which means that you can use any tool capable of operating over SSH (scp, rsync, sftp, etc.) to get data to and from your account. The cost structure is pretty simple as well: you can upload and download as much as you want, and pay only $2/month per GB of storage used. I ended up using rsync as my tool of choice for my backup process. For those of you unfamiliar with rsync, it's a stable, robust, and feature-rich *nix tool designed to synchronize the contents of directories between two servers. It analyzes the contents of the two servers and then only sends over the data that is different between the two, saving both time and network bandwidth. It can also operate over SSH, meaning that the process can be completely secure. So, as you can imagine, this is exactly what I needed: you can set up key-based SSH authentication for your account, which means that the rsync process can be completely automated. Add to that the fact that the people who run the company 1) know what they're doing and 2) are extremely helpful. There aren't any vacuum-brained support personnel that just spit back answers from a script: any time you ask for help, you talk to an actual developer. I had a problem initially where the rsync process was dying when I was trying to delete files on the remote server that were no longer present locally, so I sent them an email. Within a few hours, I heard back, and there was no slick double-talk or attempts to cover up what happened: they told me straight up that they had made some recent security changes that were causing rsync to incorrectly restrict the deletion of files.
They sent me several emails over the course of fixing the problem, one saying that a temporary work-around was in place that should allow rsync to run successfully, and another saying that a complete fix was now in place. So, they're quick, courteous, and knowledgeable which is about all you can ask for in a support staff. Bottom line is that I've been using their service for two months now and haven't had a single issue other than the aforementioned delete problem. I can't recommend them highly enough.
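For anyone who wants to try the same kind of setup, a minimal sketch of an rsync-over-SSH push might look like the following. Everything specific here is a placeholder and an assumption: the account, host, key path, and staging directory are made up, and the flags are just one reasonable combination, not necessarily what any particular setup uses. The script only prints the command so you can eyeball it before running it for real.

```shell
#!/bin/sh
# Sketch of a nightly rsync-over-SSH push. All values below are placeholders.
REMOTE_DEST="user@backup.example.com:nightly/"   # hypothetical account and remote path
SSH_KEY="$HOME/.ssh/backup_key"                  # hypothetical key with no passphrase
STAGING_DIR="/cygdrive/c/backup/staging/"        # hypothetical local staging directory

# rsync_command SRC DEST KEY
# Prints the rsync command line for the given source, destination, and key.
#   -a        archive mode: recurse and preserve times, permissions, etc.
#   -z        compress data in transit
#   --delete  remove remote files that no longer exist locally
#   -e        use ssh with the given identity file as the transport
# Note: paths containing spaces would need extra quoting.
rsync_command() {
    printf 'rsync -az --delete -e "ssh -i %s" %s %s\n' "$3" "$1" "$2"
}

# Print the command for inspection instead of running it.
rsync_command "$STAGING_DIR" "$REMOTE_DEST" "$SSH_KEY"
```

Once the printed command looks right, it can be dropped into a nightly cron job. The trailing slash on the source directory tells rsync to copy the directory's contents rather than the directory itself, and it's worth adding -n (dry run) the first time --delete is involved.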
So, if you have other services you want to recommend (or warn against) or just general remote backup advice, hit up the comments for this post!
Posted by FuriousGeorge