Last week I completed the move of a website hosted on a physical server to Amazon’s EC2 cloud computing platform. The experience was educational to say the least and so far I’m pretty happy with things. I’ll post a follow-up after I’ve had a few more weeks of experience with it. In the meantime, I thought I’d share some of our implementation details.
The website is christianaudio.com, one of the largest online audio book publishers and retailers in the Christian market. I’m a co-owner of the business (but don’t ask me anything about the books—I’m just the internet guy). Christianaudio not only sells physical audio books directly and through distribution; we also offer most of our books for download in mp3 format. In addition, Christianaudio publishes its own line of audio books created in-house. To do that, we manage a small army of audio engineers and professional narrators.
Prior to last week, christianaudio.com was hosted on a single server at The Planet. Database, web, FTP, and all of our downloadable content sat on one machine with three hard drives. It was a testament to humble beginnings but also one monster of a single point of failure. I’d estimated that it would take us approximately 24 hours to recover from a catastrophic hardware failure (requisitioning a new server, reconfiguring, restoring from backups, etc.). The online side of the business has been growing like crazy in recent months, so we decided now would be a good time to upgrade our infrastructure.
Our setup at Amazon is fairly typical. We now have multiple EC2 instances on the front-end running Apache and PHP. They all connect to a MySQL database which uses EBS for persistent storage. All of the virtual machines are Small instances. In addition, we have an FTP server which is used by our audio book production staff and narrators, for transferring downloadable audio books to S3, and for sharing files with business partners.
Converting for scale
Re-hosting the site on Amazon’s platform was straightforward and required almost no changes to our code. There were just two issues that had to be addressed to allow our app to scale: audio book downloads and content management.
Session handling was also an obviously important issue, but we were already using MySQL for session storage, so no changes had to be made. I’ve had great success using memcached for sessions over at bighugelabs.com so it’s on my list to switch. Storing sessions in memory will improve I/O performance on the DB server.
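For the curious, switching PHP sessions over to memcached is mostly a configuration change rather than a code change. A rough sketch with the memcache extension might look like this (the server name is made up, not our actual config):

```ini
; php.ini — hypothetical memcached session setup (memcache extension)
session.save_handler = memcache
session.save_path = "tcp://cacheserver:11211"
```

The tradeoff, of course, is that sessions stored in memory don’t survive a cache server restart, which is fine for us since a lost session just means logging in again.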
Audio book downloads
On our old site, audio books were stored on physical hard drives attached to the server. When a customer downloaded a book, we’d simply grab the bytes from the file system and send them down the pipe. We were already using S3 as a backup system for our audio book files so it was a no-brainer to simply get the books from S3 as needed instead of storing them all locally.
Each of our books is split into chunks of about 30 megabytes each (64kbps mp3), each chunk ending at a chapter, with between two and ten chunks per book (some have a lot more). On our old server we’d just scan the file system for all of the chunks and present them to the customer for download. Scanning S3 is relatively slow, so instead I keep a local listing of all of our audio book chunks: for each chunk stored on S3, there’s a zero-length file with the same filename in a directory on the web server. Presenting the chunks for download therefore works exactly the same as before.
When a customer downloads a chunk, a cache is checked first to see if the actual file is available. If it is, we serve the bytes. If it isn’t, then we get the file from S3, put it in the cache, and then serve the bytes. Periodically, a process sweeps the download cache and removes old files. This works great because bandwidth between EC2 and S3 is free.
Content management

Content management was a thornier problem. We have several kinds of content stored in the file system: audio books, audio book samples, podcasts, and image files. Images (product images, typically) can be uploaded directly via admin pages on the web servers. The problem is this: how do you get an image uploaded on www2 to www1 and www3?

I’m using an app called Unison to manage content replication across our web servers. Unison is a general purpose file replication system. Each web server uses Unison to synchronize its content (everything in /var/www) with a master replication source on EBS (currently attached to our database server). So any content changes on any web server are first replicated to the master, and then every other web server picks up the changes from the master. Unison works very well.
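A Unison profile for this kind of setup looks roughly like the following. This is illustrative, not our actual config; the hostname and options are assumptions:

```
# ~/.unison/content.prf — hypothetical replication profile
root = /var/www
root = ssh://replserver//var/www
batch = true     # run without prompting, so cron can drive it
prefer = newer   # on a conflict, keep the newer copy
times = true     # synchronize modification times too
```

Each web server runs Unison against the master on a short interval, so an image uploaded on www2 shows up on www1 and www3 within a minute or two.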
All of our audio is uploaded via our FTP server (also backed by an EBS store). Audio is processed differently depending on its type: podcasts and samples are rsync’ed to the master replication server and then replicated down to each web server via Unison. Audio book downloads are processed by transferring them to S3 and then updating the file listing cache replicated to each web server.
Other than that it’s a standard LAMP cluster and moving the entire system to a different cloud or back to physical servers would be almost trivial.
Each web server needs to know the internal IP address of the database server. Our database server generates a hosts file for the web server instances and saves it in S3. When a web instance starts up, it replaces its /etc/hosts file with the one from S3. The hosts file has several different entries, including dbserver and replserver. All of our internal processes use those names, so if we ever move the master replication source to a new server, none of our code will need to be changed.
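The startup step amounts to “fetch and overwrite.” Something like this sketch (the S3 fetch is stubbed; names and paths are hypothetical):

```python
import os

HOSTS_FILE = "/etc/hosts"

def fetch_hosts_from_s3():
    """Stub: in production this downloads the hosts file that the
    database server uploaded to our S3 bucket."""
    raise NotImplementedError

def update_hosts(fetch=fetch_hosts_from_s3, hosts_file=HOSTS_FILE):
    """Replace the hosts file with the cluster-wide copy from S3.
    Safe to run at any time, e.g. from an instance init script."""
    content = fetch()
    tmp = hosts_file + ".tmp"
    open(tmp, "w").write(content)
    os.rename(tmp, hosts_file)  # atomic swap on POSIX

def resolve(name, hosts_file=HOSTS_FILE):
    """Toy hosts-file lookup, just to show the effect of the swap."""
    for line in open(hosts_file):
        parts = line.split()
        if len(parts) >= 2 and name in parts[1:]:
            return parts[0]
    return None
```

Writing to a temp file and renaming means a process reading /etc/hosts never sees a half-written file.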
If we ever need to restart the entire cluster, we just start the database server first and then the web instances. Or we can run the initialization scripts manually on each server as needed (or even as cron jobs). The scripts are written so that they can be run at any time and they will always do the Right Thing.
For now, we’re using simple round-robin DNS for load balancing with a TTL of 5 minutes. This allows us to add or remove servers from our array and have the changes take effect within 5 minutes. In the event that we get an unexpected burst of traffic, we can launch a couple of extra instances and add them to the DNS. When they’re no longer needed, we remove them from the DNS, wait 5 minutes, and turn them off.
No auto scaling here. Our traffic is pretty steady except for a few (very predictable) days a month when we send out our free download newsletter. I’m planning on launching a couple of extra web instances for our next newsletter.
Each of our front-end web servers has its own elastic IP address so if one of them crashes we can just launch a new instance and reassign the IP to the new server without having to make any DNS changes at all.
Amazon’s cloud computing infrastructure gives me a lot of peace of mind because it significantly reduces the effects of a catastrophic failure. I estimate that a hardware failure on our old system would have caused us a minimum of 24 hours of downtime while we requisitioned and set up a new server and restored from backup. If the failure were to a hard drive serving our audio books, it could take 24-48 hours to restore the books (gigs and gigs) from S3.
Our database server at Amazon is still a single point of failure but if it crashes we can have a new instance up and running in five minutes. Since the data is on EBS, the new server will pick up exactly where the old one left off (plus we keep hourly backups just in case). Same story for our web servers. And since all of our books are being served from S3 we no longer have to worry about a hard drive failure taking out our downloads or lengthy restores from backup.
In addition, Amazon’s cloud provides us with many different ways to scale. The most obvious based on the diagram is to add additional web server instances. But we can also scale by moving from Small to Medium to Large instances as needed or adding caching servers. If we outgrow our EBS storage we can simply create a new, larger EBS volume and copy over our old stuff. And we should never run out of space on S3 for our audio books so we never have to worry about scaling our storage like we did on our old system.
That’s all for now. I’ll post a follow-up about my experience running this stuff in a few weeks.