Lacking in Persistence

A few weeks back was the very first incarnation of Cloud Camp in the Cloud. While I was unable to attend I did get around to watching the screencast of it courtesy of @ruv (available here). There was certainly a bunch of good information and intelligent discussion but I found one question to be particularly interesting and insightful. The astute attendee asked "Why was Amazon EC2 designed such that instances have transient (ephemeral) storage? Rackspace has been pushing their marketing on the fact that their servers have persistent storage is this a big deal?" That is certainly a loaded question but I think I'm in a reasonable position to take a shot at it.

Lets start by talking about some of the reasoning behind why Amazon would make their instances transient. The first reason is simple, making instances transient makes life a lot easier for them, particularly given their scale of operation. If they wanted to make instances persistent they would need replicate that data at least twice, if not three times, with at least one being in another data center. Imagine the amount of traffic that would be needed to keep a write-heavy database server consistent across multiple disks. Also, all the data to that last replica has to ship over the intertubes means say hello to Mr. latency. Ever notice that Amazons EBS volumes are replicated only within a single availability zone? I would be willing to bet this is because of network traffic and latency concerns.

One thing you will notice about a lot of the cloud providers that provide persistent instances (e.g. vCloud Express, ReliaCloud) is that they break that nice little "pay-per-use" model everyone is so fond of. In order to maintain those persistent instances most providers charge you even if they are "powered off". ReliaCloud instances are roughly 1/2 price when powered off while vCloud Express charges the full price. Rackspace has a somewhat different approach. First they RAID 10 the disk which should make failure less likely. In addition, they claim that if a failure occurs they will automatically relaunch your instance for you complete with data. How long does that fail-over take? They don't say and I have a hard time believing you will get a guarantee from them.

Taking a step back, what you will find is that Rackspace is really the midpoint between EC2 and vCloud in terms of persistence. On EC2 if an instance fails your it is gone along with your data (unless your using EBS). On vCloud, if your instance fails or you power it off it still persists. Rackspace falls in between in the sense that if your instance fails it will come back (with some delay) but if you shut off your server its gone along with its data. Thus, the only real way to make your data persistent on Rackspace is to keep the server running (at full price), or dump it into CloudFiles. This points out one of the really nice benefits of EBS which is that you can have persistent data without needing an instance to store it on (read: cheaper). But why would you want to store data without an instance attached to it you might ask? Its simple, there is probably some portion of data that you would like to keep persistent (e.g. database) and while you could dump it onto CloudFiles/S3, reading that data back onto a newly launched instance can take a loooong time. This is what it was like on EC2 pre-EBS and it wasn't pretty.

Now the trickier part of the question which is whether this persistence makes Rackspace a more attractive cloud infrastructure service. Having machines come back after failing is certainly a nice feature but it you still need to have a second server for fault-tolerance if you want to attain a reasonably high availability. While it may take some time it is almost certainly faster then the time it would take you to figure out that your EC2 instance has died and having you hustle over to your laptop to fire up a replacement. The bottom line is that it probably not available enough for any reasonable sized service to rely on it without a proper backup. On the other hand if you are hosting something like your blog where a few minutes of downtime isn't a critical then it can be a handy feature.

On paper it certainly looks nice, but in my opinion its not really a huge benefit. In all the time I have used EC2 I have only ever seen a few instances fail and it was after I had accidentally set the Java heap size to be 10x the available memory and then ran a big Hadoop job. The machine was thrashing so hard I'm not surprised it died. Aside from that (very extreme and operator induced) case have never seen an instance fail. In my experience EC2 instances simply don't fail frequently enough for this to be a big deal to me.

To be honest I think the fact that Rackspace allows you to "grow" your instance size is a much more attractive feature, but that's for another post...

For an interesting and thoughtful comparison of EC2 and Rackspace I would take a look at this blog post. He does a good job hitting on many of the important points and even agrees with my thoughts on the benefits of EBS.


cv said...

Thanks for the write up, we do believe persistence is important, especially for non-elastic use cases. You can read more about our thoughts on persistence in our blog post entitled “Cloud Servers and EC2: Why Persistence Matters.”

Also, to your point on Rackspace offering a guarantee for failover, I might mention that we actually do offer a very competitive SLA around Cloud Server recovery. Take a look at our SLA here; hopefully it will have some answers to your concerns.

Please continue to give us your feedback. We are listening!

Chandler C Vaughn
Director of Product
The Rackspace Cloud

Post a Comment