Be Wary of the Paranoid

I recall distinctly that when I was first learning about Rails and plugins, the very first one I used was acts_as_paranoid. Something about actually deleting data concerned me, so I figured adding acts_as_paranoid to some important tables in my application would save a lot of headaches. While it is tremendously useful, it also has a pretty big unintended consequence that I think gets overlooked by most. Let's take an example query generated by acts_as_paranoid.

>> User.find_by_id(7)
SELECT * FROM `users` WHERE (`users`.`id` = 7) AND (users.deleted_at IS NULL OR users.deleted_at > '2009-11-26 02:17:28')

Looks harmless, right? What I want to draw attention to is the "OR users.deleted_at" part. In order to ensure that the user isn't deleted, it checks not only that the deleted_at field is NULL, but also whether deleted_at is greater than the current time. In reality the IS NULL check is sufficient unless you are setting the deleted_at of some object to be in the future, and I have yet to see anyone actually use it in that way. This is what makes the use of the current time in the query so bad: it is slowing down tons of Rails applications and most people don't even know it.

One important thing to notice about the MySQL query cache is that it is pretty dumb. Basically it caches the incoming query string exactly as written and stores the associated result set. This becomes a problem when you put something like the current time in the query string, because it functions as a cache-buster every second. So at 0 seconds you make a query and it is stored in the cache, then at 0.5 seconds you make the same query and it is read from cache, then at 1.0 seconds you make the same query but it misses the cache since the time has increased by a second. This means that anything written to the query cache by acts_as_paranoid effectively has a 1 second expiration time. That's awful, and all that for the 0.005% of users who want to expire things in the future. Not to mention the fact that it completely pollutes the cache with old entries which never get touched again a second after they are written.
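To make the cache-busting effect concrete, here is a minimal sketch of a cache keyed on the exact query text. ToyQueryCache, paranoid_sql and simple_sql are hypothetical names made up for this illustration, and the toy cache is only a model of the behavior described above, not MySQL's actual implementation. It replays five requests spaced half a second apart, once with the timestamp in the query and once without.

```ruby
# Toy model of a cache keyed on the exact query text.
# Hypothetical class for illustration; not how MySQL is implemented.
class ToyQueryCache
  attr_reader :hits, :misses

  def initialize
    @store = {}
    @hits = 0
    @misses = 0
  end

  # Returns the cached result for sql, or runs the block and stores its result.
  def fetch(sql)
    if @store.key?(sql)
      @hits += 1
      @store[sql]
    else
      @misses += 1
      @store[sql] = yield
    end
  end

  # Number of distinct query strings currently held in the cache.
  def size
    @store.size
  end
end

# Query text with the current time (whole seconds) embedded in the string,
# mirroring the example query shown earlier.
def paranoid_sql(id, now)
  stamp = now.strftime('%Y-%m-%d %H:%M:%S')
  "SELECT * FROM `users` WHERE (`users`.`id` = #{id}) " \
    "AND (users.deleted_at IS NULL OR users.deleted_at > '#{stamp}')"
end

# The same query with the time comparison removed.
def simple_sql(id)
  "SELECT * FROM `users` WHERE (`users`.`id` = #{id}) " \
    "AND (users.deleted_at IS NULL)"
end

start = Time.utc(2009, 11, 26, 2, 17, 28)
offsets = [0.0, 0.5, 1.0, 1.5, 2.0] # seconds after the first request

with_time = ToyQueryCache.new
without_time = ToyQueryCache.new

offsets.each do |offset|
  with_time.fetch(paranoid_sql(7, start + offset)) { :row }
  without_time.fetch(simple_sql(7)) { :row }
end

puts "with time:    #{with_time.hits} hits, #{with_time.misses} misses, #{with_time.size} entries"
puts "without time: #{without_time.hits} hits, #{without_time.misses} misses, #{without_time.size} entries"
```

With the timestamp in the string, every new second produces a new cache key, so the entry count grows while the hit count stays low. Without it, one entry serves every request after the first.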

Alright, enough moaning, here's how to fix the problem. Open up paranoid.rb and in the "with_deleted_scope" function rip out "OR #{table_name}.#{deleted_attribute} > ?" along with the current_time variable after it. Similarly, in has_many_through_without_deleted_association.rb, in the construct_conditions method, delete the same string where it is appended to the conditions variable. Keep in mind that if you are setting deleted_at to values in the future then you don't want to make this change. But for everyone else, enjoy the improvement in your query cache hit rate.

As a final note, for the tables which you have made paranoid, you probably also want to consider adding an index which includes the deleted_at field, since it will be a condition of every SQL query on that table.

Updated: There is a fork of acts_as_paranoid courtesy of mikelovesrobots that provides the fixes I talked about above; it is available here. I'm gonna switch out my versions of acts_as_paranoid for this one, and I'd suggest you do the same.

Lacking in Persistence

A few weeks back was the very first incarnation of Cloud Camp in the Cloud. While I was unable to attend, I did get around to watching the screencast of it courtesy of @ruv (available here). There was certainly a bunch of good information and intelligent discussion, but I found one question to be particularly interesting and insightful. The astute attendee asked "Why was Amazon EC2 designed such that instances have transient (ephemeral) storage? Rackspace has been pushing their marketing on the fact that their servers have persistent storage. Is this a big deal?" That is certainly a loaded question, but I think I'm in a reasonable position to take a shot at it.

Let's start by talking about some of the reasoning behind why Amazon would make their instances transient. The first reason is simple: making instances transient makes life a lot easier for them, particularly given their scale of operation. If they wanted to make instances persistent they would need to replicate that data at least twice, if not three times, with at least one replica being in another data center. Imagine the amount of traffic that would be needed to keep a write-heavy database server consistent across multiple disks. Also, all the data headed to that last replica has to ship over the intertubes, which means say hello to Mr. Latency. Ever notice that Amazon's EBS volumes are replicated only within a single availability zone? I would be willing to bet this is because of network traffic and latency concerns.

One thing you will notice about a lot of the cloud providers that offer persistent instances (e.g. vCloud Express, ReliaCloud) is that they break that nice little "pay-per-use" model everyone is so fond of. In order to maintain those persistent instances, most providers charge you even if they are "powered off". ReliaCloud instances are roughly 1/2 price when powered off, while vCloud Express charges the full price. Rackspace has a somewhat different approach. First, they RAID 10 the disk, which should make failure less likely. In addition, they claim that if a failure occurs they will automatically relaunch your instance for you, complete with data. How long does that fail-over take? They don't say, and I have a hard time believing you will get a guarantee from them.

Taking a step back, what you will find is that Rackspace is really the midpoint between EC2 and vCloud in terms of persistence. On EC2, if an instance fails it is gone along with your data (unless you're using EBS). On vCloud, if your instance fails or you power it off, it still persists. Rackspace falls in between in the sense that if your instance fails it will come back (with some delay), but if you shut off your server it's gone along with its data. Thus, the only real way to make your data persistent on Rackspace is to keep the server running (at full price) or dump it into CloudFiles. This points out one of the really nice benefits of EBS, which is that you can have persistent data without needing an instance to store it on (read: cheaper). But why would you want to store data without an instance attached to it, you might ask? It's simple: there is probably some portion of data that you would like to keep persistent (e.g. a database), and while you could dump it onto CloudFiles/S3, reading that data back onto a newly launched instance can take a loooong time. This is what it was like on EC2 pre-EBS and it wasn't pretty.

Now the trickier part of the question, which is whether this persistence makes Rackspace a more attractive cloud infrastructure service. Having machines come back after failing is certainly a nice feature, but you still need a second server for fault tolerance if you want to attain reasonably high availability. While the automatic relaunch may take some time, it is almost certainly faster than the time it would take you to figure out that your EC2 instance has died and hustle over to your laptop to fire up a replacement. The bottom line is that it is probably not available enough for any reasonably sized service to rely on without a proper backup. On the other hand, if you are hosting something like your blog, where a few minutes of downtime isn't critical, then it can be a handy feature.

On paper it certainly looks nice, but in my opinion it's not really a huge benefit. In all the time I have used EC2 I have only ever seen a few instances fail, and it was after I had accidentally set the Java heap size to be 10x the available memory and then ran a big Hadoop job. The machines were thrashing so hard I'm not surprised they died. Aside from that (very extreme and operator-induced) case, I have never seen an instance fail. In my experience EC2 instances simply don't fail frequently enough for this to be a big deal to me.

To be honest I think the fact that Rackspace allows you to "grow" your instance size is a much more attractive feature, but that's for another post...

For an interesting and thoughtful comparison of EC2 and Rackspace I would take a look at this blog post. He does a good job hitting on many of the important points and even agrees with my thoughts on the benefits of EBS.

Boy and Girl Birth Rate Interview Question

I came across an interview question which is pretty interesting and a bit of a head fake (in my opinion). The question is...
"In a country in which people only want boys, every family continues to have children until they have a boy. if they have a girl, they have another child. if they have a boy, they stop. what is the proportion of boys to girls in the country?" --courtesy of fog creek software forums
After thinking about it for a little while I thought I had the answer, but of course I was wrong. After reading the right answer I still couldn't quite convince myself that it was true, so I figured I'd test it out.
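A minimal sketch of one way to run such a test, assuming every birth is independently a boy or a girl with probability 1/2 (the question doesn't state this, so treat it as an assumption). The simulate_births helper, the family count, and the seed are all made up for this illustration. Each simulated family keeps having children until its first boy, and the script tallies the totals across all families.

```ruby
# Simulates families that keep having children until their first boy.
# Assumption: each birth is independently a boy with probability 0.5.
# Returns [boys, girls] totals across all families.
def simulate_births(families, rng = Random.new)
  boys = 0
  girls = 0
  families.times do
    # The condition is checked before each increment: every "girl" draw
    # adds one girl and the family tries again.
    girls += 1 while rng.rand < 0.5
    # The first draw that is not a girl is the boy, and the family stops.
    boys += 1
  end
  [boys, girls]
end

# Fixed seed (arbitrary choice) so repeated runs give the same counts.
boys, girls = simulate_births(1_000_000, Random.new(42))
total = boys + girls
puts "boys:  #{boys} (#{(100.0 * boys / total).round(2)}%)"
puts "girls: #{girls} (#{(100.0 * girls / total).round(2)}%)"
```

Under that assumption the two counts should land very close to each other. The stopping rule changes how big each family gets, but it doesn't change the odds of any individual birth.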

Sure enough the answer generated matches up with the explanation provided in the link above. So there you have it...