<h2>Regular Expression</h2>
<p><i>A blog by Jonathan Kupferman</i></p>
<h3>Are You an API Company or a Product Company?</h3>
<p><i>April 15, 2012</i></p>
<p>There are two broad types of software companies in the world: API companies and product companies. API companies are those focused on providing the best and most useful APIs possible. While they too might build products on top of their own APIs, it is clear that the API comes first and foremost. Two prime examples of this approach are EchoNest and Amazon Web Services.</p>
<p>On the other hand, there are product companies who are solely focused on making their product as good as possible and don’t care about the developer community. While some of these companies go so far as to release an API, it is obvious from their actions that they don’t really want anyone to build anything great on top of it. Examples of companies/products in this realm include <a href="http://www.betabeat.com/2011/09/05/tumblr-threatens-to-shut-down-missing-e-developers-personal-tumblr/" title="tumblrs negative opinion of their developer community">tumblr</a> and <a href="http://www.pcworld.com/article/253668/google_update_irks_developers.html">Google+</a>.</p>
<p>This distinction became clear to me after a recent experience I had with a company named <a href="http://www.songkick.com/">Songkick</a>. Songkick gathers data on upcoming music concerts in the US and UK. They expose this data via a web application at <a href="http://www.songkick.com/">songkick.com</a> along with a developer API.</p>
<p>I have been kicking around some web app ideas in my head in the music space. After deciding on one I wanted to build in the concert space I headed over to Songkick’s developer page to get API access. I put in my email and a short blurb about my idea and submitted it. I was told I would “receive a decision in a few days.” Arg. It was at that point I should have realized this wasn’t going to end well.</p>
<p>Four days later I received an email from Songkick notifying me that my request had been denied because, according to their representative, “we're holding off on issuing API keys to services that are that similar to our own functionality.” Seriously?! Did my idea have some overlap with theirs? Sure it did. We’re all using the same data from the same API, after all; what did you really expect?</p>
<p>After giving it some more thought I realized that the issue isn’t Songkick being extraordinarily picky about who they give API access to; the problem is that they are a product company that unfortunately happens to have an API. Product companies worry about others taking their API and making something better than what they’ve built with it.
Contrast this with the behavior of an API company, in this example Amazon Web Services. When AWS was first released it was just a set of APIs which developers could use to launch servers. Shortly thereafter the folks at RightScale and enStratus realized that they could build a nice web console on top of it to make the server-creation experience simpler and easier to use. For a few years both of those services spent a lot of time taking those APIs and building useful tools on top of them that made the platform even better.</p>
<p>A few years after the API release, Amazon decided to release their own web-based console. Did they revoke API access for RightScale and enStratus because they provided “similar functionality” to what the AWS Console provides? Absolutely not. In fact, the AWS Console is just another application built on top of the AWS API; it gets no special treatment.
As a cautionary tale for those considering making the switch from one to the other, let's talk about Twitter. From the very early days it was clear that Twitter was an API company. They had their flagship web app, but by releasing an API they allowed developers to create tons of different client applications: web apps, iPhone apps, Android apps, and even <a href="http://lolble.com/twitter-peak-the-neat-little-twitter-device/">entire devices dedicated to tweeting</a>. This strategy was working great for Twitter and allowed tweets to be written and consumed from almost any programmable device.</p>
<p>Everything was going great until March of 2011 when Twitter <a href="http://techcrunch.com/2011/03/11/twitter-ecosystem-guidelines/">“Drop[ped] The Ecosystem Hammer”</a> on its developers. What really happened was that Twitter made the decision to transition from an API company to a product company. Unfortunately that transition was made abruptly and with little warning. Many developers who had spent years developing on their platform were upset and ultimately stopped making applications.</p>
<p>If there is one thing I learned from my experience with Songkick, and that you’ve hopefully learned from this post, it’s this: the decision between being an API company and a product company is an important one, and if not made correctly it can cause a tremendous amount of strife between you and your developer community. Make the decision, make it early, and make it obvious to everyone else.</p>
<h3>Ship Code When It Is Ready</h3>
<p><i>January 1, 2012</i></p>
<p>My previous jobs had what I consider a fairly standard month-long release cycle. The first three weeks of the month were spent writing code. In the fourth week the code was dropped to QA, who would hammer on it, find bugs, and have developers fix them. At the end of the fourth week the code shipped.</p>
<p>I distinctly recall the final week before every release having an uneasy feeling in my stomach. Even though I had thoroughly tested my code and QA had signed off on it, I always worried that it would break horribly in production. Did I miss a corner-case? Was someone integrating with my code going to break it? Had there been regressions in the mad rush to get fixes in before the release? I’ll never forget that feeling....</p>
<p>That is why I was really surprised when I started working at <a href="http://turntable.fm" title="turntable: play music together">turntable.fm</a>. I distinctly recall when I finished my first (tiny) feature. After quadruple-checking the code to make sure all the digital i’s were dotted and t’s crossed I sent it off to code review. Another developer took a look and said “Yep, looks good.” I naively asked, “when is the next release?” To which he responded “Right now.” Uh oh, I was getting that feeling again... but just a minute later the code had been deployed and there was my shiny new feature staring me in the face, working just as I had hoped. It was a special moment.</p>
<p>This had a profound effect on how I think about deploying code. I often get asked what our release process is, to which I reply “we ship when the code is ready.” It’s just that simple.</p>
I’ve observed many benefits of this release strategy:
<ul>
<li>
Instant gratification. There is nothing like the constant positive reinforcement of frequently shipping shiny new features. As a developer this kicks my motivation into overdrive.</li>
<li>Your users get a constant stream of new features rather than having to wait for what amounts to a “monthly feature dump.” After you push new features, ask your users if they can spot them; you’ll be surprised at how quickly they identify them.</li>
<li>When each feature is pushed separately it is a lot easier to gauge your users’ response. Just like with A/B testing, to get a clear signal you want to minimize all other factors.</li>
<li>All the integration and testing is done by the developer shipping the feature. This means if anything goes wrong it's obvious who should be investigating the issue.</li>
<li>If there are problems it is easy to roll back since your change is the only thing that got pushed.</li>
<li>Debugging any issues with the code is easier since the code is still fresh in your mind. Getting all of the code you wrote a week (or a month) ago back in your brain can take longer than you would expect.</li>
</ul>
Do you ship code when it's ready?
<h3>Managing Talent: A Running Metaphor</h3>
<p><i>November 27, 2011</i></p>
<p>When I first joined the cross-country team in high school I naively thought that running was all the same: you just ran. Then came the first day we ran hills. It was then that I realized that running involves a lot more technique and skill than was initially apparent. One of the most important things I learned was how to run downhill.</p>
<p>On that first day of running hills our coach told us to “really work it going up the hill and recover on the way down.” After reaching the top of the hill I began my “recovery” back down the hill (i.e. jogging slowly). However, I noticed that everyone else was flying past me on the way down. Not only that, but when they reached the bottom they seemed even more rested and recovered than I was.</p>
<a href="http://www.runningmechanics.com/blog/wp-content/uploads/2010/01/downhillrunning.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="269" src="http://www.runningmechanics.com/blog/wp-content/uploads/2010/01/downhillrunning.jpg" width="320" /></a>
<p>After practice I asked one of the guys on the varsity team why everyone’s “recovery” speed was so much faster than mine. In true Southern California surfer fashion he said “Dude, you just got to learn how to let your legs go.” Let my legs go, what the hell does that mean? Is that some sort of weird surfer zen thing?</p>
<p>What he meant was: when running uphill you really have to churn your legs to keep moving. However, when you reach the top and start going down, your legs are already churning. With the help of gravity, your legs will keep churning unless you actively try to stop them. See also: Newton's first law. It is the same effect you get when you drive downhill and put your car in neutral.</p>
<p>Once you figure out how to “let your legs go” you just have two things to worry about. The first is that you regulate your speed. Your legs can take you far faster than you can safely go, so you should constantly be regulating your speed to go as fast as is safely possible. The second thing is avoiding obstacles. It only takes a small rock or divot to catch you off guard and send you tumbling down a hill. The further away you see the obstacle the easier it is to avoid.</p>
<p>Managing talented people fundamentally uses the same technique as running downhill. Imagine your team as the legs and you, the manager, as the head. When you have a talented team, they should be able to move incredibly quickly. While their ability to move fast is a good thing, it is your job as the manager to make sure that your team is going as fast as they can while still being safe. If you allow the team to move too fast they will eventually trip and fall. Hopefully it's just a scrape or a bruise and they can get back at it. But sometimes the injury is really serious and the recovery process is a long and painful one.</p>
<p>Avoiding obstacles is the second skill you need to master. As the manager you are the vision. You see what's coming both short and long term, and it is your job to steer the team appropriately. You have to make sure you are tracking what is right in front of you as well as what is on the horizon. It's important to remember that the further away you see an obstacle, the more smoothly you can steer the team around it. If you are really good, your team won't even notice the small course correction you made a while back that let you “effortlessly” avoid the giant boulder that would have eventually blocked their way.</p>
<p>If there is one piece of advice I can give you about managing a talented team it is: “let your team go.”</p>
<h3>Eatability Testing: Why don’t more restaurants do it?</h3>
<p><i>November 27, 2011</i></p>
<p>Two things I’ve observed about people in New York are that everyone jaywalks and no one cooks. Everyone eats out all the time. However, since most people are busy, a large amount of food is served via takeout and delivery (as opposed to eating in). This presents a challenge that few restaurants seem to recognize: the long delay between when the food is prepared and when it is actually eaten.</p>
<p>How many times has this happened to you? It is late on a Sunday evening and you call one of your favorite restaurants for a little delivery. You pick your favorite item off the menu, place your order, and wait in anticipation. Thirty minutes and a cash exchange later, the food is on your dinner table. You pop open the styrofoam container, grab a bite, and yuck, it’s cold.</p>
<p>Across the street from where I work there is a fantastic Vietnamese sandwich shop called <a alt="Num Pang NYC" href="http://www.numpangnyc.com/">Num Pang</a>. Their menu is filled with tons of delicious sandwiches like ginger barbecue brisket and hoisin veal meatballs. The trick with eating Num Pang is that you have to eat it immediately. In the 10 short minutes it takes to get your order and walk back to the office, the bread has already started to become soggy. If someone happens to catch you in the hallway you might as well say goodbye to your sandwich and hello to a (not so) hot mess.</p>
<p>This raises the question: why don’t more restaurants and takeout joints test the experience of eating food from their establishment? Prepare an order, put it on the counter for 20 minutes, and then have the chef eat it. Better yet, have the delivery person take it out with an order and then bring it back. I’d imagine most chefs would be surprised at the “presentation” and taste of their food after it has taken a bike ride through NYC in the winter. What was once a nice pad thai dinner will likely have turned into a cold ball of yuck.</p>
<p>Why don’t more restaurants do this? I think many restaurants are in for a rude awakening when they start eatability testing their menu.</p>
<p>Here are some of the most common missteps I’ve seen:</p>
<ul>
<li>
If you are serving something on bread (hamburgers, sandwiches, etc.) then any sort of sauce or liquid on it is going to make the bread soggy within 5 minutes. If you can, just put the sauce in a container on the side. I’m sure some health-conscious customers would appreciate it as well.</li>
<li>If you have to serve it with sauce, put the sauce as far away from the edges of the bread as possible. Otherwise, the sauce leaks out of the sides and makes the top and bottom of the bread soggy as well.</li>
<li>Separate hot and cold, just like you would at the grocery store. Styrofoam insulates surprisingly poorly, so packing a cold dessert on top of a hot soup is a bad idea. Account for leakage. Some well placed napkins can go a long way.</li>
<li>Account for the cooking and cooling that happen during travel. Talk to any chef in a food competition and they will tell you that residual heat can have a large effect on the taste of a meal. Plan for the 10-20 minutes of delivery time on all to-go orders.</li>
<li>Test & iterate. There will always be surprises, so eatability test with different items on the menu to make sure that it is as good as you expect it to be.</li>
</ul>
<h3>Soundtracked: Understanding how we consume music</h3>
<p><i>July 10, 2011</i></p>
<p>Lately I’ve become fascinated with music, particularly how people talk about and discover new music. For the last few months I’ve been talking to just about anyone and everyone about music, and I’ve learned some fascinating things which I plan to write about in a few blog posts. Here is the first set of observations I’ve made about how we consume music.</p><h4>Music genres are useless</h4><p>Just as an experiment, go ask a few people what type of music they listen to. Chances are they will respond with something frustratingly generic like “I listen to rock music” or “I like hip-hop.” Given that, do you feel like you have a strong sense of that person’s musical taste? Didn’t think so. While most people can describe their musical tastes in terms of genres, those genres typically carry very little meaning for the person hearing them. This is because high-level genres have become so overloaded that they could mean practically anything. The Foo Fighters (stadium rock), Pink Floyd (progressive rock), and The Decemberists (folk rock) all fall under the “rock” genre, yet in terms of sound they are miles apart. That is why I no longer ask people what type or genre of music they listen to; it's useless.</p><h4>Finding out what people actually listen to</h4><p>Now that we know music genres are useless, how do you actually ask someone what type of music they like and get a useful response? Initially I tried asking people what their favorite bands/artists were. Interestingly, I found that people struggled mightily with this question. There was often a good minute of hems and haws before I would get even a single band.
After a lot of thought many people would toss out a few bands and then trail off with “yeah, I guess that's it...” Not happy with the result, I experimented with a few other questions.<br />
</p><p>The most effective question I’ve found is to ask for the last few bands/artists they’ve listened to. Most people have no problem rattling off the last three or four bands they listened to, and while it's a biased sample, it's far easier for them to answer and much more useful than genres. For example, if someone tells me the last few artists they listened to were Fleet Foxes and The Freelance Whales, I’d immediately know that they like softer indie rock with great vocals, lots of harmonies, and acoustic guitars. It’s so much more accurate and useful than if they had said “I like indie rock.”</p><h4>New is a relative term</h4><p>When people are sick of their current music they will often ask their friends if they know any “new music.” Most people interpret this to mean music that has been recently released. But the reason people ask is that they want music they haven’t heard before, not because they have some aversion to older music. That's why even when people ask for new music, you can generally interpret it to mean “new to them” rather than new in terms of release date. The beauty of music being “new to them” is that even if you know something is old, as long as they haven’t heard it yet, it is new to them. For example, when people ask me for Taking Back Sunday-style rock music I’ll recommend <a href="http://youtu.be/Q1W5JBKeF04">We Are Scientists - With Love and Squalor</a>. Even though the album was released in 2005, the band was still pretty small at the time, so most people didn’t (and still don’t) know about them. Score one for the relative newness of music.</p><p>That is all for now.</p>
<h3>Web Application Caching Strategies: Generational caching</h3>
<p><i>June 5, 2011</i></p>
<i>This is the second in a two-part series on caching strategies.
<a href="http://www.regexprn.com/2011/06/web-application-caching-strategies.html">The first post provided an introduction and described write-through caching</a>. </i><br />
<br />
In the previous post in this series on caching strategies we described write-through caching. While an excellent strategy, it has somewhat limited application. To see why that is the case and how generational caching improves upon it, let's start with an example.<br />
<br />
Imagine that you are creating a basic blogging application. It is essentially just a single “posts” table with some metadata associated with each post (title, created date, etc). Blogs have two main types of pages that you need to be concerned about, pages for individual posts (e.g. “/posts/3”) and pages with lists of posts (e.g. “/posts/”). The question now becomes, how can you effectively cache this type of data?<br />
<br />
Caching pages for individual posts is a straightforward application of the write-through caching strategy I discussed in a previous post. Each post gets mapped to a cache key based on the object type and id (e.g. “Post/3”) and every update to a post is written back to the cache.<br />
<br />
The more difficult question is how to handle pages with lists of posts, like the home page or pages for posts in a certain category. In order to ensure the cache is consistent with the data store, any time a post is updated, all cache keys that contain the post need to be expired. This can be difficult to track since a post can be in many cache keys at the same time (e.g. latest ten posts, posts in the Ruby category, posts favorited by user 15). While you could try to programmatically figure out all keys that contain a given post, that is a cumbersome and error-prone process. A cleaner way of accomplishing this is what is called generational caching.<br />
<br />
Generational caching maintains a "generation" value for each type of object. Each time an object is updated, its associated generation value is incremented. Using the post example, any time someone updates a post object, we increment the post generation value as well. Then, any time we read/write a grouped object in the cache, we include the generation value in the key. Here is a sequence of actions and what would occur in the cache when performing them:<br />
<br />
<script src='http://pastie.org/2025503.js'></script><br />
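As a sketch of such a sequence (the key names and generation values here are illustrative assumptions, not taken from the embedded snippet):

```
# Post generation = 1
read   latest posts   ->  key "PostList/latest/gen1"   (miss: query DB, store)
read   latest posts   ->  key "PostList/latest/gen1"   (hit)
update post 3         ->  write "Post/3", bump Post generation to 2
read   latest posts   ->  key "PostList/latest/gen2"   (miss: query DB, store)
read   latest posts   ->  key "PostList/latest/gen2"   (hit)
```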
<br />
By including the generation in the cache key, any time a post is updated the subsequent request will miss the cache and query the database for the data. The result of this strategy is that any time a post object is updated/deleted, all keys containing multiple posts are implicitly expired. I say implicitly expired because we never actually have to delete the objects from the cache; by incrementing the generation we ensure that the old keys are simply never accessed again. <br />
<br />
Here is some sample code for how this could be implemented (in pseudo-Python):<br />
<br />
<script src='http://pastie.org/2025446.js'></script><br />
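A self-contained sketch of how this could look; the dict-backed `cache` and `db` and the key scheme are illustrative assumptions standing in for Memcached and a SQL database, not the original pastie code:

```python
# Illustrative generational-caching sketch; plain dicts stand in for
# Memcached (cache) and the database (db).
cache = {}
db = {"Post/1": {"title": "Hello"}, "Post/2": {"title": "World"}}

GEN_KEY = "Post/generation"

def get_generation():
    # Lazily initialize the generation counter for the Post type.
    return cache.setdefault(GEN_KEY, 1)

def increment_generation():
    # Called on every post update/delete.
    cache[GEN_KEY] = get_generation() + 1

def get_post_list(name, query):
    # Embed the current generation in the key, so any post update makes
    # previously cached lists unreachable (implicitly expired).
    key = "PostList/%s/gen%d" % (name, get_generation())
    if key not in cache:
        cache[key] = query()  # cache miss: fall through to the database
    return cache[key]

def update_post(post_id, data):
    post_key = "Post/%d" % post_id
    db[post_key] = data     # UPDATE/INSERT in the real database
    cache[post_key] = data  # write-through for the individual post
    increment_generation()  # implicitly expire every list-of-posts key
```

For example, after `update_post(3, ...)` a previously cached “latest posts” list is re-fetched under a new `gen2` key, while the stale `gen1` entry is simply never read again and is eventually evicted by the cache's LRU policy.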
<br />
After implementing this strategy in multiple applications, I want to point out a few things about how it works and its performance properties. In general I have found that this strategy can dramatically improve application performance and lessen database load considerably. It can prevent tons of expensive table scans from ever hitting the database. By sparing the database these requests, the queries that do hit the database can be completed more quickly. <br />
<br />
In order to maintain cache consistency this strategy is conservative in nature; it results in keys being expired that don’t necessarily need to be. For example, if you update a post in a particular category, this strategy will expire the keys for all the categories. While this may seem inefficient and ripe for optimization, I’ve often found that most applications are so read-heavy that these types of optimizations don’t make a noticeable overall performance difference. Plus, the code to implement them becomes application- or model-specific, and more difficult to maintain. <br />
<br />
I previously mentioned that in this strategy nothing is ever explicitly deleted from the cache. This has some implications with respect to the caching tool and eviction policy that you use. This strategy was designed to be used with caches that employ a Least Recently Used (LRU) eviction policy (like Memcached). An LRU policy will result in keys with old generations being evicted first, which is precisely what you want. Other eviction policies can be used (e.g. FIFO), although they may not be as effective.<br />
<br />
Overall, I’ve found generational caching to be a very elegant, clean, and performant caching strategy that can be applied in a wide variety of applications. So next time you need to do some caching, don’t forget about generations!
<h3>Web Application Caching Strategies: Write-through caching</h3>
<p><i>June 5, 2011</i></p>
You have probably heard about all of the big websites which rely heavily on caching to scale their infrastructure. Take a look at the <a href="http://en.wikipedia.org/wiki/Memcached">Wikipedia entry on Memcached</a> and you will find a veritable who's who of big internet companies. While we know these websites do caching, I have found very little written about how they actually do it. Unfortunately, a lot of caching is so particular to the individual website that it isn't very useful to most developers. There are, however, a few overarching caching "strategies" which, if done correctly, guarantee you will never get stale data from your cache. In this series I'm going to discuss two of these strategies: write-through and generational caching.<br />
<br />
The first and simplest strategy is write-through caching. When you write to the database, you also write the new entry into the cache. That way, on subsequent get requests the value should always be in the cache and never need to hit the database. The only ways a request would miss the cache are a) the cache filled up and the value was purged, or b) a server failure. Here is some sample code in Python for how this works. I'm using database.get/put/delete as shorthand for SELECT/INSERT/DELETE in your database of choice.<br />
<br />
<script src='http://pastie.org/2025483.js'></script><br />
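A minimal runnable sketch of the idea; the plain dicts and the type-prefixed key scheme are illustrative stand-ins for the real cache and database, not the original snippet:

```python
# Illustrative write-through cache; `cache` and `database` are plain dicts
# standing in for Memcached and a SQL database.
cache = {}
database = {}

def make_key(obj_type, obj_id):
    # Prefix keys with the object type to avoid id collisions, e.g. "User/17".
    return "%s/%s" % (obj_type, obj_id)

def put(obj_type, obj_id, value):
    key = make_key(obj_type, obj_id)
    database[key] = value  # INSERT/UPDATE in the real database
    cache[key] = value     # write the new value through to the cache

def get(obj_type, obj_id):
    key = make_key(obj_type, obj_id)
    if key in cache:
        return cache[key]          # cache hit
    value = database.get(key)      # miss: fall back to the database
    if value is not None:
        cache[key] = value         # re-populate the cache for next time
    return value

def delete(obj_type, obj_id):
    key = make_key(obj_type, obj_id)
    database.pop(key, None)  # DELETE in the real database
    cache.pop(key, None)     # keep the cache consistent with the database
```

Note that `get()` falls back to the database on a miss and re-populates the cache, so a purged value only costs a single database hit.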
<br />
The strategy itself is really simple to understand and for many workloads can result in dramatic performance improvements and decreased database load. <br />
<br />
While this is a simple and clean caching strategy, there are a few things you should be aware of in order to avoid some common issues when implementing it. <br />
<br />
Oftentimes people will cache database objects using the database id as the key. This can result in conflicts when caching multiple types of objects in the same cache. A simple solution is to prepend the type of the object to the front of the cache key (e.g. “User/17”). <br />
<br />
Next, for any put/delete operations to the database, it is important to check that those operations completed successfully before updating the cache. Without this type of error checking you can end up in situations where the database update failed but the cache update happened anyway, which results in an inconsistent cache.<br />
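A sketch of that check, assuming, purely for illustration, a database client whose `put` returns a success flag; `FakeDatabase` is a hypothetical stand-in, not a real driver:

```python
# Hypothetical database client used only to illustrate the error check.
class FakeDatabase:
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.rows = {}

    def put(self, key, value):
        if not self.healthy:
            return False  # simulate a failed INSERT/UPDATE
        self.rows[key] = value
        return True

def safe_put(database, cache, key, value):
    if not database.put(key, value):
        return False    # DB write failed: leave the cache untouched
    cache[key] = value  # only now is it safe to write through to the cache
    return True
```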
<br />
While this strategy is effective for caching single objects, most applications fetch multiple objects from the database (e.g. get all books owned by "Joe"). For a strategy which handles multiple objects, see the <a href="http://www.regexprn.com/2011/06/web-application-caching-strategies_05.html">next post in the series about generational caching</a>.
<h3>2011 PaaS Predictions: Winners and Losers</h3>
<p><i>December 25, 2010</i></p>
This past year has been an interesting one in the cloud space, and particularly for Platform-as-a-Service. While PaaS has certainly increased in popularity over the past year, I think it is still at the bottom of what is going to be a hockey-stick-like growth curve. That obviously means that 2011 is going to be an even bigger year for PaaS. Given that, I wanted to put down my predictions for what I think is going to happen in the space over the next year.<br />
<h4>Heroku</h4><p>Heroku’s <a href="http://online.wsj.com/article/BT-CO-20101208-711105.html">$212 million acquisition by Salesforce</a> was probably the biggest news of the year in the platform space. Not only is it a huge win for one of the most popular platforms, it is also validation that the space is going to be huge. However, any time a small, agile startup gets bought by a giant corporation there are obvious concerns that it will lose its agility and get out of touch with the customers who made it successful. Given what I’ve seen from the Heroku guys so far, I think they have a strong enough vision and understanding of their customers that they will be able to fend off the temptations of Salesforce execs to meddle with their newly acquired toy.</p><p>I also suspect that we will see Heroku get a little more enterprisey in their offering over the next year. Unfortunately I don’t think they will see a lot of uptake, since most enterprises move slower than molasses on a cold day.<br />
<b>Overall:</b> <span class="Apple-style-span" style="color: #f1c232;">OK</span><br />
</p><h4>Google AppEngine</h4><p>In addition to having incredibly liberal quotas, AppEngine has long been one of the most feature-rich platforms available. Yet there have been two main issues which really held back what would otherwise have been incredible growth: performance and the datastore.<br />
One of the most common complaints I have heard from developers using AppEngine is poor performance. This issue came to a boiling point earlier in the year when datastore latencies got so bad that Google stopped charging for datastore CPU usage. The datastore issues were subsequently resolved and Google was able to deliver a substantially faster datastore API. Following this, I think over the next year Google will start drilling down on other performance issues and deliver a much more performant and stable platform.</p><p>The other main point of friction has been the fact that the datastore relies on a NoSQL-like data model. While NoSQL is all the hotness right now, the vast majority of developers are still not comfortable using non-relational data models. That is why the announcement of a SQL database as part of <a href="http://code.google.com/appengine/business/">AppEngine for Business</a> came with much celebration. While it is still in private beta, I suspect that we will see it publicly released this year, which will bring in a new wave of customers ready to take advantage of the large quotas and familiar SQL tools.<br />
<b>Overall:</b> <span class="Apple-style-span" style="color: #38761d;">Very good</span></p><h4>Djangy</h4><p>While AppEngine is a great platform for Python applications, there has been pent-up demand for a Python/Django version of Heroku. New startup <a href="https://www.djangy.com/">Djangy</a> looks to be the first viable attempt at filling that gap. Looking at their documentation, they seem to have taken the “Heroku for Python” mantra very seriously, and that is a good thing. While Djangy is currently in private beta, I’m guessing that they will be gearing up for a public launch in the next 3-6 months. After that I think we will see quite a few Pythonistas migrate over to Djangy so they can get back to using the tools they are familiar with (full Django + SQL).<br />
<b>Overall:</b> <span class="Apple-style-span" style="color: #38761d;">Very good</span></p><h4>DotCloud</h4><p><a href="http://www.dotcloud.com/">DotCloud</a> is a new Y Combinator startup which aims to bring more flexibility to the platform space. Rather than a platform being married to a single language and database technology, they aim to let you mix and match many languages (Ruby, PHP, Java, JavaScript) and database technologies (MySQL, PostgreSQL, Redis, MongoDB). It is the choose-your-own-adventure version of cloud platforms. Unfortunately I think using DotCloud will be just that, an adventure, and not the good kind.</p><p>As I discussed in <a href="http://www.regexprn.com/2010/12/platform-as-service-paas-fragmentation.html">my previous blog post</a>, fragmentation is a huge issue for platform providers. I was discussing fragmentation within the context of a single language, but the problem absolutely explodes when you start mixing multiple languages with multiple databases. Consider the fact that even Google, a company with a tremendous amount of technical resources, only supports two languages and one database. While their vision is a great one, I think they are going to run into some issues. In particular, they are going to reach a very unfortunate fork in the road. One path is to provide only barebones support for each language/database and risk frustrating customers with the lack of options. The alternative is to provide more in-depth support for each component and live with the burden of regression testing the myriad of possible combinations, destroying their agility. Frustration seems inevitable either way.<br />
<b>Overall:</b> <span class="Apple-style-span" style="color: #990000;">Poor</span></p>Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com0tag:blogger.com,1999:blog-45519206452066173.post-67358139365705783762010-12-24T23:06:00.000-08:002010-12-24T23:06:25.450-08:00Platform as a Service (PaaS) Fragmentation<p>You will find no shortage of cloud pundits around who will tell you that the Platform layer is the future of the cloud stack. The thought is that in the future we will be able to forget about the menial tasks of managing servers and just worry about developing applications. While a pleasant vision, let’s take a walk through the cold harsh reality.</p><p>In the PaaS space there is a big elephant in the room, and it goes by the name fragmentation. Stop and think about it for a second: how many popular programming languages are there? How many popular frameworks does each one of those languages have? That's not to mention the hell that is versioning of both of those things. </p><p>Let’s look at a quick example to illustrate this point. You’ve decided you want to create a platform; the first question is then which language you want your platform to support. After considering the options you decide on your language of choice. Depending on which part of the language release cycle you are in, you are probably looking at a decent split between the current version, the previous version, and the experimental new version. Unless you get lucky in terms of timing or are particularly aggressive in deprecating versions, you are probably going to want to support all three versions. We see this exact trend in both the Ruby and Python worlds. Note that I haven’t even considered any alternative runtime implementations (e.g. Ruby Enterprise Edition, Python Unladen Swallow). Next is the choice of web framework(s). Almost every popular web programming language has a myriad of web frameworks to choose from. 
You have the most popular framework (Rails, Django), the minimal framework (Sinatra, web.py), and a whole bunch of other long-tail frameworks. Depending on the language you could easily be looking at 3-5 viable contenders for frameworks, not to mention versioning. I could keep going, but I think you get the point.</p><p>This ties into a recent Twitter conversation where <a href="http://twitter.com/georgevhulme/">@georgevhulme</a> <a href="http://twitter.com/georgevhulme/status/18408083571085312">posed the question</a>:<br />
<blockquote>“Who will win the Paas battle next year, and become the dominate platform?”</blockquote></p><p>In my opinion, the answer to this question is none. No single company will become the dominant platform in the next year, or even next five years for that matter. The PaaS space is simply too fragmented for any one company to own a substantial portion.</p>Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com0tag:blogger.com,1999:blog-45519206452066173.post-27110642280897597352010-09-16T22:01:00.000-07:002010-09-16T22:17:57.150-07:00Better Cache Busters (aka Asset Timestamps)<p>If you read any article on web performance they will almost undoubtedly mention expire-headers. They effectively tell the browser not to bother asking the server if an asset (e.g. an image or stylesheet) has changed until the expiration time. Ideally they should be set for many years in the future so that client browsers aren’t spending tons of time waiting for 304 Not Modified responses. The problem with expires headers is that when the file changes you need some way to signal the browser to fetch the updated file. The general trick is to change something in the URL which makes the browser think it is a new file and hence fetch it from the server. This is often called a cache-buster. That is why when you pop open Firebug you will often see files named puppies_3123141.png or styles.css?12345678.</p><p>Now that we know something needs to be added to the URL, the question becomes what should we add? There are three properties that we are really interested in:<br />
<blockquote>1. The value should (only) change when the contents of the file changes<br />
2. The value should be consistent across different machines<br />
3. It should be fast to compute<br />
</blockquote></p><p>The first possibility is to take the approach used by Ruby on Rails, which is to use the modified time of the file. While this satisfies property 3, it does not work for 1 or 2. It breaks property 1 because appending an empty string to a file causes its modified time to update without the actual contents of the file changing. Property 2 is a much larger issue which many people have to face. Static assets are generally served from multiple machines, which means the modified time needs to be consistent across all of them. This is very difficult to achieve, particularly when the files are under version control. While modified time isn’t a great solution, it does bring us to another possibility, version control ids.</p><p>One of the great things about version control systems is that in order to give you a reference to a particular commit, unique ids have to be generated for each one. Given that is the case, we can simply use the current commit id (or hash for you git’ers) as the cache buster. We are currently on commit X, I update a file and commit it, and we are now on commit Y. Since all machines will be updating their files from version control, everyone should be on the same commit id. In terms of speed, it’s not necessarily the fastest (especially if you are using SVN) but it only needs to be retrieved once since it is the same for all files. The problem with commit ids is that they violate property 1. When a single file changes in the repo, every file’s id changes. That means that every time you deploy new code for your webapp, each client is going to fetch all the files again, even if all you changed was a README. Getting better, but still leaving something to be desired.</p><p>The last possibility I am going to talk about is an oldie but goodie, the MD5 hash. The cache buster of an asset is simply the MD5 hash of the asset itself.
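As a quick sketch (Python shown; the helper name and paths are illustrative, not the Rails patch linked below), such a hash can be computed with the standard library:

```python
import hashlib

def asset_cache_buster(path):
    """Return the MD5 hex digest of a file's contents for use as a
    cache buster, e.g. styles.css?<digest>."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Hash in chunks so large assets need not fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()
```
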
It satisfies property 1, and as an added bonus if the file is ever changed and then rolled back, the MD5 hash will roll back with it (git reset anyone?). Property 2 is no problem: the contents of the file are the same across the different machines, hence so is the MD5. The only thing left is property 3, speed. Clearly computing the MD5 is going to be more time-consuming than fetching modified time. However, just about every language has a standard hash library written in C for computing MD5’s and it’s pretty darn fast. The only place I could see this being an issue is if you have very large files or an extremely large number of them. Even still, you can just write a deploy script that precomputes all the hashes beforehand. </p><p>Overall, using MD5s as a cache buster gives you all of the nice properties you could want with very little drawback. I went ahead and wrote a monkey patch for Rails that changes the asset id method to use MD5’s; the source code is available here (<a href="http://pastie.org/1164279">http://pastie.org/1164279</a>). Enjoy busting caches.</p>Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com2tag:blogger.com,1999:blog-45519206452066173.post-65593980011513183462010-06-16T11:26:00.000-07:002010-06-16T11:30:57.851-07:00Twitter is a bar and other Social Network metaphors<p>I have spoken with a few people recently about social networks and how they compare. In explaining it, I’ve settled on a few metaphors that most people seem to understand pretty intuitively. Here goes...</p><br />
<b>Twitter is a bar </b><br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJnPf9F5P7y9SksfwSBL_1cXj3FJ-6AIVBURxhXlDSjsc7N-AduMOtJSVkZyNMpdM_4PfgA_vCH2i9vSEoALbj1H_x7uVti9Edj1fNNI5BMRD0k59W636lDRXGkB7MNcYSr5c7eNtAPg/s1600/busybar.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="133" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJnPf9F5P7y9SksfwSBL_1cXj3FJ-6AIVBURxhXlDSjsc7N-AduMOtJSVkZyNMpdM_4PfgA_vCH2i9vSEoALbj1H_x7uVti9Edj1fNNI5BMRD0k59W636lDRXGkB7MNcYSr5c7eNtAPg/s200/busybar.jpg" width="200" /></a><p>Twitter is your neighborhood Irish bar on a Friday night. It’s busy, noisy, and there are tons of conversations happening simultaneously. This makes Twitter a great place for short and casual conversations, particularly since there are a bunch of people around to talk to. With the many conversations happening within ear-shot, it’s easy to catch something of interest and hop into another conversation at a moment’s notice.</p><p>However, the raucousness makes Twitter a difficult place to have an extended conversation. It’s certainly not the place to (at least easily) discuss the intricacies of foreign policy or War and Peace. The people who try to engage in these conversations generally are quickly frustrated and leave for other establishments.</p><br />
<b><br />
Facebook is a coffee shop</b><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWMJLdZmlHgxJ2Ru8oGC5jH8Dm57YIMzpOfTP1Ke6VPVPbHL8lxbMwAWd_dCWfb4sir4B5sXRvnECYHmPj0bbEKOU3hnuOkqZY9PA7jl8s3SVY2GkYQ1PWUJ9I3lOFgQZ3oX5C85H4Ng/s1600/coffeeshopinterior.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWMJLdZmlHgxJ2Ru8oGC5jH8Dm57YIMzpOfTP1Ke6VPVPbHL8lxbMwAWd_dCWfb4sir4B5sXRvnECYHmPj0bbEKOU3hnuOkqZY9PA7jl8s3SVY2GkYQ1PWUJ9I3lOFgQZ3oX5C85H4Ng/s200/coffeeshopinterior.jpg" width="200" /></a></div><p>Unlike the raucous and noisy crowd of Twitter, Facebook presents a much quieter and civilized environment akin to that of your local coffee shop. Plush and comfortable surroundings make Facebook a nicer environment for conversing with friends. Conversations can flow freely but be careful what you say since your parents or kindergarten teacher may walk in at any moment. While some people are there to chat with friends, others are simply hanging out and taking advantage of the free amenities (e.g. photos).</p><br />
<b><br />
LinkedIn is a library</b><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlDzvtf0G-zbZSPgd-7citX5dH0AJRvgn8ltcQpJnSMrHZY8ZCgdsx447fvmyFgKszE7HeaE4qVyEtp8b6gonlkXGFo1wStc6CdJw21dryxDV2W8nHqa7Lr3qUo3UAsLuW6bM9uff9MA/s1600/library-books.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlDzvtf0G-zbZSPgd-7citX5dH0AJRvgn8ltcQpJnSMrHZY8ZCgdsx447fvmyFgKszE7HeaE4qVyEtp8b6gonlkXGFo1wStc6CdJw21dryxDV2W8nHqa7Lr3qUo3UAsLuW6bM9uff9MA/s200/library-books.jpg" width="200" /></a></div><p>Last but not least, LinkedIn is your local library. Walking into the library, the silence is almost deafening. The whispers of others serve as reminders that a few people occasionally visit this establishment. The library’s biggest asset is that it’s filled with valuable and accurate historical information. Yet, looking at this information you get the sense that it’s stale and rarely updated. While you may not be able to find out the latest news trends, if you want to know where your competitor’s CEO went to college, this would be a good place to look.</p><br />
<br />
<i>So where do you want to hang out?</i>Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com0tag:blogger.com,1999:blog-45519206452066173.post-86712711504578134152010-05-17T20:54:00.000-07:002012-12-17T09:27:33.105-08:00Killing Multithreaded Python Programs with Ctrl-C<p>If you have ever done multithreaded programming in Python you have probably found it frustrating that you can't simply hit Ctrl-C in the terminal and have it exit like a normal Python process. Instead you have to put the process in the background (Ctrl-Z) and then either "kill %%" or kill the PID. The good news is that it doesn't have to be this way. After experimenting a bit I finally figured out why it doesn't work normally and what you have to do to make it work. </p><p>Normally when I write a threaded program in Python it looks something like this...</p><script src="http://pastie.org/964895.js">
</script><br />
<p>The problem with this program is that if you hit Ctrl-C it doesn't do anything. The reason is that join() is a blocking operation. As a result the process will only receive the signal for Ctrl-C when join() becomes unblocked, which in this case will never happen.</p><p>In order to handle Ctrl-C with multiple threads you can use the following code:</p><script src='http://pastie.org/971527.js'>
</script>
<br />
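In case the embedded pastie does not load, here is a minimal sketch of the same pattern (the Worker class and kill_received flag follow the description below; other details are illustrative, and Python 3 spells the post's isAlive() as is_alive()):

```python
import threading
import time

class Worker(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.kill_received = False

    def run(self):
        # Work in small units and check the flag, so the thread can
        # finish up and return when the parent asks it to die.
        while not self.kill_received:
            time.sleep(0.1)

def main():
    threads = [Worker() for _ in range(4)]
    for t in threads:
        t.start()
    try:
        # join(timeout) unblocks periodically, which lets the main thread
        # receive the KeyboardInterrupt that Ctrl-C raises.
        while threads:
            threads = [t for t in threads if t is not None and t.is_alive()]
            for t in threads:
                t.join(1)
    except KeyboardInterrupt:
        # Note: "except Exception" would NOT catch KeyboardInterrupt.
        for t in threads:
            t.kill_received = True

if __name__ == "__main__":
    main()
```
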
<p>This code does a few things differently in order to make it handle Ctrl-C as we would like. First, instead of using join() we use join(timeout), which tries to join a thread but gives up if the join does not complete before the timeout elapses. This allows the main thread of execution to continue doing other things, in particular waiting for a KeyboardInterrupt to be thrown, which is what Ctrl-C raises. Since join will return upon timeout, we keep only the threads which aren't None and still respond to isAlive().</p><p>The next thing is that if child threads never return or take a really long time, you need a way to notify the child that it should die. This is accomplished by the kill_received flag in the Worker class. When that flag is set by the parent process the child knows that it should finish up what it is doing and return.</p><p>The last thing is something that caught me off guard a bit. Initially, in the main() while loop I was trying to catch all exceptions that came up by using "try...except Exception:". As it turns out, Exception does not include KeyboardInterrupt, meaning that Ctrl-C's that are raised in that block will not be caught. If you instead use "try...except KeyboardInterrupt:" or just "try...except:" it will work as you expect it to.</p><p>So there you have it...how to exit multithreaded Python programs using Ctrl-C.</p>Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com7tag:blogger.com,1999:blog-45519206452066173.post-45093071482068814832010-05-03T16:29:00.000-07:002010-05-24T12:50:14.014-07:00Using Memcached as a Distributed Locking Service<p>One of the beauties of memcached is that while its interface is incredibly simple, it is so robust and flexible that it can do nearly anything.</p><p>As an example, I am currently working on a project where we need to do distributed locking. 
The reason for this is that we are running into what is generally known as <a href="http://stackoverflow.com/questions/855105/lost-update-problem-in-concurrency-control">the lost update problem</a>. Basically, we need to read an object, update it, and write it back in a serialized fashion in order to ensure that no updates are lost. The easiest solution is to have a lock which must be acquired in order to do the read/update/write operations. Locks are great with a set of threads, but once you break out of the context of a single machine you need something distributed. While there are certainly other libraries or complex pieces of code that do this, I find this to be a pretty elegant solution for a set of nodes which have access to the same memcached servers.</p><p>Provided below is a class called MemcacheMutex which provides the standard acquire/release mutex interface, but with the twist that it is backed by memcached. </p><script src="http://pastie.org/944369.js">
</script><br />
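In case the pastie does not load, here is a sketch of what such a class can look like. It assumes a python-memcached-style client whose add() is atomic and returns falsy when the key already exists; the class name MemcacheMutex comes from the post, but the method details and the expiration safety net are illustrative additions:

```python
import time

class MemcacheMutex(object):
    """A mutex backed by memcached, relying on the atomicity of add():
    add() succeeds only if the key does not already exist, so whichever
    process adds the key first holds the lock."""

    def __init__(self, client, key, expires=60):
        self.client = client    # any client exposing add()/delete()
        self.key = key
        self.expires = expires  # auto-expire so a crashed holder can't wedge everyone

    def acquire(self, blocking=True, poll_interval=0.05):
        while True:
            if self.client.add(self.key, 1, self.expires):
                return True
            if not blocking:
                return False
            time.sleep(poll_interval)

    def release(self):
        self.client.delete(self.key)
```

With python-memcached, the client would be something like memcache.Client(["127.0.0.1:11211"]), and every node that shares the memcached servers shares the lock.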
<p>There is one big caveat. Since memcached is not persistent it is possible that the lock could get evicted from the cache. This could result in a case where P1 acquires the lock, the key gets evicted, then P2 is able to acquire the lock even though it has not been released by P1. If your memcached is doing tons of operations or you are holding onto the lock for really long periods of time then this could become an issue. In that case, you should use a persistent version of memcached, <a href="http://memcachedb.org/">memcachedb</a>.</p>Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com1tag:blogger.com,1999:blog-45519206452066173.post-57885299203934390642010-04-16T15:28:00.000-07:002010-04-17T14:29:25.890-07:00The Cloud Support Conundrum<p>Two nights ago I read an interesting tweet by <a href="http://twitter.com/jclouds">@jclouds</a> (Adrian Cole) that I think deserves some discussion. It is shown below:</p><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDTz_9kAevPI2YaJ-QpQlXrbVnrIJeGqcK-ZXpcycG9H5qwNyISiwWPoVRLppP6HqAVW3wx0dDHiHJ7ScNnJTcTi_n6Qnv8JRjhU2IPbrXx17ul5HW5y0GJyhsO2zl_Byv0dqn0Soryw/s800/jcloudstweet.png" alt="rightscale might as well be a division of amazon; not much #cloud diversity in their offerings"></img><br />
<p>For the record, RightScale supports Rackspace, GoGrid, and Eucalyptus in addition to AWS. However, I think there is a bigger issue at hand, which is how many clouds should cloud management platforms really support?</p><p>I'm going to take the opposite stance on this issue. I actually think it is quite impressive that cloud management platforms in general support as many clouds as they do. From a financial perspective you only really need to support one cloud (AWS) since it dominates usage by far. However, I think most people agree that putting all of the eggs in one basket is foolhardy. So the next logical move is to support the next biggest player in the market, Rackspace. That is why support for AWS and Rackspace is ubiquitous across all cloud management services (RightScale, enStratus, cloudkick). At that point even if one of your legs gets kicked out from under you, you still have another one to stand on.</p><p>Aside from AWS and Rackspace, there is a smattering of "other" clouds for which adoption and usage has been small. Case in point: in addition to AWS and Rackspace, the three aforementioned cloud management services support a total of nine additional clouds. Yet, the only overlap across those nine clouds is RightScale and Cloudkick supporting GoGrid. The rest are unique to that particular service.</p><p>While I would like to think that the divergent opinions in cloud support are a result of too many awesome clouds being available, that is simply not the case. Instead, these platforms have the unpleasant task of trying to find clouds that are sufficiently feature-rich to support actual usage. While most clouds seem to be improving in providing the endemic cloud features, this process can be slow and leave much to be desired.</p><p>The alternative is customers asking management platforms to support some of the other clouds. 
While this can and probably does happen, I can't imagine it happens in sufficient quantity (or with sufficient $) to make it a worthwhile venture. If you ask customers they would love it if you added support for &lt;some obscure cloud only they use&gt; or even better &lt;some cloud their son is building as a science fair project&gt; but let's be realistic, it's not gonna happen. </p><p>With that being said, I do think there will be more diversity in cloud offerings, just not as quickly as most cloud folks want it to happen. At this point the focus should be getting people onto <strong>the</strong> cloud, not a cloud.</p>Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com1tag:blogger.com,1999:blog-45519206452066173.post-29782601263777535752010-03-28T20:45:00.000-07:002010-11-04T20:23:21.634-07:00Unknown Substitution Cipher Interview Question<blockquote>You are given a file containing a list of strings (one per line). The strings are sorted and then encrypted using an unknown substitution cipher (e.g. a < c, b < r, c < d). How do you determine what the mapping is for the substitution cipher? The unencrypted strings can be in any language.</blockquote><br />
<p>After a few questions the interviewer clarified and stated that the cipher was a simple substitution cipher where every letter in the alphabet is mapped to one and only one other character in the alphabet.</p><p>Given this problem the simplest solution (not counting brute force) would be to do <a href="http://en.wikipedia.org/wiki/Frequency_analysis">frequency analysis</a>. If the strings contain words or phrases in a popular language there is a reasonable shot that frequency analysis would at least get you started figuring out the cipher. After discussing this and how it would work the interviewer told me "you do frequency analysis and cannot draw any conclusions, what do you do then?"</p><p>Ah, back to the (metaphorical) drawing board. The next big clue is the fact that the strings are sorted. This type of interview question is generally worded very specifically, so when what seems like an extraneous factor is "thrown in" it probably means something. In this case the fact that the strings are sorted tells you something, in particular, the precedence of the letters. For example, given the following strings (which are sorted in ascending order)<br />
<br />
gg<b>c</b>aa<br />
g<b>gp</b>qr<br />
g<b>r</b>fzu<br />
<br />
you can start to learn about the ordering of the encrypted letters. In the above example we know that c < p from the first two strings and g < r from the second two strings. For any pair of strings we can only use the first letter that differs; everything afterward is inconsequential. Doing this over the entire file you would get a set of relationships (e.g. c < p, g < r, q < z, z < t). The next questions are then "what can you do with this data and what is the most appropriate method for storing it?"</p><p>This is where things get tricky. After going through some of the common data structures it seemed like a tree, in particular a binary tree, might be the right fit. The problem with a binary tree is that you don't necessarily have all the required information to create the tree. For example, say the root of the tree was the letter Z and it had a left child Q. Now say you are trying to insert the letter G. You may know that G < Z so it needs to be in its left sub-tree, however you may not know what the relationship is between G and Q. In going through the list of strings, you are not guaranteed to have all relationships between all letters.</p><p>Without all the relationships you are forced to fall back to a graph structure. Each letter becomes a node in the graph and each greater-than relationship becomes a directional edge in the graph (g < y would have an arrow pointing from node g to node y). With the graph populated with all the edges, now what do you do? The interviewer assured me that each node is part of the graph (i.e. there are no partitions).</p><p>This leads to one of the tricky parts about graphs. Since all nodes are equal, where do you start? There is no "root" node like there is in a tree. In this case you want to start with the node with no incoming edges; this is the smallest letter. Such a node has to exist since there cannot be cycles in the graph. 
You can then delete that node from the graph and repeat this process until there are no more nodes in the graph.</p><p>Now for the fun part, what is the complexity of this algorithm? Finding the smallest node in the graph takes N operations for the first node, N-1 for the second node, and so on, which leaves you with N+(N-1)+(N-2)+...+1. This can also be written as the summation from 1 to N of i, which equals (N*(N+1))/2 (<span id="goog_1487659864"></span><a href="http://en.wikipedia.org/wiki/List_of_mathematical_series">see here<span id="goog_1487659865"></span></a> if you don't believe me). This means that the algorithm is O(n^2).</p><br />
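The procedure described above — extract relationships from adjacent sorted strings, then repeatedly remove a node with no incoming edges — is a topological sort. A sketch in Python (helper names are illustrative, and this follows the O(n^2) remove-the-smallest approach rather than a faster queue-based variant):

```python
def extract_relations(sorted_strings):
    """For each adjacent pair of sorted strings, the first position where
    they differ yields one 'less than' relationship between two letters."""
    relations = []
    for s1, s2 in zip(sorted_strings, sorted_strings[1:]):
        for c1, c2 in zip(s1, s2):
            if c1 != c2:
                relations.append((c1, c2))
                break
    return relations

def recover_ordering(relations):
    """Repeatedly remove a node with no incoming edges; that node is the
    smallest remaining letter."""
    edges = {}
    nodes = set()
    for a, b in relations:
        edges.setdefault(a, set()).add(b)
        nodes.update([a, b])

    order = []
    while nodes:
        # A node that no remaining node points to has no incoming edges.
        pointed_to = set()
        for n in nodes:
            pointed_to |= edges.get(n, set()) & nodes
        candidates = nodes - pointed_to
        if not candidates:
            raise ValueError("cycle detected: input was not consistently sorted")
        n = candidates.pop()
        order.append(n)
        nodes.remove(n)
    return order
```

On the example strings, extract_relations(["ggcaa", "ggpqr", "grfzu"]) yields [('c', 'p'), ('g', 'r')], matching the two relationships derived above.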
<p>Enjoy.</p>Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com0tag:blogger.com,1999:blog-45519206452066173.post-80382177892822059832010-03-02T23:41:00.000-08:002010-03-02T23:47:14.566-08:00NoSQL: The Paradox of Choice<p>I have been watching a lot of TED talks recently and while there are a ton of excellent talks, the one I want to talk about today is called <a href="http://www.ted.com/talks/barry_schwartz_on_the_paradox_of_choice.html">"The Paradox of Choice"</a> by Barry Schwartz (he has a book by the same title). While watching his talk it struck me how relevant the topic is to NoSQL databases. I want to talk about these connections and their implications.</p><b>More != Better</b><p>I want to start with an idea that many people know intuitively but is still worth calling out explicitly. Having more options is not necessarily a good thing; in fact, in many cases it's a bad thing. In his talk Schwartz cites a study which showed that for every additional 10 mutual funds offered by an employer, participation went <b>down </b>by 2%. As the number of options increases, the amount of time and effort required to make a decision increases significantly, to the point where you are put into a state of paralysis as a result of the myriad of options. For all intents and purposes the birth of the NoSQL database came with the publication of the BigTable paper by Google and the Dynamo paper by Amazon circa 2006. Since then, the number of NoSQL databases has gone from two, to five, to 29. That's right, <a href="http://nosql-database.org/">http://nosql-database.org/</a> currently lists 29 different NoSQL databases, each of which has a slightly different feature set and benefit/drawback trade-offs. Good luck picking the right one.</p><b>More Options => Higher Expectations</b><p>When many options exist it is only natural for us to expect that one of them <i>has</i> to have the features you are looking for. 
With only one option it doesn't matter if it has the features you want since you don't have a choice. Prior to the NoSQL movement there was only one game in town and it was the relational database (MySQL/PostgreSQL/MSSQL), so you had no choice but to grin and bear it. However, now that there are almost 30 options (there probably will be by the time I'm done with this post) one of them has to have the right mix of peanut butter and chocolate to fit my tastebuds. Unfortunately this rarely turns out to be the case.</p><p>Higher expectations not only apply to features, but also performance. The promise of infinite scalability will draw a lot of eyes and ears, but in order to win them over you have to show users tangible benefits. I can't count how many blog posts I have read about people/companies who are using MySQL as a key-value store because it is faster than the key-value stores themselves. In order for NoSQL databases to win, users need to be able to benchmark the database and be impressed by its performance.</p><p>What have we learned from this? For all the hype it is getting, NoSQL is in a rough spot. The number of options is large and there aren't clear winners in any category. A few of the databases seem to be rising to the top (Cassandra for instance), but until enough people get behind a relatively small number of databases (I say pick 3) the knowledge, tool support, and amount of helpful information available online are going to remain painfully low.</p>Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com3tag:blogger.com,1999:blog-45519206452066173.post-4803988444837707762010-02-16T21:55:00.000-08:002010-02-16T21:55:29.828-08:00Peering In The Clouds<p>An idea that I have been kicking around in my head for some time now is that of cloud peering. This idea comes from the networking context whereby big ISPs will negotiate a peering agreement that allows customers on the two networks to communicate with each other. 
While the big ISPs are cut-throat competitors, they understand that if they did not work together the Internet would turn into large islands where customers can only communicate with others on their island. </p>Here is what Wikipedia has to say about the benefits of <a href="http://en.wikipedia.org/wiki/Peering">peering</a>:<br />
<br />
<blockquote>Peering involves two networks coming together to exchange traffic with each other freely, and for mutual benefit. This 'mutual benefit' is most often the motivation behind peering, which is often described solely by "reduced costs for transit services". Other less tangible motivations can include:<br />
<ul><li>Increased redundancy (by reducing dependence on one or more transit providers).</li>
<li>Increased capacity for extremely large amounts of traffic (distributing traffic across many networks).</li>
<li>Increased routing control over your traffic.</li>
<li>Improved performance (attempting to bypass potential bottlenecks with a "direct" path).</li>
<li>Improved perception of your network (being able to claim a "higher tier").</li>
<li>Ease of requesting for emergency aid (from friendly peers).</li>
</ul></blockquote><p>Are those not things that every cloud infrastructure provider would love to have more of? More redundancy/capacity/performance seems compelling to me.</p><p>The way I would envision such a system working would be for two infrastructure providers to agree to provide free bandwidth between their clouds. To see why this is important, think about the cloud today. When trying to transfer data between clouds, what ends up happening is that you are charged in both directions (i.e. out of one cloud, into the other). Say, for example, you were running an application in the Rackspace cloud but doing nightly DB backups to S3 for redundancy; you would end up paying 22 cents/gig to Rackspace for the outbound traffic and another 15 cents/gig for traffic into S3. At 37 cents/gig you could quickly start racking up a nice bill just for bandwidth. Cloud peering could alleviate a lot of this cost while attracting more customers to the cloud by reducing the concern over lock-in and outages. </p><p>This strategy seems particularly attractive for some of the smaller players which are looking to grow market share. It will be interesting to see if anything like this develops in the future.</p>Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com0tag:blogger.com,1999:blog-45519206452066173.post-76879393271421944492010-01-24T22:42:00.000-08:002010-03-06T12:20:48.071-08:00Designers Guide to Web Performance<p>I have found myself reading quite a bit about usability and web design recently. While I have learned a lot about design, I have also learned that there is a large portion of the design community which is not terribly familiar with how they can improve the performance of the websites they are creating. While a pixel-perfect layout is a beautiful thing, no one wants to wait 20 seconds for it to load. 
Since the web design community has taught me so much, I wanted to give back by writing this post on some simple techniques for improving site performance.</p><p>Welcome to Web Performance 101, here is your first assignment:</p><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYcHDtgKpGXUzALO9D5EMSB0zMrmqK-TKuHGxmLHsRHR3GXn5mpuy2mJpf21lWQK4oRgVR5sKtVpaG8DGi61GGafZNIPhlU8vJro9GOC4F3Rv6T2oCrubHBDDrsG6NWOgy8Nbk0OHVLw/s1600/waterfall-chart2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="175" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYcHDtgKpGXUzALO9D5EMSB0zMrmqK-TKuHGxmLHsRHR3GXn5mpuy2mJpf21lWQK4oRgVR5sKtVpaG8DGi61GGafZNIPhlU8vJro9GOC4F3Rv6T2oCrubHBDDrsG6NWOgy8Nbk0OHVLw/s400/waterfall-chart2.png" width="400" /></a><br />
</div><br />
<p>The above is a waterfall chart which shows how long the pieces of a particular website take to load. To be nice I've blanked out the website URLs, but I left the file extensions since they will come in handy later. What this chart shows us is that initially the browser made a request for the index ("/") page and 1342 ms later it had received all of the HTML for that page. One second in, the browser makes a request for the first javascript file, which takes 585 ms to load. So on and so forth for about five seconds. While there was more to the chart I cut it off for brevity.</p><h4>Progressive Rendering</h4><p>You probably have noticed that oftentimes your browser starts drawing the website before the entire thing has loaded; this is called progressive rendering. This is really nice because if done correctly it can make your website "feel" much faster for end users. In order for the browser to start drawing the page it needs two things, the HTML and the CSS. Until the HTML/CSS are downloaded your users are going to be staring at a blank white screen. This is why it is really important to put CSS in the head of the document before javascript and images. In the chart above there is a vertical green line at about 2.25 seconds; that is when the browser could start rendering the page. Had the designer of the site put the CSS before those five javascript files, the page could have started rendering at 1.5 seconds or even earlier. It's such a simple and easy change that there really is no reason not to make it.</p><h4>Javascript: The Performance Killer</h4><p>While every website is going to have a waterfall pattern, the goal is to have a very steep waterfall. The further the bars extend out to the right of the graph, the longer the site takes to load. Hence, we want to shoot for a very steep and short waterfall which should translate into a fast site. 
Unfortunately this chart looks more like stairs than a waterfall, and the reason is javascript.</p><p>While it may not be obvious to you, browsers are pretty smart. For example, when fetching the images on a website most browsers can download more than one at a time. Older browsers (IE7, Firefox 2) usually download two items at a time, and newer browsers (IE8, Firefox 3) download six or more at a time. Notice how there is overlap between items 8 & 9 as well as 10 & 11; those items were being downloaded in parallel. Then notice how there is no overlap between items 2-7. This is because, unlike images or CSS, javascript blocks the browser from downloading anything else. While newer browsers tend to do a better job at this and download scripts in parallel, most of the world is still running IE7, which is what was used to generate the above chart.</p><p>One way to improve this would be to take all the separate javascript files and combine them into a single larger file. This reduces a lot of the overhead incurred by downloading many files. Note how most of the bar of each javascript file is actually green; that is "time to first byte." It is the time between when your browser says "Hey, give me foo.js" and when it finally receives the first byte of that file. You can barely even see the blue lines at the end, which are the time spent downloading the actual data. By combining those javascript files you cut out most of the time spent waiting for the data to come back. There are many tools online and available for download which will automatically concatenate the files for you.</p><h4>Zipping it Up</h4><p>The easiest way to ensure that your site loads fast is to send less data. Less data means less time required to download it, which means better performance. This is where data compression comes in. Most big web servers provide the ability to compress the data they send out into a much smaller format so that it can be transferred more quickly. 
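The concatenation step described above can be done with a short Ruby script. This is just a sketch with hypothetical paths and filenames; real projects usually lean on an asset-packaging tool instead:

```ruby
require "tmpdir"

# Combine every .js file in a directory into a single bundle so the
# browser fetches one script instead of many.
def bundle_js(js_dir, bundle_name = "bundle.js")
  bundle = File.join(js_dir, bundle_name)
  files  = Dir[File.join(js_dir, "*.js")].sort - [bundle]
  File.open(bundle, "w") do |out|
    files.each { |f| out.puts(File.read(f)) }
  end
  bundle
end

# Demo on a throwaway directory with two tiny scripts.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "a.js"), "var a = 1;")
  File.write(File.join(dir, "b.js"), "var b = 2;")
  puts File.read(bundle_js(dir))
end
```

Note that this concatenates in alphabetical order; if one script depends on another, you will want to list the files explicitly instead.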
Once the data reaches the browser, the browser is smart enough to know that it is compressed, and it will uncompress it and read it just the same way it normally would. The most popular compression algorithm used today is called GZIP, and it can often cut a file down to half or even a third of its size. It is important to note that this compression is only effective on text data, so you want to gzip your HTML, CSS, and Javascript files. You don't want to gzip images since it usually eats up more server CPU than it's worth. The best thing about gzip is that it only requires a few extra lines in the server config file and then you're done. <a href="http://www.techiegyan.com/?p=251">This website</a> has a good guide for how to enable it for Apache.</p><p>As I mentioned above, it is often a good idea to combine all of your javascript files into a single file, and the same holds true for CSS. So if you have your CSS split across many files, combine it into a single file so that it loads faster. An extra bonus is that compression rates tend to get better as files get larger. Thus, while the amount of text is still the same, you will probably end up transferring less data and skipping much of the time wasted waiting for it to arrive.</p><p>That's where we will stop for today. Hopefully you learned a few things about web performance which you can apply to websites you are working on. 
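For reference, the Apache side of the gzip change really can be just a couple of lines. This is a sketch using mod_deflate (the module that handles compression in Apache 2); the exact MIME types to list depend on your setup:

```apache
# Compress only text formats; images are already compressed
# and aren't worth the extra server CPU.
AddOutputFilterByType DEFLATE text/html text/css application/javascript application/x-javascript
```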
Keep in mind there are a lot more tips and tricks for improving performance; this is only the tip of the iceberg.</p>Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com4tag:blogger.com,1999:blog-45519206452066173.post-3986472340986861082010-01-17T18:37:00.000-08:002010-01-22T15:54:00.676-08:00Private clouds are transitional<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg47xl9r5XyLpY55MnmVVhLFHS5mbi25jWISeCkp9mj7PDAIAABLMrXNdBhcwDicYfCaHDUlYmvd80msNMTlUL07C6FgnRQeMC9FpfuIGkMlvgEGHBI346wj2SfAR8L_qCX-ABV8aJfJw/s1600/precision_training_parachut.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg47xl9r5XyLpY55MnmVVhLFHS5mbi25jWISeCkp9mj7PDAIAABLMrXNdBhcwDicYfCaHDUlYmvd80msNMTlUL07C6FgnRQeMC9FpfuIGkMlvgEGHBI346wj2SfAR8L_qCX-ABV8aJfJw/s200/precision_training_parachut.jpg" /></a><p>Given that I am going to be talking with the Eucalyptus guys in a week, I figured it was a pretty opportune time to sit down and really think about private clouds. Much to the chagrin of some people, I firmly believe that private clouds are still clouds. In my mind the real question is where do private clouds fit in the cloud ecosystem?</p><p>Let's get one thing straight: private clouds are transitional. What I mean by this is that private clouds are not going to be here forever; they are instead filling a very important gap in the current cloud landscape. While a large number of servers could be moved into the cloud today, there are still quite a few use cases which don't allow for such rapid change.</p><p>To illustrate one such example, let's say you head IT for some Fortune 500 mega-corporation. After a few months of waiting, your purchase order for X racks of servers (finally) goes through. You estimated that with the servers you have ordered you should have sufficient capacity for the next year. 
Shortly thereafter you reassess the opportunities provided by the cloud and deem it fit for use by your company. So enthusiastic about the change, you even decide to join Lew Moorman and Marc Benioff on stage at some cloud event to chant "No more servers!" and pledge never to buy another one. However, upon returning to the reality of work you recall that in addition to the servers you currently have running in your data center, the new racks you just ordered are going to provide you sufficient capacity for the next year. Assuming a 3-5 year lifespan for the average server, you are looking at at least two years before you can move a majority of your servers into the cloud. Enter the private cloud. You already have the resources; there is no reason for them to sit around and gather dust. Make your own private cloud with the existing resources and expand out into the public cloud as necessary. Once the servers in your data center have run their course, toss them and move to the cloud.</p><p>The other big elephant in the room with regards to moving to the public cloud is compliance. These days it seems like every big industry has its own compliance and regulatory constraints that must be met. Whether it's PCI for credit card processing or HIPAA in the health fields, almost none of the big cloud vendors have met the requirements for becoming compliant. In fact, it's not even clear that they are trying. Unfortunately, regulation is not something that can be skirted around; it is a big-time show stopper. This means that companies in industries which have regulatory requirements are going to be in a holding pattern around the public cloud for the foreseeable future. What's the next best thing? That's right, private cloud.</p><p>While private clouds are transitional, they will be around until the aforementioned issues are addressed. For some this may be on the order of months or a year, but for others it will probably be much longer. 
The regulatory issue in particular is not an overnight fix; I think it is going to be a big parachute that keeps the cloud from running at full sprint this year. So while private clouds are transitional, that transition period is looking like it's going to be a long one.</p>Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com0tag:blogger.com,1999:blog-45519206452066173.post-11198385967305078152010-01-05T10:19:00.000-08:002010-01-22T15:58:51.753-08:00Rails named_scope Time Surprise<p>I have written previously about how much I like named_scopes in Rails, and I still do. After using them for some time, though, I got tripped up on an issue which surprised me a bit. I thought it would be a good idea to document it here in case others have the same issue.</p><p>To demonstrate the issue, let's say I have an app that has a User model. On the home page I want to display a list of the users which have signed up in the last hour. This is an excellent use case for a named scope. We can start by creating a named scope called "recent" which will then allow us to simply say "User.recent" to retrieve all the recently created accounts from the database. This seems simple enough, so I went ahead and wrote it up as follows:</p><script src="http://pastie.org/767739.js">
</script><br />
<br />
<p>Now you will notice that I named it recent_bad, and that is because this named scope is BAD! Take a look at the queries generated when I call recent_bad three times. Notice anything wrong? It's subtle. Note how the date after created_at, "2010-01-05 16:55:44", never changes. For effect I made the model acts_as_paranoid so you can see what the timestamp should be. What is happening here? The named_scope is defined at the class level, which means that when the User class is loaded, Time.now.utc is evaluated once and then never again. This is why the time only changes when the server is restarted. To avoid this issue, simply put the conditions within a lambda as follows:</p><script src="http://pastie.org/767740.js">
</script><br />
<br />
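In case the pastie snippets above don't load, the same pitfall can be demonstrated in plain Ruby, no Rails required. The User class and constant names here are stand-ins for the real named_scope code:

```ruby
# Plain-Ruby simulation of the named_scope bug. Anything evaluated
# when the class body runs is computed once and frozen forever;
# wrapping it in a lambda defers evaluation to call time.
class User
  RECENT_BAD = { :conditions => ["created_at > ?", Time.now.utc - 3600] }
  RECENT     = lambda { { :conditions => ["created_at > ?", Time.now.utc - 3600] } }
end

frozen_time = User::RECENT_BAD[:conditions][1]
sleep 1
puts User::RECENT_BAD[:conditions][1] == frozen_time  # true: never re-evaluated
puts User::RECENT.call[:conditions][1] > frozen_time  # true: fresh on every call
```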
<p>Now you will see that the created_at time updates as it should. It's a subtle bug and one that caught me by surprise.</p>Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com2tag:blogger.com,1999:blog-45519206452066173.post-44408629210417571992010-01-03T11:59:00.000-08:002010-01-22T16:00:22.769-08:00Slashdot loves Cloud Computing!<p>One of the reasons I love reading Slashdot is that the inhabitants of its community are unlike any other. This morning I woke up to an article on the front page about <a href="http://games.slashdot.org/story/10/01/03/1319257/VC-Defends-emFarmvilleem-Touts-Virtual-Tractor-Sales">a venture capitalist who was defending the incredibly popular social game Farmville</a>. Being interested in the topic, I thought I would take a gander at the comments; that is when I came across this gem (<a href="http://games.slashdot.org/comments.pl?sid=1496106&cid=30631812">direct link</a>):</p><blockquote> This needs to be the year that those of us with even the slightest degree of technical knowledge take a stand against the goddamn "Cloud".<br />
It sounds fantastic in theory, but once in the real world, Cloud Computing falls flat on its face. My development and ops teams wasted too much time dealing with Cloud providers over the past year. So my resolution this year is to tell anyone who proposes the use of anything Cloud to cram it. We aren't doing it any longer. It's a failed approach.<br />
Just last week, during the holidays, we had to scramble after one of our Cloud providers ran into some hardware problems and couldn't get our service restored in a timely manner. After the outage exceeded my threshold, I called up my best developers and had them put together a locally-hosted solution in a rush, and payed them quite a bit more than usual due to the inconvenient timing. Then I called up the Cloud provider and basically told our rep there that we are done using them and their shitty service. Then I called up the manager in our company who recommended them, and told him to basically go smoke a horse's cock.<br />
</blockquote><p>The commenter was apparently so proud of their work that they decided to post it anonymously. Now keep in mind that the article was about social gaming; it had nothing to do with the cloud. While Farmville does run on Amazon EC2, the article does not mention or discuss that at any point. Regardless, let's take a look at this comment and pretend that it was posted in a reasonable context.</p><p>Perhaps my favorite thing about the comment is the fact that it makes these huge substantive claims yet provides absolutely zero reasoning behind them. For example, <i>"It sounds fantastic in theory, but once in the real world, Cloud Computing falls flat on its face."</i> Really? In what sense? You must mean <a href="http://blog.rightscale.com/2008/04/23/animoto-facebook-scale-up/">Animoto scaling from 40 to 4,000 servers in 3 days</a>. Or how about the <a href="http://cloudenterprise.info/2008/10/03/how-many-google-apps-users-are-there/">millions of people who are using Google Apps</a>? Both of those are certainly real world, and as far as I can tell there was very little falling on faces. In fact, the cloud helped Animoto <b>avoid</b> falling on its face! </p><p>As if the first claim wasn't enough, the end of the second paragraph provides a real doozie. It completely writes off Cloud Computing by stating that <i>"[i]t's a failed approach."</i> Well, I'd better get on the phone and tell all the people who put <a href="http://www.eweek.com/c/a/Cloud-Computing/Amazons-Head-Start-in-the-Cloud-Pays-Off-584083/">64 billion objects into Amazon S3</a>. That might take a little while... </p><p>The last paragraph at least provides a little bit of background on why this particular individual will <i>"tell anyone who proposes the use of anything Cloud to cram it."</i> (Well, if he is going to tell them all to cram it, maybe he can make all those calls to the S3 users for me...) This is where I start to get a little empathetic. Downtime sucks, it really does. 
Customers and providers can both agree that downtime sucks, since everyone loses when it rears its ugly head. My question is this: if this service is so critical, then why wasn't it built to be fault-tolerant? If you are truly concerned about availability then you need to either a) build the service such that it can withstand failure or b) have an SLA in place with the provider. But honestly, who wants to do that? Instead I would recommend following the actions of the commenter, which are to 1) not take the necessary precautionary steps to avoid downtime and then 2) complain when there is downtime. What's next, eating three Big Macs a day and then complaining when you need triple-bypass surgery?</p><p>Lastly, I would like to commend the brave commenter for being brazen enough to tell a manager at his/her company to <i>"basically go smoke a horse's cock."</i> I can see your career blossoming as we speak.</p>Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com0tag:blogger.com,1999:blog-45519206452066173.post-18376465122089146262009-12-13T16:17:00.000-08:002012-05-01T14:17:01.507-07:00Numbers Everyone Should KnowI was looking over a presentation titled <a href="http://www.slideshare.net/xlight/google-designs-lessons-and-advice-from-building-large-distributed-systems">"Designs, Lessons and Advice from Building Large Distributed Systems" by Jeff Dean from Google</a> when I came across a really useful slide. It's the 24th slide in the presentation and is titled "Numbers Everyone Should Know". It lists the latencies of some common processor and network operations:<br />
<br />
<pre>
L1 cache reference..............................0.5ns<br>
Branch mispredict.................................5ns<br>
L2 cache reference................................7ns<br>
Mutex lock/unlock................................25ns<br>
Memory reference................................100ns<br>
Compress 1K bytes with Zippy..................3,000ns<br>
Send 2K bytes over 1Gbps network.............20,000ns<br>
Read 1MB sequentially from memory...........250,000ns<br>
Round trip within datacenter................500,000ns<br>
Disk seek................................10,000,000ns<br>
Read 1MB sequentially from disk..........20,000,000ns<br>
Send packet CA->Netherlands->CA.........150,000,000ns<br>
</pre>
<br />
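Figures like these lend themselves to quick back-of-the-envelope arithmetic. For example, in Ruby:

```ruby
# Latencies from the table above, in nanoseconds.
DATACENTER_ROUND_TRIP = 500_000
DISK_SEEK             = 10_000_000
READ_1MB_MEMORY       = 250_000
READ_1MB_DISK         = 20_000_000

# One disk seek costs as much as 20 round trips within the datacenter...
puts DISK_SEEK / DATACENTER_ROUND_TRIP  # => 20
# ...and sequential disk reads are 80x slower than memory reads.
puts READ_1MB_DISK / READ_1MB_MEMORY    # => 80
```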
Having numbers like these is really useful for quick ballpark estimates. I think the biggest surprise to me was the huge disparity between a datacenter round trip and a disk seek. We all know how slow the disk is, but the fact that you can make 20 round trips within a data center in the time it takes just to make a disk seek (not even reading any data!) was pretty interesting. It reminds me a lot of <a href="http://loci.cs.utk.edu/dsi/netstore99/docs/presentations/keynote/sld023.htm">Jim Gray's famous storage latency picture</a>, which shows that if the registers were how long it takes you to fetch data from your brain, then disk is the equivalent of fetching data from Pluto.Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com0tag:blogger.com,1999:blog-45519206452066173.post-63330113971327694002009-11-25T18:47:00.000-08:002011-11-12T09:10:09.742-08:00Be Wary of the Paranoid<p>I recall distinctly that when first learning about Rails and plugins, the very first one I used was acts_as_paranoid. Something about actually deleting data concerned me, and so I figured adding acts_as_paranoid to some important tables in my application would save a lot of headaches. While it is tremendously useful, it also has a pretty big unintended consequence that I think gets overlooked by most. Let's take an example query generated by acts_as_paranoid.</p><blockquote>>> User.find_by_id(7)<br />
<i>SELECT * FROM `users` WHERE (`users`.`id` = 7) AND (users.deleted_at IS NULL OR users.deleted_at > '2009-11-26 02:17:28')</i><br />
</blockquote><p>Looks harmless, right? What I want to draw attention to is the "OR users.deleted_at" part. In order to ensure that the user isn't deleted, it checks not only that the deleted_at field is NULL, but also that deleted_at is greater than the current time. In reality the IS NULL check is sufficient unless you are setting the deleted_at of some object to be in the future, and I have yet to see anyone actually use it in that way. This is what makes the use of the current time in the query so bad: it is slowing down tons of Rails applications and most people don't even know it.</p><p>One important thing to notice about the MySQL query cache is that it is pretty dumb. Basically it caches the incoming query string exactly as written and then stores its associated result set. This becomes a problem when you use something like the current time in the query string: it functions as a cache-buster every second. So at 0 seconds you make a query and it is stored in the cache, then at 0.5 seconds you make the same query and it is read from cache, then at 1.0 seconds you make the same query but it will miss the cache since the time has increased by a second. This means that anything written to the query cache which uses acts_as_paranoid effectively has a one-second expiration time. That's awful, and all that for the 0.005% of users who want to expire things in the future. Not to mention the fact that it completely pollutes the cache with old data which never gets touched a second after it's written.</p><p>Alright, enough moaning; here's how to fix the problem. Open up paranoid.rb and in the "with_deleted_scope" function rip out "OR #{table_name}.#{deleted_attribute} > ?" along with the current_time variable after it. Similarly, in has_many_through_without_deleted_assocation.rb in the construct_conditions method, delete the same string where it is appended to the conditions variable.
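To make the cache-busting behavior concrete, you can model the query cache as a hash keyed on the literal SQL string, which is essentially what MySQL does. This is a plain-Ruby simulation, not the actual acts_as_paranoid code:

```ruby
# The MySQL query cache keys on the exact SQL text, so a timestamp in
# the WHERE clause produces a brand-new key (a cache miss) every second.
cache = Hash.new { |h, sql| h[sql] = "result set for: #{sql}" }

base = "SELECT * FROM users WHERE (users.id = 7) AND (users.deleted_at IS NULL"
t = Time.utc(2010, 1, 5, 16, 55, 44)

3.times do |i|
  sql = base + " OR users.deleted_at > '#{(t + i).strftime('%Y-%m-%d %H:%M:%S')}')"
  cache[sql]  # each second's timestamp is a different string => a miss
end
cache[base + ")"]  # the plain IS NULL form caches once...
cache[base + ")"]  # ...and keeps hitting

puts cache.size  # => 4: three single-use entries plus one reusable one
```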
Keep in mind that if you are setting deleted_at to values in the future then you don't want to make this change. But for everyone else, enjoy the improvement in your query cache hit rate.</p><p>As a final note, for the tables which you have made paranoid you probably also want to consider adding an index which includes the deleted_at field, since it will be a condition of every SQL query on that table.</p><p><i>Updated: There is a fork of acts_as_paranoid courtesy of mikelovesrobots that provides the fixes I talked about previously; it is <a href="http://github.com/mikelovesrobots/acts_as_paranoid">available here</a>. I'm gonna switch out my versions of acts_as_paranoid for this one; I'd suggest you do the same.</i></p>Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com0tag:blogger.com,1999:blog-45519206452066173.post-23760947145072326442009-11-24T22:29:00.000-08:002009-11-24T22:33:09.907-08:00Lacking in Persistence<p>A few weeks back was the very first incarnation of Cloud Camp in the Cloud. While I was unable to attend, I did get around to watching the screencast of it courtesy of @ruv (<a href="http://www.elasticvapor.com/2009/10/cloudcamp-in-cloud-recap-video.html">available here</a>). There was certainly a bunch of good information and intelligent discussion, but I found one question to be particularly interesting and insightful. The astute attendee asked "Why was Amazon EC2 designed such that instances have transient (ephemeral) storage? Rackspace has been pushing their marketing on the fact that their servers have persistent storage; is this a big deal?" That is certainly a loaded question, but I think I'm in a reasonable position to take a shot at it.</p><p>Let's start by talking about some of the reasoning behind why Amazon would make their instances transient. The first reason is simple: making instances transient makes life a lot easier for them, particularly given their scale of operation. 
If they wanted to make instances persistent they would need to replicate that data at least twice, if not three times, with at least one replica in another data center. Imagine the amount of traffic that would be needed to keep a write-heavy database server consistent across multiple disks. Also, the fact that all the data to that last replica has to ship over the intertubes means say hello to Mr. Latency. Ever notice that Amazon's EBS volumes are replicated only within a single availability zone? I would be willing to bet this is because of network traffic and latency concerns.</p><p>One thing you will notice about a lot of the cloud providers that provide persistent instances (e.g. <a href="http://vcloudexpress.terremark.com/pricing.aspx" id="icf9" title="VCloud Express">vCloud Express</a>, <a href="http://www.reliacloud.com/pricing/" id="kodg" title="ReliaCloud">ReliaCloud</a>) is that they break that nice little "pay-per-use" model everyone is so fond of. In order to maintain those persistent instances, most providers charge you even if they are "powered off". ReliaCloud instances are roughly half price when powered off, while vCloud Express charges the full price. Rackspace has a somewhat different approach. First, they RAID 10 the disk, which should make failure less likely. In addition, they claim that if a failure occurs they will automatically relaunch your instance for you, complete with data. How long does that fail-over take? They don't say, and I have a hard time believing you will get a guarantee from them.</p><p>Taking a step back, what you will find is that Rackspace is really the midpoint between EC2 and vCloud in terms of persistence. On EC2, if an instance fails it is gone along with your data (unless you're using EBS). On vCloud, if your instance fails or you power it off it still persists. Rackspace falls in between in the sense that if your instance fails it will come back (with some delay), but if you shut off your server it's gone along with its data. 
Thus, the only real way to make your data persistent on Rackspace is to keep the server running (at full price), or dump it into CloudFiles. This points out one of the really nice benefits of EBS, which is that you can have persistent data without needing an instance to store it on (read: cheaper). But why would you want to store data without an instance attached to it, you might ask? It's simple: there is probably some portion of data that you would like to keep persistent (e.g. a database), and while you could dump it onto CloudFiles/S3, reading that data back onto a newly launched instance can take a loooong time. This is what it was like on EC2 pre-EBS, and it wasn't pretty.</p><p>Now for the trickier part of the question, which is whether this persistence makes Rackspace a more attractive cloud infrastructure service. Having machines come back after failing is certainly a nice feature, but you still need to have a second server for fault-tolerance if you want to attain reasonably high availability. While the fail-over may take some time, it is almost certainly faster than the time it would take you to figure out that your EC2 instance has died and hustle over to your laptop to fire up a replacement. The bottom line is that it is probably not available enough for any reasonably sized service to rely on without a proper backup. On the other hand, if you are hosting something like your blog, where a few minutes of downtime isn't critical, then it can be a handy feature.</p><p>On paper it certainly looks nice, but in my opinion it's not really a huge benefit. In all the time I have used EC2 I have only ever seen a few instances fail, and that was after I had accidentally set the Java heap size to be 10x the available memory and then ran a big Hadoop job. The machine was thrashing so hard I'm not surprised it died. Aside from that (very extreme and operator-induced) case I have never seen an instance fail. 
In my experience EC2 instances simply don't fail frequently enough for this to be a big deal to me.</p><p>To be honest I think the fact that Rackspace allows you to "grow" your instance size is a much more attractive feature, but that's for another post...</p><br />
<p>For an interesting and thoughtful comparison of EC2 and Rackspace I would take a look at <a href="http://blog.schicks.net/2009/rackspace-vs-amazon/">this blog post</a>. He does a good job hitting on many of the important points and even agrees with my thoughts on the benefits of EBS.</p>Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com1tag:blogger.com,1999:blog-45519206452066173.post-37212707561785558352009-11-14T14:21:00.000-08:002009-11-14T14:21:03.677-08:00Boy and Girl Birth Rate Interview QuestionI came across an interview question which is pretty interesting and a bit of a head fake (in my opinion). The question is...<br />
<blockquote>"In a country in which people only want boys, every family continues to have children until they have a boy. if they have a girl, they have another child. if they have a boy, they stop. what is the proportion of boys to girls in the country?" --courtesy of <a href="http://discuss.fogcreek.com/techInterview/default.asp?cmd=show&ixPost=150">fog creek software forums</a><br />
</blockquote>After thinking about it for a little while I though I had the answer, but of course I was wrong. After reading the right answer I still couldn't quite convince myself that it was true, so I figured I'd test it out. <br />
<br />
<script src="http://pastie.org/699048.js">
</script><br />
<br />
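In case the embedded snippet doesn't load, here is a quick Monte Carlo sketch of the setup (seeded for repeatability); it lands right around a 1:1 proportion:

```ruby
# Every family has children until the first boy, then stops.
# Count the boys and girls produced across many families.
def boy_girl_ratio(families, rng = Random.new(42))
  boys = girls = 0
  families.times do
    loop do
      if rng.rand(2) == 1  # boy: the family stops
        boys += 1
        break
      else                 # girl: the family tries again
        girls += 1
      end
    end
  end
  boys.to_f / girls
end

puts boy_girl_ratio(1_000_000)  # hovers right around 1.0
```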
Sure enough the answer generated matches up with the explanation provided in the link above. So there you have it...Jonathan Kupfermanhttp://www.blogger.com/profile/11372052464032359994noreply@blogger.com0