2011 PaaS Predictions: Winners and Losers

This past year has been an interesting one in the cloud space, and particularly for Platform-as-a-Service. While PaaS has certainly increased in popularity over the past year, I think it is still at the bottom of what is going to be a hockey-stick-like growth curve. That obviously means that 2011 is going to be an even bigger year for PaaS. Given that, I wanted to put down my predictions for what I think is going to happen in the space over the next year.

Heroku

Heroku’s $212 million acquisition by Salesforce was probably the biggest news of the year in the platform space. Not only is it a huge win for one of the most popular platforms, it is also validation that the space is going to be huge. However, any time a small, agile startup gets bought up by a giant corporation there are obvious concerns that it will lose its agility and get out of touch with the customers who made it successful. Given what I’ve seen from the Heroku guys so far, I think they have a strong enough vision and understanding of their customers that they will be able to fend off the temptation of Salesforce execs to meddle with their newly acquired toy.

I also suspect that we will see Heroku start getting a little more enterprisey in their offering over the next year. Unfortunately I don’t think they will see a lot of uptake since most enterprises move slower than molasses on a cold day.
Overall: OK

Google AppEngine

In addition to having incredibly liberal quotas, AppEngine has long been one of the most feature-rich platforms available. Yet, there have been two main issues that really held back what would otherwise have been incredible growth: performance and the datastore.

One of the most common complaints I have heard from developers using AppEngine is poor performance. This issue came to a boiling point earlier in the year when datastore latencies got so bad that Google stopped charging for Datastore CPU usage. The datastore issues were subsequently resolved and Google was able to deliver a substantially faster datastore API. Following this issue, I think over the next year Google will start drilling down on other performance issues and deliver a much more performant and stable platform.

The other main point of friction with the datastore has been the fact that it relies on a NoSQL-like data model. While NoSQL is all the hotness right now, the vast majority of developers are still not comfortable using non-relational data models. That is why the announcement of a SQL database as part of AppEngine for Business was met with much celebration. While it is still in private beta, I suspect that we will see it publicly released this year, which will bring in a new wave of customers ready to take advantage of the large quotas and familiar SQL tools.
Overall: Very good

Djangy

While AppEngine is a great platform for Python applications, there has been pent-up demand for a Python/Django version of Heroku. New startup Djangy looks to be the first viable attempt at filling that gap. Looking at their documentation, they seem to have taken the “Heroku for Python” mantra very seriously, and that is a good thing. While Djangy is currently in private beta, I’m guessing that they will be gearing up for a public launch in the next 3-6 months. After that I think we will see quite a few Pythonistas migrate over to Djangy so they can get back to using the tools they are familiar with (full Django + SQL).
Overall: Very good

DotCloud

DotCloud is a new Y Combinator startup which aims to bring more flexibility to the platform space. Rather than a platform being married to a single language and database technology, they aim to allow you to mix and match many languages (Ruby, PHP, Java, JavaScript) and database technologies (MySQL, PostgreSQL, Redis, MongoDB). It is the choose-your-own-adventure version of cloud platforms. Unfortunately I think using DotCloud will be just that, an adventure, and not the good kind.

As I discussed in my previous blog post, fragmentation is a huge issue for platform providers. I was discussing fragmentation within the context of a single language, but the problem absolutely explodes when you start mixing in multiple languages with multiple databases. Consider the fact that even Google, a company with a tremendous amount of technical resources, only supports two languages and one database. While their vision is a great one, I think they are going to run into some issues. In particular, they are going to reach a very unfortunate fork in the road. One path is to provide only barebones support for each language/database and risk frustrating customers with the lack of options. The alternative path is to provide more in-depth support for each component and live with the burden of regression testing the myriad of possible combinations, destroying their agility. Frustration seems inevitable either way.
Overall: Poor

Platform as a Service (PaaS) Fragmentation

You will find no shortage of cloud pundits who will tell you that the Platform layer is the future of the cloud stack. The thought is that in the future we will be able to forget about the menial tasks of managing servers and just worry about developing applications. While a pleasant vision, let’s take a walk through the cold, harsh reality.

In the PaaS space there is a big elephant in the room, and it goes by the name of fragmentation. Stop and think about it for a second: how many popular programming languages are there? How many popular frameworks does each one of those languages have? That's not to mention the hell that is versioning of both of those things.

Let’s look at a quick example to illustrate this point. You’ve decided you want to create a platform, so the first question is what language you want your platform to support. After considering the options you decide on your language of choice. Depending on where you are in that language's release cycle, you are probably looking at a decent split between the current version, the previous version, and the experimental new version. Unless you get lucky in terms of timing or are particularly aggressive in deprecating versions, you are probably going to want to support all three. We see this exact trend in both the Ruby and Python worlds. Note that I haven’t even considered any alternative runtime implementations (e.g. Ruby Enterprise Edition, Python Unladen Swallow). Next is the choice of web framework(s). Almost every popular web programming language has a myriad of web frameworks to choose from. You have the most popular framework (Rails, Django), the minimal framework (Sinatra, web.py), and a whole bunch of other long-tail frameworks. Depending on the language you could easily be looking at 3-5 viable contenders for frameworks, not to mention versioning. I could keep going, but I think you get the point.

This ties into a recent Twitter conversation where @georgevhulme posed the question:

“Who will win the Paas battle next year, and become the dominate platform?”

In my opinion, the answer to this question is none. No single company will become the dominant platform in the next year, or even next five years for that matter. The PaaS space is simply too fragmented for any one company to own a substantial portion.

Better Cache Busters (aka Asset Timestamps)

If you read any article on web performance it will almost undoubtedly mention Expires headers. They effectively tell the browser not to bother asking the server whether an asset (e.g. an image or stylesheet) has changed until the expiration time. Ideally they should be set many years in the future so that client browsers aren’t spending tons of time waiting for 304 Not Modified responses. The problem with Expires headers is that when the file changes you need some way to signal the browser to fetch the updated file. The general trick is to change something in the URL which makes the browser think it is a new file and hence fetch it from the server. This is often called a cache buster. That is why when you pop open Firebug you will often see files named puppies_3123141.png or styles.css?12345678.
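
On Apache, for example, far-future Expires headers are only a few lines of mod_expires configuration (a rough sketch, assuming mod_expires is enabled; adjust the types and lifetimes to taste):

    <IfModule mod_expires.c>
      ExpiresActive On
      ExpiresByType text/css "access plus 1 year"
      ExpiresByType application/javascript "access plus 1 year"
      ExpiresByType image/png "access plus 1 year"
    </IfModule>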

Now that we know something needs to be added to the URL, the question becomes what should we add? There are three properties that we are really interested in:

1. The value should (only) change when the contents of the file change
2. The value should be consistent across different machines
3. It should be fast to compute

The first possibility is to take the approach used by Ruby on Rails, which is to use the modified time of the file. While this satisfies property 3, it does not work for 1 or 2. It breaks property 1 because appending an empty string to a file causes its modified time to update without the actual contents of the file changing. Property 2 is a much larger issue which many people have to face. Static assets are generally served from multiple machines, which means the modified time needs to be consistent across all of them. This is very difficult to achieve, particularly when the files are under version control. While modified time isn’t a great solution, it does bring us to another possibility: version control ids.

One of the great things about version control systems is that in order to give you a reference to a particular commit, unique ids have to be generated for each one. Given that is the case, we can simply use the current commit id (or hash for you git’ers) as the cache buster. We are currently on commit X, I update a file and commit it, and we are now on commit Y. Since all machines will be updating their files from version control, everyone should be on the same commit id. In terms of speed, it’s not necessarily the fastest (especially if you are using SVN), but it only needs to be retrieved once since it is the same for all files. The problem with commit ids is that they violate property 1. When a single file changes in the repo, every file’s id changes. That means that every time you deploy new code for your webapp, each client is going to fetch all the files again, even if all you changed was a README. Getting better, but still leaving something to be desired.

The last possibility I am going to talk about is an oldie but goodie: the MD5 hash. The cache buster of an asset is simply the MD5 hash of the asset itself. It satisfies property 1, and as an added bonus, if the file is ever changed and then rolled back, the MD5 hash will roll back with it (git reset anyone?). Property 2 is no problem: the contents of the file are the same across the different machines, hence so is the MD5. The only thing left is property 3, speed. Clearly computing the MD5 is going to be more time consuming than fetching the modified time. However, just about every language has a standard hash library written in C for computing MD5s, and it's pretty darn fast. The only place I could see this being an issue is if you have very large files or an extremely large number of them. Even still, you can just write a deploy script that precomputes all the hashes beforehand.
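
The idea is simple enough to sketch in a few lines of Python (the asset path and URL format are purely illustrative, and this is not the Rails patch linked below):

    import hashlib
    import os

    def cache_busted_url(path):
        # fingerprint the asset by the MD5 of its contents
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        return "/%s?%s" % (os.path.basename(path), digest)

    # e.g. "/styles.css?<32-character hex digest>"; the digest only changes
    # when the file's contents change, no matter which machine computes it
    print(cache_busted_url("public/styles.css"))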

Overall, using MD5s as a cache buster gives you all of the nice properties you could want with very little drawback. I went ahead and wrote a monkey patch for Rails that changes the asset id method to use MD5s; the source code is available here (http://pastie.org/1164279). Enjoy busting caches.

Twitter is a bar and other Social Network metaphors

I have spoken with a few people recently about social networks and how they compare. In explaining it, I’ve settled on a few metaphors that most people seem to understand pretty intuitively. Here goes...


Twitter is a bar

Twitter is your neighborhood Irish bar on a Friday night. It’s busy, noisy, and there are tons of conversations happening simultaneously. This makes Twitter a great place for short and casual conversations, particularly since there are a bunch of people around to talk to. With so many conversations happening within earshot, it’s easy to catch something of interest and hop into another conversation at a moment’s notice.

However, the raucousness makes Twitter a difficult place to have an extended conversation. It’s certainly not the place to (at least easily) discuss the intricacies of foreign policy or War and Peace. The people who try to engage in these conversations are generally frustrated quickly and leave for other establishments.



Facebook is a coffee shop

Unlike the raucous and noisy crowd of Twitter, Facebook presents a much quieter and more civilized environment, akin to that of your local coffee shop. Plush and comfortable surroundings make Facebook a nicer environment for conversing with friends. Conversations can flow freely, but be careful what you say since your parents or kindergarten teacher may walk in at any moment. While some people are there to chat with friends, others are simply hanging out and taking advantage of the free amenities (e.g. photos).



LinkedIn is a library

Last but not least, LinkedIn is your local library. Walking into the library, the silence is almost deafening. The whispers of others serve as reminders that a few people occasionally visit this establishment. The library’s biggest asset is that it’s filled with valuable and accurate historical information. Yet, looking at this information you get the sense that it’s stale and rarely updated. While you may not be able to find out the latest news trends, if you want to know where your competitor’s CEO went to college, this would be a good place to look.



So where do you want to hang out?

Killing Multithreaded Python Programs with Ctrl-C

If you have ever done multithreaded programming in Python you have probably found it frustrating that you can't simply hit Ctrl-C in the terminal and have it exit like a normal Python process. Instead you have to suspend the process (Ctrl-Z) and then either "kill %%" or kill the PID. The good news is that it doesn't have to be this way. After experimenting a bit I finally figured out why it doesn't work normally and what you have to do to make it work.

Normally when I write a threaded program in Python it looks something like this...
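
Roughly this shape, sketched minimally with a bare-bones Worker thread class:

    import threading
    import time

    class Worker(threading.Thread):
        def run(self):
            # stand-in for whatever long-running work each thread does
            while True:
                time.sleep(1)

    def main():
        threads = []
        for i in range(4):
            t = Worker()
            t.start()
            threads.append(t)

        # join() blocks until each thread finishes, which here is never
        for t in threads:
            t.join()

    if __name__ == "__main__":
        main()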


The problem with this program is that if you hit Ctrl-C it doesn't do anything. The reason is that join() is a blocking operation. As a result the process will only receive the signal for Ctrl-C when join() becomes unblocked, which in this case will never happen.

In order to handle Ctrl-C with multiple threads you can use the following code:
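
A sketch of that approach (the Worker class and kill_received flag are described below; the exact details may vary):

    import threading
    import time

    class Worker(threading.Thread):
        def __init__(self):
            threading.Thread.__init__(self)
            self.kill_received = False

        def run(self):
            while not self.kill_received:
                # do a small chunk of work, then check the flag again
                time.sleep(1)

    def main():
        threads = []
        for i in range(4):
            t = Worker()
            t.start()
            threads.append(t)

        while threads:
            try:
                # keep only live threads (isAlive() as used here; newer Pythons
                # spell it is_alive()), then join with a timeout so the main
                # thread wakes up regularly and can receive KeyboardInterrupt
                threads = [t for t in threads if t is not None and t.isAlive()]
                for t in threads:
                    t.join(1)
            except KeyboardInterrupt:
                # tell every child thread to finish up and return
                for t in threads:
                    t.kill_received = True

    if __name__ == "__main__":
        main()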


This code does a few things differently in order to make it handle Ctrl-C as we would like. First, instead of using join() we use join(timeout), which tries to join a thread but gives up if the join does not complete before the timeout elapses. This allows the main thread of execution to continue doing other things, in particular waiting for a KeyboardInterrupt to be thrown, which is what Ctrl-C raises. Since join will return upon timeout, we need to keep only the threads which aren't None and are still alive according to isAlive().

The next thing is that if child threads never return or take a really long time, you need a way to notify the child that it should die. This is accomplished by the kill_received flag in the Worker class. When that flag is set by the main thread, the child knows that it should finish up what it is doing and return.

The last thing is something that caught me off guard a bit. Initially, in the main() while loop I was trying to catch all exceptions that came up by using "try...except Exception:". As it turns out, Exception does not include KeyboardInterrupt, meaning that the KeyboardInterrupt raised by Ctrl-C in that block will not be caught. If you instead use "try...except KeyboardInterrupt:" or just a bare "try...except:" it will work as you expect it to.

So there you have it...how to exit multithreaded Python programs using Ctrl-C.

Using Memcached as a Distributed Locking Service

One of the beauties of memcached is that while its interface is incredibly simple, it is so robust and flexible that it can do nearly anything.

As an example, I am currently working on a project where we need to do distributed locking. The reason for this is that we are running into what is generally known as the lost update problem. Basically, we need to read an object, update it, and write it back in a serialized fashion in order to ensure that no updates are lost. The easiest solution is to have a lock which must be acquired in order to do the read/update/write operations. Locks are great within a set of threads, but once you break out of the context of a single machine you need something distributed. While there are certainly other libraries or more complex pieces of code that do this, I find this to be a pretty elegant solution for a set of nodes which have access to the same memcached servers.

Provided below is a class called MemcacheMutex which provides the standard acquire/release mutex interface, but with the twist that it is backed by memcached.
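
A minimal sketch of such a class (using the python-memcached client; the key prefix and polling interval are just illustrative):

    import time
    import memcache  # the python-memcached client library

    class MemcacheMutex(object):
        """A mutex backed by memcached: add() only succeeds if the key does
        not already exist, so whoever adds the key first holds the lock."""

        def __init__(self, name, servers=None):
            self.key = 'lock:%s' % name
            self.client = memcache.Client(servers or ['127.0.0.1:11211'])

        def acquire(self, poll_interval=0.1):
            # spin until our add() wins the race for the key
            while not self.client.add(self.key, 1):
                time.sleep(poll_interval)

        def release(self):
            self.client.delete(self.key)

    # usage: serialize the read/update/write cycle across machines
    lock = MemcacheMutex('account:42')
    lock.acquire()
    try:
        pass  # read the object, update it, write it back
    finally:
        lock.release()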


There is one big caveat. Since memcached is not persistent, it is possible that the lock could get evicted from the cache. This could result in a case where P1 acquires the lock, the key gets evicted, and then P2 is able to acquire the lock even though it has not been released by P1. If your memcached is doing tons of operations or you are holding onto the lock for really long periods of time, this could become an issue. In that case, you should use a persistent version of memcached, such as MemcacheDB.

The Cloud Support Conundrum

Two nights ago I read an interesting tweet by @jclouds (Adrian Cole) that I think deserves some discussion. It is shown below:

rightscale might as well be a division of amazon; not much #cloud diversity in their offerings

For the record, RightScale supports Rackspace, GoGrid, and Eucalyptus in addition to AWS. However, I think there is a bigger issue at hand: how many clouds should cloud management platforms really support?

I'm going to take the opposite stance on this issue. I actually think it is quite impressive that cloud management platforms in general support as many clouds as they do. From a financial perspective you only really need to support one cloud (AWS), since it dominates usage by far. However, I think most people agree that putting all of your eggs in one basket is foolhardy. So the next logical move is to support the next biggest player in the market, Rackspace. That is why support for AWS and Rackspace is ubiquitous across all cloud management services (RightScale, enStratus, Cloudkick). At that point, even if one of your legs gets kicked out from under you, you still have another one to stand on.

Aside from AWS and Rackspace, there is a smattering of "other" clouds for which adoption and usage have been small. Case in point: in addition to AWS and Rackspace, the three aforementioned cloud management services support a total of nine additional clouds. Yet, the only overlap across those nine clouds is RightScale and Cloudkick both supporting GoGrid. The rest are unique to a particular service.

While I would like to think that the divergent opinions in cloud support are a result of too many awesome clouds being available, that is simply not the case. Instead, these platforms have the unpleasant task of trying to find clouds that are sufficiently feature-rich to support actual usage. While most clouds seem to be improving at providing the core cloud features, this process can be slow and leave much to be desired.

The alternative is customers asking management platforms to support some of the other clouds. While this can and probably does happen, I can't imagine it happens in sufficient quantity (or with sufficient $) to make it a worthwhile venture. If you ask customers, they would love it if you added support for <some obscure cloud only they use> or, even better, <some cloud their son is building as a science fair project>, but let's be realistic: it's not gonna happen.

With that being said, I do think there will be more diversity in cloud offerings, just not as quickly as most cloud folks want it to happen. At this point the focus should be on getting people onto the cloud, not onto a particular cloud.

Unknown Substitution Cipher Interview Question

You are given a file containing a list of strings (one per line). The strings are sorted and then encrypted using an unknown substitution cipher (e.g. a -> c, b -> r, c -> d). How do you determine what the mapping is for the substitution cipher? The unencrypted strings can be in any language.

After a few questions the interviewer clarified and stated that the cipher was a simple substitution cipher where every letter in the alphabet is mapped to one and only one other character in the alphabet.

Given this problem the simplest solution (not counting brute force) would be to do frequency analysis. If the strings contain words or phrases in a popular language there is a reasonable shot that frequency analysis would at least get you started figuring out the cipher. After discussing this and how it would work the interviewer told me "you do frequency analysis and cannot draw any conclusions, what do you do then?"

Ah, back to the (metaphorical) drawing board. The next big clue is the fact that the strings are sorted. This type of interview question is generally worded very specifically, so when what seems like an extraneous detail is "thrown in" it probably means something. In this case the fact that the strings are sorted tells you something: in particular, the relative ordering of the encrypted letters. For example, given the following strings (which are sorted in ascending order)

ggcaa
ggpqr
grfzu

you can start to learn about the ordering of the encrypted letters. In the example above we know that c < p from the first two strings and g < r from the last two. We can only use the first letter that differs between any adjacent pair of strings; everything after it is inconsequential. Doing this over the entire file you would get a set of relationships (e.g. c < p, g < r, q < z, z < t). The next questions are then "what can you do with this data and what is the most appropriate method for storing it?"
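
Extracting those relationships is the easy part; a quick Python sketch of the idea:

    def extract_orderings(words):
        """For each adjacent pair of sorted strings, the first differing
        character gives one ordering relationship between cipher letters."""
        pairs = []
        for a, b in zip(words, words[1:]):
            for ca, cb in zip(a, b):
                if ca != cb:
                    pairs.append((ca, cb))  # ca sorts before cb
                    break
        return pairs

    print(extract_orderings(["ggcaa", "ggpqr", "grfzu"]))  # [('c', 'p'), ('g', 'r')]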

This is where things get tricky. After going through some of the common data structures it seemed like a tree, in particular a binary tree, might be the right fit. The problem with a binary tree is that you don't necessarily have all the required information to create the tree. For example, say the root of the tree was the letter Z and it had a left child Q. Now say you are trying to insert the letter G. You may know that G < Z so it needs to be in its left sub-tree, however you may not know what the relationship is between G and Q. In going through the list of strings, you are not guaranteed to have all relationships between all letters.

Without all the relationships you are forced to fall back to a graph structure. Each letter becomes a node in the graph and each ordering relationship becomes a directed edge in the graph (g < y would have an arrow pointing from node g to node y). With the graph populated with all the edges, now what do you do? The interviewer assured me that each node is part of the graph (i.e. there are no partitions).

This leads to one of the tricky parts about graphs. Since all nodes are equal, where do you start? There is no "root" node like there is in a tree. In this case you want to start with a node that has no incoming edges; this is the smallest letter. Such a node has to exist since there cannot be cycles in the graph. You can then delete that node from the graph and repeat the process until there are no more nodes left.
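
A Python sketch of that repeated-removal loop (the pairs below are illustrative, not from a real cipher):

    def recover_order(pairs):
        """Repeatedly pull out a letter with no incoming edges (nothing is
        known to be smaller than it); that must be the next smallest letter."""
        edges = set(pairs)  # (smaller, larger) directed edges
        nodes = set()
        for a, b in pairs:
            nodes.update([a, b])

        order = []
        while nodes:
            # linear scan for a node that never appears on the "larger" side
            smallest = next(n for n in nodes if not any(b == n for _, b in edges))
            order.append(smallest)
            nodes.remove(smallest)
            edges = set((a, b) for a, b in edges if a != smallest)
        return order

    print(recover_order([('c', 'p'), ('p', 'g'), ('g', 'r')]))  # ['c', 'p', 'g', 'r']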

Now for the fun part: what is the complexity of this algorithm? Finding the smallest node in the graph takes N steps for the first node, N-1 for the second node, and so on, which leaves you with N+(N-1)+(N-2)+...+1. This can also be written as the summation of i from 1 to N, which equals (N*(N+1))/2 (see here if you don't believe me). This means that the algorithm is O(n^2).


Enjoy.

NoSQL: The Paradox of Choice

I have been watching a lot of TED talks recently and while there are a ton of excellent talks, the one I want to talk about today is called "The Paradox of Choice" by Barry Schwartz (he has a book by the same title). While watching his talk it struck me how relevant the topic is to NoSQL databases. I want to talk about these connections and their implications.

More != Better

I want to start with an idea that many people know intuitively but that is still worth calling out explicitly. Having more options is not necessarily a good thing; in fact, in many cases it's a bad thing. In his talk Schwartz cites a study which showed that for every additional 10 mutual funds offered by an employer, participation went down by 2%. As the number of options increases, the amount of time and effort required to make a decision increases significantly, to the point where you are put into a state of paralysis by the myriad of options. For all intents and purposes the birth of the NoSQL database came with the publication of the BigTable paper by Google and the Dynamo paper by Amazon circa 2006. Since then, the number of NoSQL databases has gone from two, to five, to 29. That's right, http://nosql-database.org/ currently lists 29 different NoSQL databases, each of which has a slightly different feature set and benefit/drawback trade-offs. Good luck picking the right one.

More Options => Higher Expectations

When many options exist, it is only natural to expect that one of them has to have the features you are looking for. With only one option it doesn't matter whether it has the features you want, since you don't have a choice. Prior to the NoSQL movement there was only one game in town and it was the relational database (MySQL/PostgreSQL/MSSQL), so you had no choice but to grin and bear it. However, now that there are almost 30 options (there probably will be by the time I'm done with this post), one of them has to have the right mix of peanut butter and chocolate to fit my tastebuds. Unfortunately this rarely turns out to be the case.

Higher expectations apply not only to features but also to performance. The promise of infinite scalability will draw a lot of eyes and ears, but in order to win them over you have to show users tangible benefits. I can't count how many blog posts I have read about people/companies who are using MySQL as a key-value store because it is faster than the key-value stores themselves. In order for NoSQL databases to win, users need to be able to benchmark the database and be impressed by its performance.

What have we learned from this? For all the hype it is getting, NoSQL is in a rough spot. The number of options is large and there aren't clear winners in any category. A few of the databases seem to be rising to the top (Cassandra, for instance), but until enough people get behind a relatively small number of databases (I say pick 3), the knowledge, tool support, and amount of helpful information available online are going to be painfully low.

Peering In The Clouds

An idea that I have been kicking around in my head for some time now is that of cloud peering. The idea comes from the networking world, where big ISPs negotiate peering agreements that allow customers on the two networks to communicate with each other. While the big ISPs are cut-throat competitors, they understand that if they do not work together the Internet would turn into large islands where customers can only communicate with others on their island.

Here is what Wikipedia has to say about the benefits of peering:

Peering involves two networks coming together to exchange traffic with each other freely, and for mutual benefit. This 'mutual benefit' is most often the motivation behind peering, which is often described solely by "reduced costs for transit services". Other less tangible motivations can include:
  • Increased redundancy (by reducing dependence on one or more transit providers).
  • Increased capacity for extremely large amounts of traffic (distributing traffic across many networks).
  • Increased routing control over your traffic.
  • Improved performance (attempting to bypass potential bottlenecks with a "direct" path).
  • Improved perception of your network (being able to claim a "higher tier").
  • Ease of requesting for emergency aid (from friendly peers).

Are those not things that every cloud infrastructure provider would love to have more of? More redundancy/capacity/performance seems compelling to me.

The way I would envision such a system working would be for two infrastructure providers to agree to provide free bandwidth between their clouds. To see why this is important, think about the cloud today. When trying to transfer data between clouds, what ends up happening is you get charged going in both directions (i.e. out of one cloud, into the other). Say, for example, you were running an application in the Rackspace cloud but doing nightly DB backups to S3 for redundancy: you would end up paying 22 cents/gig to Rackspace for the outbound traffic and another 15 cents/gig for traffic into S3. At 37 cents/gig you could quickly start racking up a nice bill just for bandwidth. Cloud peering could alleviate a lot of this cost while attracting more customers to the cloud by reducing concerns over lock-in and outages.

This strategy seems particularly attractive for some of the smaller players which are looking to grow market share. It will be interesting to see if anything like this develops in the future.

Designers Guide to Web Performance

I have found myself reading quite a bit about usability and web design recently. While I have learned a lot about design, I have also learned that a large portion of the design community is not terribly familiar with how to improve the performance of the websites they are creating. While a pixel-perfect layout is a beautiful thing, no one wants to wait 20 seconds for it to load. Since the web design community has taught me so much, I wanted to give back by writing this post on some simple techniques for improving site performance.

Welcome to Web Performance 101; here is your first assignment:




The above is a waterfall chart which shows how long the pieces of a particular website take to load. To be nice I've blanked out the website URLs, but I left the file extensions since they will come in handy later. What this chart shows us is that initially the browser made a request for the index ("/") page, and 1342 ms later it had received all of the HTML for that page. One second in, the browser makes a request for the first JavaScript file, which takes 585 ms to load. So on and so forth for about five seconds. While there was more to the chart, I cut it off for brevity.

Progressive Rendering

You have probably noticed that oftentimes your browser starts drawing the website before the entire thing has loaded; this is called progressive rendering. This is really nice because, done correctly, it can make your website "feel" much faster to end users. In order for the browser to start drawing the page it needs two things: the HTML and the CSS. Until the HTML and CSS are downloaded, your users are going to be staring at a blank white screen. This is why it is really important to put CSS in the head of the document, before JavaScript and images. In the chart above there is a vertical green line at about 2.25 seconds; that is when the browser could start rendering the page. Had the designer of the site put the CSS before those five JavaScript files, the page could have started rendering at 1.5 seconds or even earlier. It's such a simple and easy change that there really is no reason not to make it.

Javascript: The Performance Killer

While every website is going to have a waterfall pattern, the goal is to have a very steep waterfall. The further the bars extend out to the right of the graph, the longer the site takes to load. Hence, we want to shoot for a very steep and short waterfall, which should translate into a fast site. Unfortunately this chart looks more like stairs than a waterfall, and the reason is JavaScript.

While it may not be obvious to you, browsers are pretty smart. For example, when fetching the images on a website most browsers can download more than one at a time. Older browsers (IE7, Firefox 2) usually download two items at a time and newer browsers (IE8, Firefox 3) download 6 or more at a time. Notice how there is overlap between items 8 & 9 as well as 10 & 11; those items were being downloaded in parallel. Then notice how there is no overlap between items 2-7. This is because, unlike images or CSS, JavaScript blocks the browser from downloading anything else. While newer browsers tend to do a better job at this and download scripts in parallel, most of the world is still running IE7, which is what was used to generate the above chart.

One way to improve this would be to take all the separate JavaScript files and combine them into a single larger file. This will reduce a lot of the overhead incurred by downloading many files. Note how most of each JavaScript file's bar is actually green; that is "time to first byte." It is the time between when your browser says "Hey, give me foo.js" and when it finally receives the first byte of that file. You can barely even see the blue lines at the end, which represent the time spent downloading the actual data. By combining those JavaScript files together you cut out most of the time spent waiting for data to come back. There are many tools online and available for download which will automatically concatenate the files for you.
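
At its simplest, concatenation is just gluing the files together in order; a rough Python sketch (the file names are hypothetical, and real tools usually minify as well):

    # combine several JavaScript files into one to cut down on requests
    paths = ["jquery.js", "plugins.js", "app.js"]  # hypothetical file names

    with open("combined.js", "w") as bundle:
        for path in paths:
            with open(path) as f:
                bundle.write(f.read())
                bundle.write(";\n")  # guard against a file missing its trailing semicolon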

Zipping it Up

The easiest way to ensure that your site loads fast is to send less data. Less data means less time required to download it, which means better performance. This is where data compression comes in. Most big web servers provide the ability to compress the data they send out into a much smaller format so that it can be transferred more quickly. Once it reaches the browser, the browser is smart enough to know that the data is compressed and will uncompress it and read it just the same way it normally would. The most popular compression scheme used today is called GZIP, and it can often cut file size to a half or even a third. It is important to note that this compression is only effective on text data, so you want to gzip your HTML, CSS, and JavaScript files. You don't want to gzip images, since doing so usually eats up more server CPU than it's worth. The best thing about gzip is that it only requires a few extra lines in the server config file and then you're done. This website has a good guide for how to enable it for Apache.
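
For Apache, for example, the relevant config is roughly this (a sketch assuming mod_deflate is enabled; compress only the text types):

    <IfModule mod_deflate.c>
      AddOutputFilterByType DEFLATE text/html text/css text/javascript application/javascript
    </IfModule>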

As I mentioned above, it is often a good idea to combine all of your JavaScript files into a single file, and the same holds true for CSS. So if you have your CSS split across many files, combine it into a single file so that it loads faster. An extra bonus is that compression rates tend to get better as files get larger. Thus, while the amount of text is still the same, you will probably end up transferring less data and skipping much of the time wasted waiting for it to arrive.

That's where we will stop for today. Hopefully you learned a few things about web performance which you can apply to the websites you are working on. Keep in mind there are a lot more tips and tricks for improving performance; this is only the tip of the iceberg.

Private clouds are transitional

Given that I am going to be talking with the Eucalyptus guys in a week I figured it was a pretty opportune time to sit down and really think about private clouds. Much to the chagrin of some people, I firmly believe that private clouds are still clouds. In my mind the real question is where do private clouds fit in the cloud ecosystem?

Let's get one thing straight: private clouds are transitional. What I mean by this is that private clouds are not going to be here forever; they are instead filling a very important gap in the current cloud landscape. While a large number of servers could be moved into the cloud today, there are still quite a few use cases which don't allow for such rapid change.

To illustrate one such example, let's say you head IT for some Fortune 500 mega-corporation. After a few months of waiting, your purchase order for X racks of servers (finally) goes through. You estimated that with the servers you have ordered you should have sufficient capacity for the next year. Shortly thereafter you reassess the opportunities provided by the cloud and deem it fit for use by your company. So enthusiastic about the change, you even decide to join Lew Moorman and Marc Benioff on stage at some cloud event to chant "No more servers!" and pledge never to buy another one. However, upon returning to the reality of work you recall that in addition to the servers you currently have running in your data center, the new racks you just ordered are going to provide you sufficient capacity for the next year. Assuming a 3-5 year lifespan for the average server, you are looking at at least two years before you can move a majority of your servers into the cloud. Enter the private cloud. You already have the resources; there is no reason for them to sit around and gather dust. Make your own private cloud with the existing resources and expand out into the public cloud as necessary. Once the servers in your data center have run their course, toss them and move to the cloud.

The other big elephant in the room with regards to moving to the public cloud is compliance. These days it seems like every big industry has its own compliance and regulatory constraints that must be met. Whether it's PCI for credit card processing or HIPAA in the health fields, almost none of the big cloud vendors have met the requirements for becoming compliant. In fact, it's not even clear that they are trying. Unfortunately, regulation is not something that can be skirted around; it is a big-time show stopper. This means that companies in industries with regulatory requirements are going to be in a holding pattern around the public cloud for the foreseeable future. What's the next best thing? That's right, the private cloud.

While private clouds are transitional, they will be around until the aforementioned issues are addressed. For some this may be on the order of months or a year, but for others it will probably be much longer. The regulatory issue in particular is not an overnight fix; I think it is going to be a big parachute that keeps the cloud from running at a full sprint for the year. So while the private cloud is transitional, that transition period is looking like it's going to be a long one.

Rails named_scope Time Surprise

I have written previously about how much I like named_scopes in Rails, and I still do very much. After using them for some time, though, I got tripped up by an issue that surprised me a bit. I thought it would be a good idea to document it here in case others run into the same issue.

To demonstrate the issue, let's say I have an app that has a User model. On the home page I want to display a list of the users who have signed up in the last hour. This is an excellent use case for a named scope. We can start by creating a named scope called "recent" which will then allow us to simply say "User.recent" to retrieve all the recently created accounts from the database. This seems simple enough, so I went ahead and wrote it up as follows:
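
Roughly, in Rails 2.x syntax (a sketch of the pattern rather than the exact original snippet):

    class User < ActiveRecord::Base
      # BAD: Time.now.utc is evaluated once, when the class is loaded
      named_scope :recent_bad,
                  :conditions => ["created_at > ?", Time.now.utc - 1.hour]
    end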



Now you will notice that I named it recent_bad, and that is because this named scope is BAD! Take a look at the queries generated when I call recent_bad three times. Notice anything wrong? It's subtle. Note how the date after created_at, "2010-01-05 16:55:44", never changes. For effect I made the model acts_as_paranoid so you can see what the timestamp should be. What is happening here? The named_scope is defined at the class level, which means that when the User class is loaded, Time.now.utc is evaluated once and then never again. This is why the time only changes when the server is restarted. In order to avoid this issue, simply put the conditions within a lambda, as follows:
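
Again as a sketch (Rails 2.x syntax):

    class User < ActiveRecord::Base
      # GOOD: the lambda is re-evaluated on every call, so the time stays current
      named_scope :recent,
                  lambda { { :conditions => ["created_at > ?", Time.now.utc - 1.hour] } }
    end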



Now you will see that the created_at time updates as it should. It's a subtle bug and one that caught me by surprise.

Slashdot loves Cloud Computing!

One of the reasons I love reading Slashdot is because the inhabitants of their community are unlike any other. This morning I woke up to an article on the front page about a venture capitalist who was defending the incredibly popular social game Farmville. Being interested in the topic I thought I would take a gander at the comments, that is when I came across this gem (direct link):

    This needs to be the year that those of us with even the slightest degree of technical knowledge take a stand against the goddamn "Cloud".
    It sounds fantastic in theory, but once in the real world, Cloud Computing falls flat on its face. My development and ops teams wasted too much time dealing with Cloud providers over the past year. So my resolution this year is to tell anyone who proposes the use of anything Cloud to cram it. We aren't doing it any longer. It's a failed approach.
    Just last week, during the holidays, we had to scramble after one of our Cloud providers ran into some hardware problems and couldn't get our service restored in a timely manner. After the outage exceeded my threshold, I called up my best developers and had them put together a locally-hosted solution in a rush, and payed them quite a bit more than usual due to the inconvenient timing. Then I called up the Cloud provider and basically told our rep there that we are done using them and their shitty service. Then I called up the manager in our company who recommended them, and told him to basically go smoke a horse's cock.

The commenter was apparently so proud of their work that they decided to post it anonymously. Now keep in mind that the article was about social gaming; it had nothing to do with the cloud. While Farmville does run on Amazon EC2, the article does not mention or discuss that at any point. Regardless, let's take a look at this comment and pretend that it was posted in a reasonable context.

Perhaps my favorite thing about the comment is the fact that it makes these huge substantive claims yet provides absolutely zero reasoning behind them. For example, "It sounds fantastic in theory, but once in the real world, Cloud Computing falls flat on its face." Really? In what sense? You must mean Animoto scaling from 40 to 4,000 servers in 3 days. Or how about the millions of people who are using Google Apps. Both of those are certainly real world, and as far as I can tell there was very little falling on faces. In fact, the cloud helped Animoto avoid falling on its face!

As if the first claim wasn't enough, the end of the second paragraph provides a real doozie. It completely writes off Cloud Computing by stating that "[i]t's a failed approach." Well, I'd better get on the phone and tell all the people who put 64 billion objects into Amazon S3. That might take a little while...

The last paragraph at least provides a little bit of background on why this particular individual will "tell anyone who proposes the use of anything Cloud to cram it." (Well, if he is going to tell them all to cram it, maybe he can make all those calls to the S3 users for me...) This is where I start to get a little empathetic. Downtime sucks, it really does. Customers and providers can both agree that downtime sucks, since everyone loses when it rears its ugly head. My question is this: if this service is so critical, then why wasn't it built to be fault tolerant? If you are truly concerned about availability then you need to either a) build the service such that it can withstand failure or b) have an SLA in place with the provider. But honestly, who wants to do that? Instead I would recommend following the actions of the commenter, which are 1) don't take the necessary precautionary steps to avoid downtime and then 2) complain when there is downtime. What's next, eating three Big Macs a day and then complaining when you need triple-bypass surgery?

Lastly, I would like to commend the brave commenter for being brazen enough to tell a manager at his/her company to "basically go smoke a horse's cock." I can see your career blossoming as we speak.