Numbers Everyone Should Know

I was looking over a presentation titled "Designs, Lessons and Advice from Building Large Distributed Systems" by Jeff Dean from Google when I came across a really useful slide. It's the 24th slide in the presentation and is titled "Numbers Everyone Should Know". It lists the latency of some common processor/network operations:

L1 cache reference..............................0.5ns
Branch mispredict.................................5ns
L2 cache reference................................7ns
Mutex lock/unlock................................25ns
Memory reference................................100ns
Compress 1K bytes with Zippy..................3,000ns
Send 2k bytes over 1Gbps network.............20,000ns
Read 1MB sequentially from memory...........250,000ns
Round trip within datacenter................500,000ns
Disk seek................................10,000,000ns
Read 1MB sequentially from disk..........20,000,000ns
Send packet CA->Netherlands->CA.........150,000,000ns

Having numbers like these on hand is really useful for ballpark estimates. I think the biggest surprise to me was the huge disparity between a datacenter round trip and a disk seek. We all know how slow the disk is, but the fact that you can make 20 round trips within a data center in the time it takes just to make a disk seek (not even reading any data!) was pretty interesting. It reminds me a lot of Jim Gray's famous storage latency picture, which shows that if registers were how long it takes to fetch data from your brain, then disk is the equivalent of fetching data from Pluto.
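To make that kind of ballpark math concrete, here is a quick back-of-envelope calculation in Ruby (just a sketch, using the values from the table above):

# Latencies from the table above, in nanoseconds
disk_seek      = 10_000_000
dc_round_trip  =    500_000
read_mem_1mb   =    250_000
read_disk_1mb  = 20_000_000

puts disk_seek / dc_round_trip     # => 20 datacenter round trips per disk seek
puts read_disk_1mb / read_mem_1mb  # => 80, reading 1MB from disk vs. memory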

Be Wary of the Paranoid

I recall distinctly when first learning about Rails and plugins that the very first one I used was acts_as_paranoid. Something about actually deleting data concerned me, so I figured adding acts_as_paranoid to some important tables in my application would save a lot of headaches. While it is tremendously useful, it also has a pretty big unintended consequence that I think gets overlooked by most. Let's take an example query generated by acts_as_paranoid.

>> User.find_by_id(7)
SELECT * FROM `users` WHERE (`users`.`id` = 7) AND (users.deleted_at IS NULL OR users.deleted_at > '2009-11-26 02:17:28')

Looks harmless, right? What I want to draw attention to is the "OR users.deleted_at" part. In order to ensure that the user isn't deleted, it checks not only that the deleted_at field is NULL, but also that deleted_at is greater than the current time. In reality the IS NULL check is sufficient unless you are setting the deleted_at of some object to be in the future, and I have yet to see anyone actually use it that way. This is what makes the use of the current time in the query so bad: it is slowing down tons of Rails applications and most people don't even know it.

One important thing to know about the MySQL query cache is that it is pretty dumb. It caches the incoming query string exactly as written and stores its associated result set. This becomes a problem when you embed something like the current time in the query string: it functions as a cache-buster every second. At 0 seconds you make a query and it is stored in the cache, at 0.5 seconds you make the same query and it is read from the cache, then at 1.0 seconds you make the same query but it misses the cache since the timestamp has changed. This means that anything written to the query cache by acts_as_paranoid effectively has a one second expiration time. That's awful, and all for the 0.005% of users who want to expire things in the future. Not to mention that it completely pollutes the cache with old data which never gets touched a second after it's written.

Alright, enough moaning, here's how to fix the problem. Open up paranoid.rb and in the "with_deleted_scope" method rip out "OR #{table_name}.#{deleted_attribute} > ?" along with the current_time variable after it. Similarly, in has_many_through_without_deleted_assocation.rb, in the construct_conditions method, delete the same string where it is appended to the conditions variable. Keep in mind that if you are setting deleted_at to values in the future then you don't want to make this change. But for everyone else, enjoy the improvement in your query cache hit rate.
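For reference, the effect of the change on the generated SQL is simply this (using the query from above):

-- before: the interpolated timestamp changes every second and busts the query cache
SELECT * FROM `users` WHERE (`users`.`id` = 7) AND (users.deleted_at IS NULL OR users.deleted_at > '2009-11-26 02:17:28')

-- after: cacheable, and sufficient unless you set deleted_at in the future
SELECT * FROM `users` WHERE (`users`.`id` = 7) AND (users.deleted_at IS NULL)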

As a final note, for the tables which you have made paranoid you probably also want to consider adding an index which includes the deleted_at field since it will be a condition of every SQL query on that table.
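In migration form that might look something like the following (assuming a users table; include whatever other columns your queries filter on most):

class AddDeletedAtIndexToUsers < ActiveRecord::Migration
  def self.up
    add_index :users, :deleted_at
  end

  def self.down
    remove_index :users, :deleted_at
  end
end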

Updated: There is a fork of acts_as_paranoid, courtesy of mikelovesrobots, that provides the fixes I talked about above; it is available here. I'm gonna switch out my versions of acts_as_paranoid for this one, and I'd suggest you do the same.

Lacking in Persistence

A few weeks back was the very first incarnation of Cloud Camp in the Cloud. While I was unable to attend, I did get around to watching the screencast of it courtesy of @ruv (available here). There was certainly a bunch of good information and intelligent discussion, but I found one question to be particularly interesting and insightful. The astute attendee asked, "Why was Amazon EC2 designed such that instances have transient (ephemeral) storage? Rackspace has been pushing their marketing on the fact that their servers have persistent storage. Is this a big deal?" That is certainly a loaded question, but I think I'm in a reasonable position to take a shot at it.

Let's start by talking about some of the reasoning behind why Amazon would make their instances transient. The first reason is simple: making instances transient makes life a lot easier for them, particularly given their scale of operation. If they wanted to make instances persistent they would need to replicate that data at least twice, if not three times, with at least one replica in another data center. Imagine the amount of traffic that would be needed to keep a write-heavy database server consistent across multiple disks. Also, shipping all the data for that last replica over the intertubes means saying hello to Mr. Latency. Ever notice that Amazon's EBS volumes are replicated only within a single availability zone? I would be willing to bet this is because of network traffic and latency concerns.

One thing you will notice about a lot of the cloud providers that offer persistent instances (e.g. vCloud Express, ReliaCloud) is that they break that nice little "pay-per-use" model everyone is so fond of. In order to maintain those persistent instances most providers charge you even if they are "powered off". ReliaCloud instances are roughly half price when powered off, while vCloud Express charges the full price. Rackspace has a somewhat different approach. First, they RAID 10 the disks, which should make failure less likely. In addition, they claim that if a failure occurs they will automatically relaunch your instance for you, complete with data. How long does that fail-over take? They don't say, and I have a hard time believing you will get a guarantee from them.

Taking a step back, what you will find is that Rackspace is really the midpoint between EC2 and vCloud in terms of persistence. On EC2, if an instance fails it is gone along with your data (unless you're using EBS). On vCloud, if your instance fails or you power it off it still persists. Rackspace falls in between in the sense that if your instance fails it will come back (with some delay), but if you shut off your server it's gone along with its data. Thus, the only real way to make your data persistent on Rackspace is to keep the server running (at full price) or dump it into CloudFiles. This points out one of the really nice benefits of EBS, which is that you can have persistent data without needing an instance to store it on (read: cheaper). But why would you want to store data without an instance attached to it, you might ask? It's simple: there is probably some portion of data that you would like to keep persistent (e.g. a database), and while you could dump it onto CloudFiles/S3, reading that data back onto a newly launched instance can take a loooong time. This is what it was like on EC2 pre-EBS and it wasn't pretty.

Now for the trickier part of the question, which is whether this persistence makes Rackspace a more attractive cloud infrastructure service. Having machines come back after failing is certainly a nice feature, but you still need a second server for fault-tolerance if you want to attain reasonably high availability. While the automatic relaunch may take some time, it is almost certainly faster than the time it would take you to figure out that your EC2 instance has died and hustle over to your laptop to fire up a replacement. The bottom line is that it is probably not available enough for any reasonably sized service to rely on without a proper backup. On the other hand, if you are hosting something like your blog where a few minutes of downtime isn't critical, then it can be a handy feature.

On paper it certainly looks nice, but in my opinion it's not really a huge benefit. In all the time I have used EC2 I have only ever seen a few instances fail, and that was after I had accidentally set the Java heap size to be 10x the available memory and then ran a big Hadoop job. The machine was thrashing so hard I'm not surprised it died. Aside from that (very extreme and operator-induced) case I have never seen an instance fail. In my experience EC2 instances simply don't fail frequently enough for this to be a big deal to me.

To be honest I think the fact that Rackspace allows you to "grow" your instance size is a much more attractive feature, but that's for another post...


For an interesting and thoughtful comparison of EC2 and Rackspace I would take a look at this blog post. He does a good job hitting on many of the important points and even agrees with my thoughts on the benefits of EBS.

Boy and Girl Birth Rate Interview Question

I came across an interview question which is pretty interesting and a bit of a head fake (in my opinion). The question is...
"In a country in which people only want boys, every family continues to have children until they have a boy. if they have a girl, they have another child. if they have a boy, they stop. what is the proportion of boys to girls in the country?" --courtesy of fog creek software forums
After thinking about it for a little while I thought I had the answer, but of course I was wrong. After reading the right answer I still couldn't quite convince myself that it was true, so I figured I'd test it out.
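My test was a quick simulation along these lines (reconstructed as a sketch in Ruby, assuming each birth is an independent 50/50 coin flip and every family stops at its first boy):

boys = 0
girls = 0

100_000.times do
  # each family keeps having children until the first boy
  loop do
    if rand(2) == 0
      boys += 1
      break
    else
      girls += 1
    end
  end
end

puts "boys: #{boys}, girls: #{girls}, ratio: #{boys.to_f / girls}"

The ratio comes out right around 1:1, which makes sense once you notice that every individual birth is still a 50/50 event no matter when families decide to stop.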



Sure enough the answer generated matches up with the explanation provided in the link above. So there you have it...

Installing RubyGems faster

One of the things I learned over the summer while using Rails quite a bit is that the gem installation process can be a slow one, especially when you are installing lots of gems. In many cases it is not even installing the actual gem that takes all the time, it's generating the rdoc and ri documentation. In my experience very few developers actually use the rdoc information on their local boxes, and no one should be looking at rdocs on your production environment, so why bother installing them?
You can prevent those from being generated by adding flags to the end of the install command (e.g. "gem install rspec --no-ri --no-rdoc"). This is nice, but I seem to always forget to add the flags until it's too late and the gems are already installing. This can be fixed by adding the flags to your gemrc so it happens automatically. Simply open your ~/.gemrc file and add the following line to the end: "gem: --no-ri --no-rdoc". My .gemrc was created by root so I needed sudo to edit the file, but this may not be the case for you. Just for reference, my .gemrc now looks like this:

---
:sources:
 - http://gems.rubyforge.org/
 - http://gems.github.com
:benchmark: false
:backtrace: false
:update_sources: true
:bulk_threshold: 1000
:verbose: true
gem: --no-ri --no-rdoc

As a quick test I installed the cucumber gem on my local box without the flags and it took 31 seconds. After changing my gemrc to include the flags the same installation took 13 seconds, a pretty nice improvement. If you are deploying your app in an environment like RightScale where your machines are configured at boot time, I would certainly include that line in your gemrc; it should speed up the gem installation process a good deal.

The cloud solves a lot of problems, stupidity isn't one of them

The past few weeks brought some really unfortunate news for users of T-Mobile's Sidekick phones. It started with an outage of their data service which began on Friday the 2nd and lasted four dreadfully long days. During that time users couldn't access the Internet or, more importantly, their contact information, since that information is all stored remotely. News only got worse this weekend when Danger (the company that makes the phone, also a Microsoft subsidiary) announced that the data not stored on the phones "almost certainly has been lost" and that the chances of it being recovered were "extremely low".

A big undertone to this event has been "Should we continue to trust cloud computing content providers with our personal information?" (from this Slashdot article). Many people have pointed out that Microsoft purchased Danger about a year ago, and that this catastrophic data loss casts a cloud over Microsoft's soon-to-be-launched cloud platform, Azure. This is where I take issue.

First off, since when has the simple act of storing data remotely constituted cloud computing? Regardless of which definition of the cloud you subscribe to, it probably has the words "virtualization", "elasticity", and "pay-per-use" in it somewhere. I don't see any of those three things, or any other cloud-like properties, which would lead me to believe that Sidekick == cloud. However, let us take this ridiculous assumption that the Sidekick is cloud and continue with it.

While there has been no official announcement from Danger regarding the cause of the data loss, word has surfaced that it was the result of a botched SAN upgrade. While things certainly can go very wrong when messing with a SAN, the kicker is that no backup was made prior to attempting the stunt. As far as I can tell, no backups were made at all (or at least none that worked). Like the title says, the cloud solves a lot of problems, but stupidity isn't one of them. With the cloud's seemingly unlimited storage and minimal cost, it's just plain stupid not to make backups of any important data. Better yet, get it all nicely encrypted and use something like the Simple Cloud API to back it up to multiple storage providers. Why not? In the long run the cost of keeping tons of backups of that data is so trivial that it shouldn't warrant a second thought.

Given that Danger was purchased by Microsoft, many have now brought up the question: how does this affect Azure? The answer: it doesn't. If you have ever worked for or dealt with a large company you know that it takes a long time to get anything done. It has only been 18 months since Microsoft purchased Danger, so I have a hard time believing that much changed for Danger aside from the sign on the building (if that). This is particularly true of system architecture, where things are so complicated that the old adage "if it ain't broke don't fix it" often holds true until the very bitter end. It is pretty clear that the Danger infrastructure wasn't running on Azure and the two have very little in common. I think the only real impact this incident will have on Microsoft is that they will receive more questions about the reliability of their data storage. Their response will be that they replicate data three times across multiple geographically separate data centers (this is just a guess), everyone will report this back to their CIOs who will approve, and then everyone goes home happy. End of story.

Updated: Microsoft has just confirmed that they have been able to recover most, if not all, of the data lost in the Sidekick outage. That's great news for Sidekick customers; while they were without a usable phone for a few weeks, getting their contacts back is a big win. As it turns out, they did have some backups in place and were able to recover from them, although it took quite a while (and is still ongoing). While ultimately little or no data may have been lost, the damage has been done.

Fun with named scopes in Rails

One of the features that I definitely had no idea about when I first learned Rails was named_scope. I went back and took a look at some of the old projects I had worked on, and I frequently found myself writing a bunch of extra finder conditions or one-off methods to express the same few queries over and over.
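Something along these lines, using a hypothetical Post model purely for illustration:

# Without named scopes: the same finder options repeated all over the app
Post.find(:all, :conditions => { :published => true }, :order => 'created_at DESC')

# ...or wrapped up in one-off class methods
class Post < ActiveRecord::Base
  def self.recent_published
    find(:all, :conditions => { :published => true }, :order => 'created_at DESC')
  end
end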

That is a relatively simple example, but since these conditions are used frequently I can improve on it by using named scopes. Named scopes allow you to encapsulate finder arguments into simple, chainable, and efficient methods. Here is what the named scope definitions and finder look like:
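Sticking with the same hypothetical Post model, the definitions and finder look roughly like this:

class Post < ActiveRecord::Base
  named_scope :published, :conditions => { :published => true }
  named_scope :recent, :order => 'created_at DESC'
end

# Chainable, and the conditions get combined into a single query
Post.published.recent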

While these could be defined as standard class methods, you lose a lot of the power and flexibility which named scopes provide. With named scopes I can add additional arguments or conditions to any of the above methods. For example:
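For instance, layering extra finder options on at the call site, or giving a scope its own arguments via a lambda (current_user here is just a stand-in):

# Extra options can be tacked onto the chained scopes
Post.published.recent.all(:limit => 10)

# Scopes can also take arguments
class Post < ActiveRecord::Base
  named_scope :written_by, lambda { |author| { :conditions => { :author_id => author.id } } }
end

Post.published.written_by(current_user)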

If you are using Rails 2.3, it adds a feature called scoped_by which will dynamically generate a lot of the boilerplate scopes that you would otherwise need to write.
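The dynamic scopes are generated from the column names, so with the hypothetical Post model they look something like this (from memory, so double-check the naming):

# Rails 2.3 generates scoped_by_* methods from the column names
Post.scoped_by_published(true)
Post.scoped_by_author_id(42).scoped_by_published(true)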

If you aren't using Rails 2.3, or aren't crazy about that syntax, there is also a Ruby gem which provides a lot of the same functionality in an arguably more elegant manner. It's called Pacecar and it's made by the guys over at Thoughtbot. Using Pacecar, the above example would look like:

That looks a lot more elegant to me. So get out there and start using some named scopes.

Ballmer's iPhone stunt could hurt Microsoft long-term

Just about a week ago news came out that Microsoft CEO Steve Ballmer had a strong reaction to an employee who tried to take his picture using an iPhone. At the company meeting in Safeco Field, an employee was trying to take his picture when Ballmer grabbed the device, made a few remarks about it, and then pretended to stomp it into the ground.

Anybody who has ever seen Ballmer speak knows that he is a passionate guy. While most people can appreciate the passion and enthusiasm that he brings to the table, there are times when this can bite you. I think this was one of those times. Let me explain...

Prior to last week, Microsoft had actually made some inroads on the iPhone with the release of the SeaDragon and Microsoft Tag apps. While I would not consider these everyday-use type apps, they were a pretty clear step toward embracing the iPhone platform and utilizing its strengths. I played with the SeaDragon app quite a bit and it really is a perfect marriage of the two technologies (e.g. pinching to zoom on large images).

In my eyes those inroads have now been left to rot as a result of Ballmer's actions. Think about it this way: say you are a Microsoftie who has a great idea for an app which really showcases a Microsoft technology on the iPhone. Would you be willing to stand up in front of Ballmer and pitch the idea? That idea had better be an incredibly good one with a ton of potential, otherwise...

Who does this really hurt? Search. Contrary to what many people believed, Microsoft has really made a big comeback in the search space with Bing. In fact, it was this past August that Bing surpassed 10% market share, a pretty big milestone in its growth, along with a nice month-over-month improvement. Currently desktop search dominates the market, but anyone with a little bit of foresight will tell you that mobile search is and will continue to be a big growth area in the near future. Unfortunately, Bing's presence in the mobile market (particularly the iPhone) is...shall we say...weak. Yeah, they have a mobile site, but that's about it. Unlike Google and Yahoo!, Bing has no search app as far as I can tell. Furthermore, you can't even set Bing as your default search engine (Google and Yahoo! are currently the only options).

Now if you're Bing, fighting for each point of market share and dumping tons of money into the "search overload" ads, why not get a few developers together and have them make a Bing iPhone app? One word: Ballmer.

When being anti-web is just not cool: Fever RSS

I have been a pretty avid user of RSS for a few years now and as a technology it is awesome. Just about everybody knows that pull technologies are much more efficient than push, and that's what makes it possible to consume so much more information. To show you what I mean: looking at my Google Reader, I am currently subscribed to 80 feeds and in the past 30 days I've read over 6,000 articles (~200/day). While I really like RSS, the thing I found missing was intelligence in RSS readers. Instead of just displaying all the information in reverse chronological order, how about ranking it based on how interested I am in each article or feed? Sure, that is a little more complicated, but in a world where there's tons of information it is a lot more efficient. The reality is that it is just a big text mining and ranking problem with the data right there in front of you, so get cracking. I'm genuinely surprised Google hasn't tried to tackle this problem since it is right in their wheelhouse, but that is a whole other story.

Over the past year or so I have been keeping my ear to the ground for new RSS readers which try to do what I described above. Then about a month ago I discovered a tool which was almost exactly what I was thinking about, called Fever. Using temperature as a nice metaphor for the things you are interested in, Fever displays the information aggregated from your feeds based on how frequently an article is linked to. It is basically a personalized version of Techmeme. Looking at the screenshots it has a really nice design aesthetic, which I can definitely appreciate, and it even has an iPhone app. At this point I am looking for the nearest input box, chalk one up on the conversion rate, I'm ready to use this thing. All jazzed up to get my RSS on, I'm greeted by the ugly credit card monster. An RSS reader that costs money, scoff. But given how nice it seems, I figured I might consider spending a few bucks a month on such a nice tool. So how much does it cost? $30. That seemed like an odd price, and it is; the reason being that Fever isn't a hosted service, it's a desktop client. You plop down your $30 for a license (you remember those things, right?), get yourself the software, and use away, right? Not so fast. If you take a look at the answers section you will see the following:

What are the server requirements for Fever?
Fever requires a Unix-like server (no IIS) running Apache, PHP 4.2.3+ (preferably compiled with mbstring and GD with PNG support) and MySQL 3.23+.

That's not your standard desktop software requirements list (e.g. dual-core processor, 1GB of RAM) by a long shot. Instead, Fever requires that you set up your own LAMP stack to host the stinking thing. While the software is nice, there is no way in hell I'm gonna set up and administer a server just to run it; that's just too much. Even with the variety of cheap VPS hosts out there, it's just not worth the effort and cost.

After I gave it some thought, Fever really is the perfect app for hosting. It's way too much of a pain for a single user to set up, the requirements are fairly high, and you can pretty easily do database multi-tenanting. I looked around online to see if anyone was hosting Fever but came up completely empty. Being a bit of an entrepreneur, I figured this is something I could pretty easily set up, so I went ahead and contacted the developer to see if he was interested in allowing others to offer Fever as a hosted service. The terse response I got was "I have no interest in offering Fever as a hosted application or in partnering with another party to offer Fever as a hosted application." Harsh.

Needless to say, I was pretty disappointed. Not only did I lose out on the opportunity to build a cool little hosting business, but I lost the opportunity to actually use a really cool product. This is where I get back to the title and why being anti-web is so uncool. There was a time when software licenses and hardware requirements were the norm, and during that time many of today's software mega-corps (e.g. Microsoft, Oracle) were built. Unfortunately for them, and much to the benefit of users everywhere, that time is rapidly coming to an end. Desktop software is simply a pain for everyone involved. Developers don't like it since it's incredibly difficult to make software for all the different platforms, plus there's all the complication of pushing out updates to users. Users don't like it because they have to fork over an arm and a leg to buy it and the onus of keeping it updated falls on them. Now I'm not going to fall off the deep end and say that all software is going to be hosted, but there is little reason that a significant portion of it can't live in the cloud. There are always exceptions for things like operating systems and really complex, resource-heavy software, but those make up a minuscule fraction of the software in use today.

The moral of the story is that if you are making cool software that people want to use, do everything in your power to allow them to use it. If you want to charge, that's fine, but give it to users in the way that they want to use it, and if you don't have the resources or desire to do that, by all means allow others to do so.

The Low Down On Cloud Brokers

The notion of a Cloud Broker is a new and interesting one, and I believe it is something which will be talked about increasingly often in the near future. If you don't believe me, the folks over at Gartner have discussed the importance of this role in the cloud computing space:

"The future of cloud computing will be permeated with the notion of brokers negotiating relationships between providers of cloud services and the service customers. [...] Enhancement will include managing access to these services, providing greater security or even creating completely new services." -- Frank Kenney

Given that they are going to prove important, it's probably worthwhile to take the time to discuss what these cloud brokers actually do. In my eyes, cloud brokers are an abstraction layer between the end user and the many cloud services at their disposal. While the notion could be applied in many different contexts, when I say cloud services I am referring to Infrastructure-as-a-Service providers (e.g. Amazon, Rackspace, GoGrid, Joyent). With a plethora of cloud providers, each with their own API, set of services, pricing model, and so on, it would be quite cumbersome for the end user to programmatically access each service. Instead, the cloud broker creates a layer of abstraction between the user and the providers so that the end user sees one cohesive view of all of the services. This way the customer doesn't have to worry about the nitty-gritty like the different REST calls required to create a server; they just hit the launch button and a server appears on the desired cloud. This is roughly the state of cloud brokering today from companies like RightScale, CloudKick, and Elastra (give or take a little bit). Yes, there is some work left for these vendors to do to support more clouds, but it's only a matter of time before they have most of their bases covered.
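Purely to illustrate the idea (this is not any particular vendor's API, just a hypothetical sketch), the abstraction layer boils down to something like:

# A hypothetical broker sitting in front of several provider adapters
class CloudBroker
  def initialize(adapters)
    @adapters = adapters  # e.g. { :ec2 => Ec2Adapter.new, :rackspace => RackspaceAdapter.new }
  end

  # The user just asks for a server; the adapter handles the provider-specific
  # REST calls, authentication, image ids, and so on.
  def launch_server(provider, options = {})
    @adapters[provider].launch(options)
  end
end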

While that is the current state of affairs, the future of cloud brokers is a lot more murky (at least in my mind). I think one of the big ideas many people have for cloud brokers is that they are going to become intelligent tools which the user tells what they want, and the broker then figures out the best way to get it for them. So if I need a server and I need it now (at whatever cost), it will pick the appropriate cloud. Similarly, if I just want a server at some point this week but I want it cheap, then it will wait around until it finds a good deal and then start up that server for me.

On the surface that seems all hunky-dory, but in practice there are two really big issues with delivering such a vision. The first is transparency. Any time you are building an intelligent system which makes decisions on behalf of the user, it is absolutely critical that the decision-making process be completely transparent to the end user. There are a lot of contexts where transparency is not terribly important (e.g. search results), but the second a user's credit card is being charged the game changes in a big way. It is almost guaranteed that at some point you will get angry customers calling in asking why your system made decision X when they expected decision Y. This is not the time to start having to explain the nuances of your cost-based least-squares weighted estimation algorithmic ridiculousness. You might even try to be tricky and come up with some metric which amalgamates a ton of different factors into one nice number scale (e.g. "Our system computed a doodle factor of 8.3 and as a result launched a new server"). While having that one number to point at is a nice trick, it generally does little to hide the formula used to compute it, which inevitably becomes filled with unexplainable magic numbers. I believe the result of this is going to be extremely simple systems which, instead of trying to make the complex decisions themselves, do their best to present the user with as much relevant information as possible and let the user decide. In the case where the user wants a server very quickly, one could imagine a system presenting the current cost and average boot time of a server on each of the clouds and then allowing the user to choose. The money vs. speed trade-off is a complex decision, and ultimately only the user really knows how much they are willing to trade of one for the other.

The other major issue I see with these systems is that without variable pricing the utility of an intelligent broker decreases greatly. As far as I know there are no cloud vendors out there changing their prices each day/hour/minute to account for changes in supply and demand. While that would be an interesting option, and one any economist would be proud of, I think vendors have stayed away from this model because of the lack of transparency it creates. What was once brain-dead simple, 10 cents an hour, now becomes a decision requiring a lot more consideration. How much does it cost now? Do I really need it now? If I wait will it get cheaper? And so on. Eventually I think someone will try something along these lines, and while I'm not a huge fan of the idea I wish them the best of luck. Getting back to the original point: until we start seeing clouds employing truly dynamic pricing, the idea of an intelligent system making these decisions for me is a lot less necessary.

Ultimately I think the cloud broker space is one that is going to grow into many areas and serve many niche markets. However, as it continues to grow, don't be surprised as they all continue to keep a pole's length between themselves and the all-knowing sentient system of doom.

Moving from SVN to Git - a user's perspective

Just about 6 months ago I took a grad class on Scalable Web Services. The main portion of the class was to 1) come up with a cool app idea, 2) develop it using Ruby on Rails, and 3) make it scale. Very cool class, I enjoyed it quite a bit and it definitely sparked my interest in Rails and web performance, both of which I have learned a lot more about since then. After the class there was some discussion about the technologies used, and one thing most of us agreed on was that git was frustrating and difficult to use. The good news is that in the months since that experience I have used git quite a bit more and as a result I am really starting to like it. So much so that if I were to start a project today I would use git without a doubt.

Given that I pulled a full 180 on my opinion of git, I took a step back and started to think about what it was that caused my opinion to change. When I really thought about it, the conclusion I came to was that the main difference was my mental model of git. Let me explain...

I would think that I am probably similar to a lot of folks in that up to that point I had used SVN a decent amount, a few projects here and there along with some personal projects/data. As a result, my mental model of version control was pretty simple: check out a repo, then 1) update, 2) make your changes, 3) commit, and go back to (1). It's pretty simple and I suspect most people who have used SVN would agree. Given this mental model, when I started using git I really thought of it as SVN with another step: you do steps 1-3 above and add 4) push to remote. Like I said, same as SVN just with an additional step. Anyone familiar with git will tell you that if you are using that model you are missing out on some of its best and most powerful features, namely branching. While SVN provides the ability to make branches, I never really got the sense that they are a first-class object in the way that they are in git. To get a sense for why branches are important, here is my workflow for a standard task that I am working on:

1. create branch
2. write failing tests
3. write code to make tests pass
4. commit
5. repeat from 2 until the task is complete
6. merge branch back into master (smushing commits)

Anyone who is familiar with TDD will recognize steps 2 & 3 as the standard red-green loop, but there are two other important things to note. If you are working in a tight red-green loop (which you should be) then step 4 is gonna result in a lot of commits which, although the tests pass, are probably not something you want committed into the record books forever and ever. In the standard SVN model you could either make all of these commits and spam the commit log (and merging, ugh) or alternatively just not commit. The former is definitely undesirable, and so is the latter since it's important to be able to go back to the last working state. The nice thing is that git gives you both: multiple commits you can revert to while working, without spamming up the commit log. What you notice is that in step 6, when we merge back into the master branch, we can either rebase (moves all the commits over) or, my preference, merge without committing (smushing the commits together). This allows you to take all the commits you have made, smush them together, and review what you have changed before you actually commit it to the master branch. For those of you wondering what I'm actually doing in step 6: assuming your branch is called "featureX" and has all the commits on it, you would check out the master branch (git checkout master) and then do a merge using "git merge --no-ff --no-commit featureX". Then you should have all the changes you made on the featureX branch staged and ready to be committed, after you review them of course. It also helps that Macs have GitX, which is one of the nicest diff tools I have seen to date.
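Put together, the whole branch-per-task flow looks roughly like this (featureX being the example branch from above; if you literally want the individual commits collapsed into one, git merge --squash is the alternative to --no-ff --no-commit):

git checkout -b featureX                  # 1. create the branch
# ...red-green loop: write a failing test, make it pass, commit, repeat...
git commit -am "make the next test pass"
git checkout master                       # 6. back to master
git merge --no-ff --no-commit featureX    # stage the merge without committing
git diff --cached                         # review everything before it lands
git commit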

One other awesome feature of git that I use all the time (read: way too much) is stash. How many times have you made a change to something that you didn't want to commit but just wanted to keep around in case you needed it later? I seem to need this all the time, whether it's simple config changes or some silly snippet I want to try out. If I want to keep that code around without having to commit it I can just do "git stash save some silly snippet" or, more concisely, "git stash" (the former adds a label to it). Any time something like that comes up you can just stash it away for later use.

So if you've been scared to make the switch from SVN to git, don't be. It won't bite.

Linux tools that make you go zoom

Over the past few years I have used Linux almost exclusively for programming, and over that time I've gotten comfortable with a lot of the standard utilities. Want to find a file? Use find. Want to find a regexp in a file? Use grep. It almost seems like a given. Until a little while ago I didn't even question whether there were alternatives since they are pretty much the standard go-to tools and have been around forever. More recently I have begun working on larger source trees which I am less familiar with, and as a result I find myself using these tools a lot more frequently. Whereas previously waiting a bit was acceptable, it quickly became a drag, and I went looking for better alternatives.

I know it seems like a bit of heresy to replace the beloved grep but honestly ack makes it worth it. To show you the difference I'll do a quick search for the string "foobar" in a source tree that's ~500MB and ~9000 files.
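The two commands being compared were essentially these (the timings I quote below came from my tree; yours will obviously vary):

time grep -r "foobar" .    # recursive grep over the whole source tree
time ack "foobar"          # ack, same search, same tree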

Yeah. Seriously. The recursive grep on the directory took over 40 seconds to make it through, while ack took under half a second. Any time you get a two-orders-of-magnitude speedup and it's more convenient (fewer characters to type), I find it hard to argue against. The nicest thing about ack is that it automatically ignores all the standard version control directories (e.g. .svn) which usually clutter up normal grep results with unwanted matches. Of course all of that is configurable. So how do you get this magic tool? It's simple:
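ack is a Perl tool, so something along these lines should do it, depending on your platform:

sudo cpan App::Ack              # straight from CPAN
sudo apt-get install ack-grep   # Debian/Ubuntu package it as ack-grep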

If you're a TextMate user you should be happy to know that there is an awesome bundle called Ack In Project which allows you to find text in files really quickly. I've been using it and found that the interface is awesome and much faster than the grep script I was previously using. The only downside to ack is that its default colors are pretty ugly, but that can easily be changed via environment variables.

The other thing I frequently need to do is find files in the file system. It's always something like "where is the config file for X again?" Since those things are frequently in the most bizarre of places, I usually end up searching the entire file system, which takes forever. Take a look...
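The comparison was essentially the following (the file name is just a stand-in; note that locate's database is refreshed by updatedb, usually from a cron job):

time find / -name "my.cnf" 2>/dev/null   # walks the entire file system
time locate my.cnf                       # hits a prebuilt database instead
sudo updatedb                            # refresh the database if it is stale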

Yep, find takes a whole 5+ minutes to get through the file system, plus it spits out a bunch of permission-denied messages which are really unnecessary in this context. On the other hand, locate speeds through and finishes in a second and a half. Nice. The secret sauce is that locate uses a prebuilt database of the file system to find things much quicker. Since it's a standard Linux utility and the database gets built anyway, why not take advantage of it? It should be noted that locate always searches its database of the entire file system, while find can be pointed at a specific directory, so if you are searching a small folder find is usually faster while locate stays pretty constant.

I'm a fan of the standard Linux utilities, but any time you can find a solution that really speeds up something you do frequently, with little or no downside, it seems almost foolish not to take advantage of it. What are you waiting for?

How to handle mistakes

We've all heard the story far too often: company X makes a major mistake and does something they shouldn't have done, customers get outraged and start making a major hoopla, and only a few weeks later do they receive any admission of the mistake or maybe a weak apology. This type of situation happens way too often to count and really no one wins. Customers are usually left upset and companies are left with a tarnished image.

Last week Amazon made a bit of a slip-up when they decided to delete copies of George Orwell's 1984 and Animal Farm from all Kindle devices. Apparently the copyright holder claimed that the version of the novels being distributed violated copyright law. Amazon complied with the request by remotely deleting all copies of the novels and then refunding customers the price they had paid. Amazon did notify users that the novels had been deleted and their money refunded, but only after the fact, which didn't help things much.

So as the story goes above, customers get up in arms, allusions to Amazon being Big Brother are being typed faster than you can imagine, things are quickly getting ugly. What does Amazon do? They have Bezos himself post an apology on the Kindle forums. Short, sweet, and he flat out admits that they got it wrong, no ifs, ands, or buts. While I applaud them for addressing the issue in such a straightforward manner, they made quite a daring decision in posting that apology in a public forum while allowing open comments. When I first saw that there were comments I no doubt expected them to be filled with much of the same venomous Orwell references that littered much of the press on the event. Much to my surprise, the vast majority of the comments I read were far from poisonous; in fact most of them were thanking Bezos for apologizing.

Being a skeptic, I thought that perhaps it was some of Amazon's famous "is this post helpful" ranking magic getting some of the nastier stuff out of there. After clicking through the first 5 or so pages of comments (including numerous ones which apparently weren't very helpful) there was little change. I'll admit that I didn't go through all 27 pages of comments and I'm sure there are quite a few less-than-helpful posts towards the end, but regardless, I have to admit that posting on a forum with open comments was an incredibly ballsy move, and it seems to have worked out. Bravo.

On being Anti-Anti-SQL

There seems to be a rift forming as a result of a new movement of people who are "Anti-SQL" or "Anti-RDBMS". This article, for example, talks about the first meeting of what is being called the NoSQL community, and they are not alone. There are plenty of articles and blog posts online about "Thinking Beyond the Relational Database", "Ten reasons why CouchDB is better than MySQL", and "Beyond MySQL, a paradigm shift from RDBMS". Noticing a pattern? While I'm sure there are many reasons why people are Anti-SQL, the key points of contention are generally (in no particular order):

1. It doesn't scale well in terms of request rate
2. It doesn't scale well in terms of data size
3. Too many unnecessary features (e.g. joins, transactions)
4. They are slow

The overarching argument here is that because relational databases are very general-purpose and have to support just about every use case, they suffer as a result, either in terms of performance or scalability. This observation has led members of the anti-SQL camp to take the opposite approach: start with a minimal feature set and only add what is necessary. This approach has resulted in key-value stores like BerkeleyDB, Tokyo Cabinet, and memcached, column-oriented databases like BigTable and its open-source counterparts HBase/Hypertable, document databases, graph databases...and so on. While each of these datastores is unique in its own right, I'm not going to spend time discussing the relative merits of one over the other and will instead point you to the awesomely titled (and informative) talk "Drop ACID and think about databases", this concise writeup which compares their relative features, and this very matter-of-fact review of many of them.
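As a concrete example of how simple the key-value model is when it does fit, here is roughly what talking to memcached looks like from Ruby with the memcache-client gem (server address and keys are just examples):

require 'memcache'

cache = MemCache.new('localhost:11211')
cache.set('user:7:profile', { :name => 'Alice' }, 300)  # expires in 300 seconds
profile = cache.get('user:7:profile')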

Having used a large number of such datastores, I can confidently say that if the use case you are looking to fulfill can be met by one of them, there are some great performance gains to be had. Performance gains are awesome, but the crucial part of the previous sentence is whether you can find a datastore that meets your use case. Like it or not, we have all become accustomed to the relational database. It is an important part of just about every web framework and the crucial component of a large portion of web applications (chicken-and-egg discussion left as an exercise for the reader). While there are likely some cases where problems fit brilliantly into these new systems, there are many which simply do not. My advice on this is simple: if you need the performance/scalability or it is a natural fit for your application, then by all means go ahead, but the second you have to start employing any sort of trickery to get your app to fit into that model, stop right there; you are fighting a battle that you've likely already lost.

This leads to my last point and the one that inspired the title of this post, which is my stance of being Anti-Anti-SQL. That is not to say that I am against non-relational databases; I am in fact a very big believer in them and have extensive knowledge of quite a few. Instead, I have an issue with those who are trying to demonize the relational database into something that is outdated, decrepit, and woefully inept at the task it performs. Granted, the relational database is not without its issues, but realistically the number of applications which would fit better into a non-relational database is fairly small, and the number which require one for performance reasons is even smaller. There are the Googles, Amazons, and Yahoo!s of the world, but for every one of them there are an infinite number of tiny web services who could only dream of hitting the scalability bottleneck of their database.
If the Anti-SQL community truly feels like the relational database needs to be sent out to pasture, then what they need to do is simple: change the way we (application programmers) think. Almost all web applications being developed today assume the existence of a relational database and code accordingly. Non-relational databases are generally not considered until 1) it becomes a necessity as a result of scale/size or 2) the developers previously experienced 1). Instead, the goal should be to make the non-relational database a first-class citizen, the default unless a relational database is absolutely necessary.

This is clearly not an easy task, but in order to have a chance it is crucial to focus on the things that can make a difference. If the adoption of high-level scripting languages has taught us anything, it is that people are willing to sacrifice performance for ease of use. While you could spend time bit-twiddling your C code trying to eke out a 0.5% improvement, instead spend that time writing a good Ruby/Python/PHP library for interacting with your datastore. In essence it all boils down to this:

While performance gets your foot in the door, usability makes the sale.

Got Infinite Scalability?

I was perusing the usual RSS feeds today when I ran across a link to a New Yorker article written by Malcolm Gladwell. The article is apparently a pretty lashing review of a book titled Free by an author named Chris Anderson. Since I'm a pretty big fan of Gladwell's work I thought I would head on over and take a gander. Only one problem: the website has been timing out for the past 30 minutes. That's right, the New Yorker is likely getting more traffic right now than it received all of last month, yet its servers are seemingly balled up in the corner weeping and asking it to stop. Considering the fact that the article is currently linked to by Slashdot, Techmeme, and Silicon Valley Insider (and probably a ton of literature-focused blogs), I can't say I am surprised that they can't handle the traffic; that's a flash crowd and a half.
It's situations like these where having the ability to scale out your website is quite critical. It's not like they didn't know it was coming...they must have seen a decent uptick in traffic when the Silicon Valley Insider article and Techmeme link came out yesterday morning. I'd be very surprised if there wasn't some indication that things needed to be ramped up.

The cynic would note that flash crowds from Slashdot are notoriously difficult to deal with and that one gets almost no warning. While this is true, the Slashdot article came out over six hours ago. By now they could have gone down to the local Fry's, bought some cheap servers, formatted the hard drives, and had their entire stack up and running if they wanted to. From the looks of a quick whois on their domain it seems like they are hosting their own servers. Whoops.

Since it is apparently my new favorite website, I ran a Pagetest on their site (sorry, servers) and the results go to show that what I was talking about in my last post is pretty damn important. If you take a look at those results you will see a waterfall so big you'll want to swim in it. After 11 seconds and an impressive 123 requests, the website should be nice and loaded for everyone to enjoy. Alright, first-load performance is hard, but at least they are using expire headers, right? Oops. Yep, that's right, the second page load takes a whole 116 requests, and just about all of them return 304s (Not Modified). The fact that it's not actually fetching the data makes the second load a little faster (~7 seconds), but it's still pretty slow, and when you are getting tons of traffic, having to serve over 100 requests per page load makes a server quit on you much faster than it should. It should be clear that they aren't using multiple asset hosts or anything fancy like that, but what is the kicker of the whole thing? Not only are they loading a zillion objects, but take a guess where their server is located. Dallas? San Fran? Virginia? Nope, how about the Netherlands. And there is my cue to stop trying to explain what on earth they are thinking...

Optimizing Blog Performance For Fun and Profit

At some point over the last year I have developed into a speed freak. No, not the kind who hangs out in back alleys doing questionable things, the kind that wants to make websites fast. Just in the past year or so I have picked up a ton of relatively simple techniques for 1) figuring out why a website is slow and 2) improving its performance. I was taking a look at my blog this evening and while I did not feel like it was particularly slow, I knew that there were a lot of things which could be improved, so I thought it would be a good exercise. Since it is a relatively simple and closed experiment, it seemed worthwhile to document here so that others can use it as a guide, so enjoy :)

Let's first start by determining why my blog is slow so that we can focus on what could yield the most improvement. If you have not heard of or read Yahoo!'s "Best Practices For Optimizing Websites", now would be the opportune time to do that. That document has become a Bible of sorts for website optimization; it does a very good job of thoroughly covering all the different aspects. One thing to note is that in many cases there are practices which simply don't apply or are not an option in your context (e.g. use a CDN); don't worry, that's fine. It's definitely not an all-or-nothing system, so do as many as are feasible.

Now that we've read all of these best practices, we want to see where our site does not follow them. For the rookie optimizer, you may be surprised to learn that there are actually quite a few utilities available for studying website performance. Not surprisingly, the most popular is a Firefox plugin named YSlow which was developed by the folks at Yahoo!. Not to be outdone, Google has created their own Firefox plugin called Page Speed; while the interface and rules are slightly different, it is similar to YSlow in more ways than not. Both of these plugins are great for quick checks, but what you will often notice is that performance can vary greatly from refresh to refresh, which can be quite frustrating when you want to find out if change X actually helped or hurt your website. As a result, I use the terribly-named Pagetest; it follows the same rules as YSlow but provides additional useful functionality like running trials multiple times, visiting multiple URLs, and my favorite, the waterfall chart.

So I went ahead and put my website through Pagetest, had it do three runs, and then waited for the results. Instead of putting images of the test results on my site, here is a link to them so you can explore along with me; I'll be discussing run #2 since it's the closest to the average. To start, let's look at the waterfall view. On top it provides a nice little summary of the most important information, and from there we see that the load time is 4.5 seconds. Just for a point of reference, we can check Alexa to see that Google has an average load time of 0.7 seconds and Yahoo!'s average is 2.6. While my site is quite a bit slower, that is somewhat expected when you consider the amount of time and manpower both companies are able to dedicate to optimizing even the smallest of things.

In looking at the waterfall view there are a few important things to point out: 28 requests are made, most of which are images from cs.ucsb.edu (which is where I was hosting the images for my blog). Since this was the first time the page was being loaded, all of the images/CSS/scripts needed to be fetched, so this is in effect paying the full penalty since you have to get everything. Ideally, since most of that content is static, we could utilize the browser cache so that those objects don't need to be fetched again. We can then take a look at the Repeat View, which is the second time the page is loaded; if caching is properly utilized we should see that a lot of the static content is not being reloaded since it is in the browser's cache. In the repeat view we notice that the load time has gone down to 2.4 seconds, which is a nice improvement, but that there are yellow lines on most of the requests to cs.ucsb.edu for images. Below in the request details it shows that all of those requests are returning a 304 status, which is basically the web server saying "you already have the most up-to-date version of the object, so use the one you have". While that is definitely faster than before, it is important to notice that over half of the time is wasted making those requests. If you head over to the Performance Overview you notice that all of those images have little x's marked under Cache Static. What this means is that the server is not setting expire headers on those images. Without an expire header the browser can't tell if the images are still valid when reloading the page, so it still makes those requests just to check if they are up to date, only to get 304'd.
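For what it's worth, if you do control the web server, this is the sort of Apache mod_expires configuration that fixes it (a sketch; adjust the types and lifetimes to taste):

# requires mod_expires to be enabled
ExpiresActive On
ExpiresByType image/png  "access plus 1 month"
ExpiresByType image/jpeg "access plus 1 month"
ExpiresByType text/css   "access plus 1 week"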

Another observation is that even though there are a lot of requests being made, they are not all happening in parallel. Ideally, once the CSS was fetched the browser would know that there are 10 more images to get and start downloading them in parallel. Unfortunately that doesn't happen. The reason is that most browsers are configured to only allow two parallel connections to a given hostname, so even after the CSS has been fetched and the browser knows it needs to fetch 10 more images from cs.ucsb.edu, it can only grab two at a time. That is a bummer since the images have already been optimized (using smush.it) so they can't get much smaller. There are solutions for this problem which we will discuss below.

Now that we know all of this, it is time to move to phase two, which is actually improving the performance. Since I have no control over the server which was hosting the images, I could not configure it to use expire headers, so a different solution was needed. While I could host the images on a CDN, they are generally pricey and it's not something I'm interested in paying for on a relatively low-traffic blog. Instead, I chose to use Picasa, Google's image hosting service. A few of the reasons for choosing Picasa: 1) it's free, 2) it's fast, 3) I could just log in with my Google account instead of needing to create another account somewhere (e.g. Flickr). I grabbed all of the images my site used and uploaded them to Picasa in a snap. After that I just had to switch all of the image references from the old URLs to the new Picasa URLs. After doing that I reran the performance test and here are the results. Right off the bat we see the load time has gone down to 3.35 seconds on the first load, which is over a 1 second improvement from the original 4.5 second load time. It's important to note that this is not a result of expire headers, since those only come into play on the second page load; something else is happening here. Unlike in the previous case, where images were being downloaded from cs.ucsb.edu two at a time, we now see that many more images are being downloaded in parallel, as many as 10. If you read Yahoo!'s Best Practices site you will remember the "split components across domains" recommendation, which is what is improving the performance here. Previously we found that only two connections can be made to a hostname at any given time, but it's important to notice that host1.website.com is a different hostname than host2.website.com. That fact can be exploited to allow more than two parallel downloads to occur for a given site. Taking a look at the waterfall chart we see that the images are now spread across lh3.ggpht.com, lh4.ggpht.com, and lh5.ggpht.com (the URLs for different Picasa servers). Since the images are spread across these servers, the browser can make two connections to each of the three hostnames (six total) at any point in time. A relatively simple and harmless change, but it shaves a whole second off the page load time.

Now we can shift over to the results of the Repeat View to see how well the page performs with a hot cache. The summary shows that the load time dropped to 1.46 seconds, down from 2.4 seconds previously. This is where the expire headers kick in. Looking at the waterfall chart we see that only 4 requests are made, because the rest of the resources have been cached, in particular the images. Since the expire header is set there is no longer any need to make the image requests (two at a time) only to get a 304 back from the server, which wastes both time and resources. Instead those images are taken from the browser cache, which is much faster, and as a result the load time is cut down quite a bit.

While the above description was quite long, it took only two simple changes to cut page load time by 25% and 40% in the first and second load cases. I didn't have to shell out a bunch of money on a CDN or pay some expensive consultant to do it for me; it was all free and didn't take much time at all. While these two optimizations are by no means the only ones available, they are usually easy to implement and can give you big gains. I could continue to improve things via more advanced or cumbersome techniques, but there are likely to be diminishing returns in terms of performance, maintainability, or cost, which makes me less inclined to do it.

What's the moral of the story? There are easy ways to make your website faster, so do them and make your users/viewers happier (one second at a time).


Browser Knowledge

Just when I start to think that our culture has finally become as tech-obsessed as I am, I get a reminder that it is not, and I can breathe a little easier. What the hell am I talking about? This interesting little video put together by a Google intern asking people what a "browser" is.

Alright, so clearly some of them are confusing a web browser (e.g. IE/Firefox) with a file browser (e.g. Explorer on Windows, Nautilus on Linux), an admirable mistake. The fact that on Windows there is Explorer and Internet Explorer, both of which do very different things, doesn't help the cause very much either. As for why so many people think that Google/Yahoo is their browser, I don't really know what to make of that. I guess since that is the first thing they see when they open their browser, they assume that is what their browser is, as opposed to the icon/name that they clicked on to open it... maybe... who knows.

Perhaps my favorite part of the video is the guy who proudly says he uses Firefox, and then when asked why he admits that a friend removed (probably from the desktop) all of his other search engines and said to use Firefox. What a good friend.

Why Bing?

At D7 today Steve Ballmer announced that Microsoft is going to be releasing a new search engine called Bing!. Apparently when asked why the name Bing!, Ballmer gave a few pretty generic reasons: it's short, easy to say, and works globally. Sure, that's all well and good, but it really doesn't explain why they picked that name.

Had you asked me I would have said, "Because Bing! is the sound you make when you know the answer to the question, and that's exactly what Bing! does."


Arrogance


Typical IE arrogance: they don't see Firefox as a threat? Sure.

Is reporting outages really news worthy?

If you are foolish enough to read as much tech news as I do, you probably heard about Google having some issues today. Reports said that search, GMail, Blogger, and a few other services were hit for about an hour [official announcement here]. That's right, reports said. As in, people reported on this. If you didn't notice, every single word in that sentence was actually a link to a different article reporting on the downtime. If I were really determined I would have done that with this entire article; I'm sure it's possible, I am simply unwilling to copy-paste that much to make my point. My favorite part of most of these articles is that they all seem to contain a phrase similar to "We can access Google just fine, but people on Twitter are reporting issues." They are absolutely right; when I first heard about this I did a Twitter search for everyone's new favorite #googlefail and sure enough tweets were popping up faster than I could glaze my eyes over them. Literally within 5 seconds of doing a search, it already said that 80 more people had tweeted with #googlefail in them. Here is my question: why is it that no bloggers seemed to be affected, but everyone and their grandmother on Twitter was going berserk? Coincidence?

Not being able to access websites is a tricky issue; it's often not very clear what the problem is. Is it my computer? Is it my router? Is it [favorite website of choice]? Or is it somewhere in the middle? That's why I actually see value in websites like http://downforeveryoneorjustme.com/; there are simply lots of times when you have no idea. There is a great quote by a brilliant man named Leslie Lamport which says:

A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.
While I agree with the quote, I think I am going to have to modify it slightly for this occasion:

A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable for doing any real work, and as a result your only option is to complain about it on Twitter.
Honestly, they put solitaire on your computer for a reason.

Kindle For Kids

Word is out that Amazon is planning on releasing a larger version of the Kindle on Wednesday. The new device is supposed to be geared towards magazines and textbooks, which require a much larger screen than regular novels. If you've been here for a while you know that I have high hopes for bringing the Kindle into the textbook market. That industry (empire?) is desperately lacking in competition and could use someone to ruffle its feathers a bit.

Unfortunately, the more I see of Amazon's strategy with the Kindle, the more concerned I become. The current Kindle sells for $359 on Amazon and has a 6" screen; a few ticks on the abacus tells you that's just under $60/inch. Now they are coming out with one with a bigger screen? Don't they know people can't get bank loans these days? In all reality I'm sure they don't charge per inch of screen real estate, but they have held on to that price point like it was the only one in town. My uneducated guess is that they are going to bring in the new Kindle above the $359 price point and keep the Kindle 2 where it is. Let's just throw a number out there for good measure and say $429.

Once you get past its inflated price, it will be interesting to see how they deliver on features. People who intend to use this as a replacement for textbooks are much more demanding than those who are using it to read the latest Danielle Steel novel. As I previously harped on, search is critical since people are way too used to finding what they want very quickly. Beyond that, considering all the crazy sidebar notes/sticky notes/highlighting schemes people employ on their textbooks, having a replacement or improvement for that would be important.

For a good opinion on the priorities in Kindle features I think this fake conversation between Jeff Bezos and Steve Jobs captured an important point:

Jobs (laughing): Surf the Web? On an Etch-a-Sketch?
And then proceeded to sum up my opinion on "value added" features:
Bezos: The Web is a value-added feature.

Jobs: No features are value added. They're either features or they're not.
Update: Wow, it's actually more expensive than I had guessed; it rings in at a hefty $489. At least there is free shipping!

Design Question

Why is it that for blogs it's perfectly natural (and visually comfortable) to have the posts on the left or center of the page with the archive, tags, etc. on the right, whereas with RSS readers it's perfectly natural (and also visually comfortable) to have the list of feeds on the left with the posts on the right?

Or the more pressing question: why is it that if you were to switch either of the two around, it would seem ugly and uncomfortable?

Amazon-as-a-Platform(-Provider), Making Hadoop A Bunch Cheaper

I have to say that Amazon's announcement today of Amazon Elastic MapReduce caught me off guard quite a bit. Having previously worked on a project which used a reasonably sized (100 node) Hadoop cluster running on EC2, I am familiar with many of the pains of setting up and running Hadoop on EC2. The reason I found this announcement so surprising is that it demonstrates Amazon's willingness to provide even more middleware services. Many of the other AWS services like S3, SQS, and SDB provide very fundamental services, namely storage and queuing. I'm not saying that everyone on AWS uses all three services, but rather that just about everything running on EC2 requires some form of storage and some form of queuing. Thus if Amazon can provide the services that just about everyone needs, there is a decent chance that Amazon can get some of those people to use their services, either as a result of convenience or cost. MapReduce is definitely quite popular right now, and for good reason, but it is a far cry from being as fundamental as something like storage.
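
For anyone who hasn't touched Hadoop before, it helps to see how small a job can be. Hadoop's Streaming mode lets the mapper and reducer be plain scripts that read stdin and write stdout, so a word count in Ruby is just a couple of files; here is a minimal sketch (the file names are arbitrary, and the cluster/EMR wiring is left out):

# mapper.rb -- Hadoop Streaming mapper: emit "word<TAB>1" for every word.
STDIN.each_line do |line|
  line.split.each { |word| puts "#{word.downcase}\t1" }
end

# reducer.rb -- Hadoop Streaming reducer: input arrives sorted by key,
# so we can sum the counts for each word as we stream through it.
current_word = nil
count = 0
STDIN.each_line do |line|
  word, value = line.chomp.split("\t")
  if word == current_word
    count += value.to_i
  else
    puts "#{current_word}\t#{count}" if current_word
    current_word = word
    count = value.to_i
  end
end
puts "#{current_word}\t#{count}" if current_word

The pain isn't in writing these scripts; it's in standing up and feeding the cluster that runs them, which is exactly the part Amazon is now offering to handle.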

The point of that whole discussion is this: Amazon just made their first true Platform-as-a-Service (PaaS) offering. I know Amazon is the Infrastructure-as-a-Service (IaaS) provider, but Hadoop is absolutely a platform in the same way that Rails or Django is a platform. Sure, it doesn't serve out websites (unless you count the JobTracker), but that is in no way part of the definition of what makes a platform. This is a particularly interesting move because it opens up the possibility of Amazon providing more platform-layer services (Rails anyone?) and encroaching on the space currently occupied by the Google App Engines and Herokus of the world. One might ask why Amazon would venture into already occupied territory, why compete with people already providing those services. It's simple: what is likely the most common use of Amazon EC2? If you guessed hosting websites you are correct, and if you guessed hosting Rails websites you get a bonus point. Since that is the case, it is pretty clear that EC2 is providing things that Heroku is not, whether it be flexibility, cost, or otherwise. So why not exploit that fact, make your customers happy, and make money from it as well?

The other thing that I found absolutely stunning about the Amazon announcement is the pricing. Take a look at this:

Go ahead, do a double take if you have to, and make sure you got those decimal places right. Yeah, so running a job using Elastic MapReduce (EMR?) is effectively 15% of what it would cost to run it yourself on EC2. Ridiculous. To be honest it does not make any sense to me that they would be able to offer such a discounted price for a service that gives you the exact same machine as what you would get for $0.10/hour. I am going to have to think about that one for a while.

Either way, that made something that was already dirt cheap ($0.10/hour) into something even cheaper than dirt ($0.015/hour), and I am very excited about the prospects and implications. As stated on CNET "Bring your datamining to us".

Update: Dave has correctly pointed out that I missed a very big sentence in the Elastic MapReduce description which is "Amazon Elastic MapReduce pricing is in addition to normal Amazon EC2 and Amazon S3 pricing." That makes quite a bit more sense.

Even still, the 15% premium is a tiny price to pay to not have to deal with bringing up and tearing down servers all the time, along with the headache of actually getting the thing set up.
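
To put that in concrete terms, here is the back-of-the-envelope math for a hypothetical job (the node count and runtime are made up; the per-hour rates are the small-instance figures quoted above):

# Back-of-the-envelope cost for a hypothetical 100-node, 10-hour job.
# $0.10/hour is the EC2 small-instance rate; $0.015/hour is the EMR
# surcharge on top of it (both figures discussed above).
nodes         = 100
hours         = 10
ec2_rate      = 0.10   # dollars per instance-hour
emr_surcharge = 0.015  # dollars per instance-hour, in addition to EC2

ec2_only = nodes * hours * ec2_rate
with_emr = nodes * hours * (ec2_rate + emr_surcharge)

puts "Roll-your-own Hadoop on EC2: $#{format('%.2f', ec2_only)}"    # $100.00
puts "Elastic MapReduce:           $#{format('%.2f', with_emr)}"    # $115.00
puts "Premium: #{((with_emr / ec2_only - 1) * 100).round}%"         # 15%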



Are Macs Really More Expensive?
A Practical Study

Microsoft recently started a new advertising campaign which truly speaks to the consumer's wallet in these troubled economic times. Simply put, the advertisement (which can be seen here) states that PCs are cheaper than Macs, and what people are really concerned about these days is price. This ad is very much in line with the recent comments made by Microsoft CEO Steve Ballmer, who stated that Mac users are "paying $500 more to get a[n Apple] logo on it". This however is not groundbreaking news; the fact that Macs are more expensive (commonly known as the “Apple Tax”) is one of the most common retorts from the PC people in the Mac vs. PC debate. This new advertisement has definitely sparked up the debate quite a bit, with articles being featured on Slashdot and Gizmodo, among many other tech sites.

I have found myself in the Mac vs. PC debate quite a few times, generally when a friend asks for advice on buying a new laptop. In general the argument goes as follows: “PCs are cheaper, and the baseline specs are going to be higher, but Macs are known to be more resource efficient so the higher specs don't necessarily mean a faster machine.” The conversation then generally returns to the issue of price, where it is pointed out that a decent low-end Dell or HP laptop can be purchased for $700-$800 whereas with Macs the low end is $1000-1100. The fun really begins when it is pointed out that while Macs are more expensive, they generally have a much higher resale value. While this is generally true, it was never clear to me how much higher the resale value actually was, and whether that higher resale value actually covers the higher initial cost. That is why I chose to do a little research on the topic to answer the question: are Macs really more expensive than PCs?


The Experiment

In order to determine which laptops are more expensive, I would first find popular laptops which were released approximately three years ago, determine the resale value of these laptops today, then determine the “adjusted retail price,” which is the initial sale price minus the resale value three years later. The decision to pick laptops which are three years old is twofold: first, three years is generally the time frame at which many people begin selling their old laptops and looking for new ones. Also, it was about three years ago that Apple switched from PowerPC to Intel. Based on these two factors, three years seemed to be the ideal age for the laptops in question.

Once that time frame was decided upon, the next task was to find popular laptops which were released around that time. After a little bit of hunting around I found the following laptops:

• Dell Inspiron 1501 ($649), Released 11/2006
• Dell Inspiron e1405 ($779), Released 9/2006
• Apple MacBook Early 2006 ($1099), Released 5/2006
• Apple MacBook Late 2006 ($1099), Released 11/2006
• HP Compaq nc6400 ($1119), Released 9/2006
• Lenovo Thinkpad T60 ($1399), Released 11/2006

These laptops were all released between two and a half and three years ago and are some of the most popular models offered by the given manufacturer at that time.

Now that the laptops were picked, their approximate resale value needed to be determined. While there are services which will tell you the approximate value of a laptop, I'm not particularly confident in their accuracy. Instead, I wanted to see how much money these laptops would fetch if they were being sold by your average Joe. What better place to look than everyone's favorite auction site, eBay. Using eBay I located successfully completed auctions for each model listed above. When picking completed auctions I set a few ground rules: the laptop had to be in working order with no large defects (as described by the seller), it could not be refurbished, and it could not have a warranty. Considering laptops which violated any of these conditions would have significantly increased the variance in the prices (consider a “for parts” laptop vs. a working laptop with a 2 year warranty).

I chose to sample three auctions for each of the laptop models to try to avoid any oddities in the sale price of a particular laptop. While this is a tiny sample size, it was sufficient to get a reasonable estimate of prices, particularly since the sample pool was fairly constrained. It should be noted that some of the auctions for a given laptop model had slightly varying specs (e.g. 60GB vs. 80GB hard drive or 1GB vs. 2GB of RAM), but most were minor modifications likely done by the manufacturer; I avoided any laptops with significantly improved specs (e.g. 4GB RAM). While these improvements were likely paid for by the original buyer, determining how much they cost at the time of purchase is an exercise I am unwilling to subject myself to. As a result, each laptop is considered to cost its base price, and thus the percentage of cost recouped upon resale is an upper bound.


Results

The full results of this experiment have been recorded in this Excel spreadsheet; they are summarized in the figure below.

Figure 1

Figure 1 displays the initial sale price of each laptop along with its adjusted price, which is simply its initial price minus the resale price. The adjusted price is in effect its “actual” cost, assuming the laptop is resold in working condition three years after it was purchased. From this figure it is clear that while the Dell Inspirons are sold at a considerably lower price point than the Macs, the adjusted prices of the two models are very similar, particularly in the case of the e1405, which has an Intel Core 2 Duo processor (as opposed to the AMD Turion in the 1501). This indicates that while the Macs are in fact $200-300 more expensive than the Dells, that extra cost is recouped upon resale as a result of the Macs' higher resale value.
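
To make the arithmetic concrete, here is a small sketch of the adjusted-price calculation. The list prices come from the table above; the resale values are rounded placeholders consistent with the percentages reported below, not the exact auction data from the spreadsheet:

# Sketch of the adjusted-price arithmetic. List prices are from the laptop
# list above; resale values are illustrative placeholders, not auction data.
laptops = {
  'Apple MacBook (Late 2006)' => { :list => 1099, :resale => 565 },
  'Dell Inspiron e1405'       => { :list => 779,  :resale => 280 },
  'Lenovo Thinkpad T60'       => { :list => 1399, :resale => 393 },
}

laptops.each do |name, price|
  adjusted = price[:list] - price[:resale]
  recouped = 100.0 * price[:resale] / price[:list]
  puts format('%-27s list $%4d  adjusted $%4d  recouped %2.0f%%',
              name, price[:list], adjusted, recouped)
end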

Next is the comparison between the Macs and the pricier PCs, namely the HP and Lenovo models. Both of these models had price points above that of the baseline MacBook, with the Lenovo fetching a pricey $1,399. From these results it is clear that the PCs which do cost more than their Mac counterparts do not recoup nearly as much of their value upon resale, and they end up with adjusted prices significantly higher than the Macs'.

Figure 2

This is illustrated more clearly in Figure 2, which displays what percentage of the initial sale price was recouped upon resale (higher is better). Three years after being purchased, Macs recoup an impressive 51-52% of their initial sale price while the Dells net around 36%. The more expensive HP and Lenovo recoup a measly 26-28% of their high initial prices. In the case of the Lenovo, its $1006 adjusted price is nearly twice that of the Macs and Dells. In fact, while the HP and Lenovo cost nearly double the price of the Dell e1405, they only receive an additional $20 and $120 in resale value.


Conclusions

After seeing these results I am pretty convinced that Macs are not more expensive than PCs. In fact, Macs are very much comparable price-wise to the much cheaper Dells when resale price is factored in, and Macs are much cheaper than the more expensive models sold by HP and Lenovo, which do not recoup nearly as much of their high cost upon being resold.

While I’m sure this will not put an end to the Mac vs. PC argument, the old claim that Macs are more expensive simply does not hold up, especially when compared against higher end PCs.


A special thank you goes out to Bryce Boe, who proofread this post for me.

Blue Steel And Fun With Colors

Welcome to my new and improved blog now known as Regular Expression. It is now being hosted on the domain http://regexprn.com but http://jmkupferman.blogspot.com should continue to work (including RSS).

After a short amount of time coming up with a new name, a long time looking for domain names, a relatively short time making the design, and a really long time trying to make it into a Blogger theme, it has finally made it. It's incredible how the "easy" parts take the longest time while the more difficult parts just seem to work themselves out.

When I was working on the design I got myself into a bit of a snafu which led to a somewhat interesting journey, the result of which is the design you are looking at. I liked the name Regular Expression because, like the pun, it's pretty self-explanatory, and I was able to snatch up a nice short domain name to host it on. Once I had the domain name it was time to get going on the design. I had some ideas floating around in my head; in particular I wanted a large text logo with regular expressions in the background. I recalled seeing a tutorial that had background text and a logo which was made up of the background text (available here). After a bunch of tweaking and changing I ended up with this:
Unfortunately I was unable to get reasonably smooth looking text, so it ended up looking more like I had drawn the logo on a fabric softener sheet and thrown it in the wash. I decided to abandon trying to write the logo out of the background text and just wrote it out normally, which looked much cleaner and was easier to read. I then went ahead and added a little navigation bar, which in this case is more of a link bar, but it looked nice and I was happy with the little bit of transparency.
Since it's just a blog, all you really need is a header and a color theme and you're all done, right? Sure, it's just a matter of adjusting the template to fit the color scheme and you should be on your way. This is where things got a bit tricky. First, when trying to modify the template I couldn't get the links to work. It was incredibly bizarre since they worked in the simple HTML site I made, and as far as I could tell Blogger's template engine wasn't doing anything funny. I just played around with the HTML using Blogger's template preview feature to try to nail down the problem (insert a few hours of frustration and relearning the loveliness that is floats). Long story short, Blogger's template preview has a bit of an issue: it would render the links correctly, but they simply weren't clickable. It wasn't until I actually submitted a template (and made my blog ugly for a few minutes) that I figured out that the preview itself was the issue. How frustrating.

I finally got the template looking alright, so next it was time to figure out colors. What I very quickly figured out is that the black-grey-red combination I had was going to be difficult to work with. The accent color (red) was particularly tricky since it's very difficult to find a shade of red that is readable for any decent sized line of text, say a title. While I could have used it in small spots like an underline or the numbers in the sidebar, it was going to make finding other colors for text, titles, and links quite a bit more difficult. After looking around for a while to see if I could find a few ideas on what would work, I figured I would switch things up and try a lighter design.

I pretty quickly arrived at something which I thought looked pretty nice, still kept the theme, and used red in a readable way. I was pretty surprised how quickly it came together and how nice it looked, and then I remembered why: it was a design I had definitely seen before. Apparently my subconscious remembers websites incredibly well, even down to changing the red links to black with an underline on hover. That's no fun. So back to the drawing board I went.

I was playing around with templates and ended up with my light header on top and my old (dark) blog design on the bottom, and said hmmm. I liked the blue that I was previously using, so I went ahead and changed the red in the logo to a nice cool blue, switched up the link colors, and was on my way. The only problem left was the header: it was dark while the main background was white. I tried changing it back to the white one, but it wasn't working. Instead I kept the black and just eased the transition between the two a little more. The first way was to just add a little fade below the nav bar, pretty simple. The second way is a little bit trickier and I actually stumbled upon it by accident, but I liked it. I'll give you a hint: while the nav bar is somewhat transparent, it darkens the background as opposed to lightening it.

After all of the messing around I finally ended up with the page you are looking at.
What's the moral of the story? Start with a diverse color palette (kuler is a nice tool for that, thanks Mike) and then let the design come from that. It's a lot easier to not use a color in a palette than it is to add an additional one. It will also likely steer you away from having a very plain black and white website, but don't go overboard.

iPhone OS 3.0 Quibbles

The announcement a few days ago that iPhone OS 3.0 will be previewed in the coming week generated quite a bit of buzz among tech news sites. With all sorts of excitement and nothing to really talk about, it seems like the conversations on these matters inevitably turn into "comment on what you think it will be" type affairs. Interested in what others think might be on the horizon for next week's preview, I took a look at some of those comments and what did I see...

"Cut and Paste, please?"
"come on MMS and copy/paste..."
"unless this adds copy and paste, mms etc this is going..."


Really? That's what people are looking for? MMS...and copy & paste...really.
MMS just seems like a huge waste to me; if you thought texting was a ripoff, you haven't seen MMS charges. Furthermore, why not just use email? iPhones can open email attachments, particularly images. I'm not quite sure what people find so incredibly attractive about this, but perhaps that's just me.
The title of this post is iPhone OS 3.0 Quibbles, emphasis on quibbles, because that is exactly what copy and paste is. I have tried my hardest to think of cases where I would have used copy & paste, and I would have to say that the number of instances is less than a handful. In particular, I remember one time when someone sent me an email which contained their phone number; unfortunately the iPhone did not recognize that it was in fact a phone number (it usually does, but that feature is somewhat spotty) and didn't provide the option of tapping a link to dial it. So I had to remember a phone number for a whole three seconds while I entered it into the phone, no biggie. I have a sneaking suspicion that the people who are really excited about copy & paste are frequently typing "OMGOMGOMGOMGOMGOMGOMGOMGOMGOMG!". If that is the case, then please, no copy & paste, there is plenty of that already.

So the natural question to ask is: if these features are so underwhelming, which ones should I be hoping for or expecting? A landscape keyboard is a popular choice; I don't think you can really argue against that one, and who knows why it isn't done yet.
Ultimately, one of the most important features that people don't mention enough is lovingly known as Push. It was supposed to launch in September 2008 (Apple's words, not mine), and yet there has hardly been a peep about it since. No mention of it, no explanation, nothing. This is somewhat unfortunate since the addition of Push is likely one of the biggest changes we will see in the iPhone. The classic example is apps like AIM, which could use push to notify users of new messages without the app having to be open. AOL has gotten so impatient that they chose to hack their way around this by offering to send users texts when they receive a new message, also allowing them to remain logged into AIM for up to 24 hours after they exit the app. Email is obviously another no-brainer; there is no reason to waste battery life polling every X minutes, push is clearly the right way to do it.
There are tons of other apps out there which could use Push to improve the user experience. How about RSS feed readers? Sports scores? I wouldn't mind being notified every time the Lakers are playing. I'm sure twitterers wouldn't mind being notified upon getting tweeted at (that can't be right).

So while people continue to hope for the improvements on the minor quibbles they have, I'm holding out for the big improvements which can change the game.