The Low Down On Cloud Brokers

The notion of a Cloud Broker is a new and interesting topic, and I believe it is something that will be talked about more and more in the near future. If you don't believe me, the folks over at Gartner have discussed the importance of this role in the cloud computing space:

"The future of cloud computing will be permeated with the notion of brokers negotiating relationships between providers of cloud services and the service customers. [...] Enhancement will include managing access to these services, providing greater security or even creating completely new services." -- Frank Kenney

Given that they are going to prove to be important, it's probably worthwhile to take the time to discuss what it is that these cloud brokers actually do. In my eyes, cloud brokers are an abstraction layer between the end user and the many cloud services at their disposal. While this notion could be applied in many different contexts, when I say cloud services I am referring to Infrastructure-as-a-Service providers (e.g. Amazon, Rackspace, GoGrid, Joyent). With a plethora of cloud providers, each with their own API/set of services/pricing model/etc., it would be quite cumbersome for the end user to programmatically access each service. Instead, the cloud broker creates a layer of abstraction between the user and the providers so that the end user sees one cohesive view of all of the services. This way the customer doesn't have to worry about the nitty-gritty like the different REST calls required to create a server; they just hit the launch button and a server appears on the desired cloud. This is roughly the state of cloud brokering today from companies like RightScale, CloudKick, and Elastra (give or take a little bit). Yes, there is some work left for these vendors to do to support more clouds, but it's only a matter of time before they have most of their bases covered.

While that is the current state of affairs, the future of cloud brokers is a lot murkier (at least in my mind). I think one of the big ideas many people have for cloud brokers is that they will become intelligent tools: the user tells the broker what they want, and it figures out the best way to deliver it. So if I need a server and I need it now (at whatever cost), it will pick the appropriate cloud. Similarly, if I just want a server at some point this week but I want it cheap, it will wait around until it finds a good deal and then start up that server for me.

On the surface that seems all hunky-dory, but in practice there are two really big issues with delivering such a vision. The first is transparency. Any time you are building an intelligent system that makes decisions on behalf of the user, it is absolutely critical that the decision-making process be completely transparent to the end user. There are a lot of contexts where transparency is not terribly important (e.g. search results), but the second a user's credit card is being charged the game changes in a big way. It is almost guaranteed that at some point you will get angry customers calling in and asking why your system made decision X when they expected decision Y. This is not the time to start explaining the nuances of your cost-based least-squares weighted estimation algorithmic ridiculousness. You might even try to be tricky and come up with some metric that amalgamates a ton of different factors into one nice number (e.g. "Our system computed a doodle factor of 8.3 and as a result launched a new server"). While having that one number to point at is a nice trick, it generally does little to hide the formula used to compute it, which inevitably becomes filled with unexplainable magic numbers. I believe the result of this is going to be extremely simple systems which, instead of trying to make the complex decisions, do their best to present the user with as much relevant information as possible and let them decide. In the case where the user wants a server very quickly, one could imagine a system presenting the current cost and average boot time of a server on each of the clouds and letting the user choose. The money vs. speed trade-off is a complex decision, and ultimately only the user really knows how much of one they are willing to trade for the other.

The other major issue I see with these systems is that without variable pricing the utility of the intelligent system decreases greatly. As far as I know there are no cloud vendors out there changing their prices each day/hour/minute to account for changes in supply and demand. While that is an interesting option, and one any economist would be proud of, I think vendors have moved away from this model because of the lack of transparency it creates. What was once brain-dead simple at 10 cents an hour now becomes a decision requiring a lot more consideration. How much does it cost now? Do I really need it now? If I wait, will it get cheaper? And so on. Eventually I think someone will try something along these lines, and while I'm not a huge fan of the idea, I wish them the best of luck. Getting back to the original point: until we start seeing clouds employ truly dynamic pricing, the idea of having an intelligent system make decisions for me is a lot less compelling.

Ultimately I think the cloud broker space is one that is going to grow into many areas and serve many niche markets. As it continues to grow, though, don't be surprised as the players keep a pole's length of distance between themselves and the all-knowing sentient system of doom.

Moving from SVN to Git - a user's perspective

Just about six months ago I took a grad class on Scalable Web Services. The main portion of the class was to 1) come up with a cool app idea, 2) develop it using Ruby on Rails, and 3) make it scale. Very cool class; I enjoyed it quite a bit and it definitely sparked my interest in Rails and web performance, both of which I have learned a lot more about since then. After the class there was some discussion about the technologies we used, and one thing most of us agreed on was that git was frustrating and difficult to use. The good news is that in the months since then I have used git quite a bit more, and as a result I am really starting to like it. So much so that if I were to start a project today I would use git without a doubt.

Given that I pulled a full 180 on git, I took a step back and thought about what had changed to turn my opinion around. The conclusion I came to was that the main difference was my mental model of git. Let me explain...

I would think I am similar to a lot of folks in that, up to that point, I had used SVN a decent amount: a few projects here and there along with some personal projects/data. As a result, my mental model of version control was pretty simple: check out a repo, then 1) update, 2) make your changes, 3) commit, go back to 1). It's pretty simple and I suspect most people who have used SVN would agree. Given this mental model, when I started using git I really thought of it as SVN with another step: you do steps 1-3 above and add 4) push to remote to the list. Like I said, same as SVN just with an additional step. Anyone familiar with git will tell you that if you are using that model you are missing out on some of git's best and most powerful features, namely branching. While SVN provides the ability to make branches, I never really got the sense that they are first-class objects in the way they are in git. To get a sense for why branches matter, here is my workflow for a standard task that I am working on:

1. create branch
2. write failing tests
3. write code to make tests pass
4. commit
5. go back to 2 until the task is complete
6. merge branch back into master (smushing commits)

Anyone who is familiar with TDD will recognize steps 2 & 3 as the standard red-green loop, but there are two other important things to note. If you are working in a tight red-green loop (which you should be), then step 4 is going to result in a lot of commits which, although the tests pass, are probably not something you want committed into the record books forever and ever. In the standard SVN model you could either make all of these commits and spam the commit log (and the merging, ugh), or alternatively just not commit. The former is definitely undesirable, and so is the latter, since it's important to be able to go back to the last working state. The nice thing is that git gives you both: multiple commits you can revert to, without spamming up the commit log. In step 6, when we merge back into the master branch, we can either rebase (which moves all the commits over) or, my preference, merge without committing (smushing the commits). This takes all the commits you have made, smushes them together, and lets you review what you have changed before you actually commit it into the master branch.

For those of you wondering what I'm actually doing in step 6: assuming your branch is called "featureX" and has all the commits on it, you would check out the master branch (git checkout master) and then do the merge with "git merge --no-ff --no-commit featureX". At that point all the changes you made on the featureX branch are staged and ready to be committed, after you review them of course. It also helps that Macs have GitX, which is one of the nicest diff tools I have seen to date.
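To put it all together, here is a minimal sketch of that workflow end to end. The branch name and commit messages are just placeholders:

    # start a branch for the task
    git checkout -b featureX

    # write a failing test, make it pass, then commit; repeat as needed
    git add .
    git commit -m "red-green: handle empty input"

    # when the task is done, go back to master and merge without committing
    git checkout master
    git merge --no-ff --no-commit featureX

    # review the staged changes (GitX, or git diff --cached), then commit
    git diff --cached
    git commit -m "Add featureX"

One note: if the goal is to keep the individual red-green commits out of master's history entirely, "git merge --squash featureX" is the variant that collapses them into a single commit; the --no-ff --no-commit form above records a merge commit that still references the branch, but it gives you the same review-before-commit step.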

One other awesome feature of Git that I use all the time (read: way too much) is stash. How many times have you made a change that you didn't want to commit but wanted to keep around in case you needed it later? I seem to need this all the time, whether it's a simple config change or some silly snippet I want to try out. If I want to keep that code around without having to commit it, I can just do "git stash save some silly snippet" or, more concisely, "git stash" (the former adds a label to it). Then any time something like that comes up you can just stash it away for later use.
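A quick sketch of how that plays out in practice (the label is just a placeholder):

    # squirrel away the current uncommitted changes with a label
    git stash save "some silly snippet"

    # see everything you have stashed so far
    git stash list

    # bring the most recent stash back when you want it
    git stash pop     # or: git stash apply, to also keep it in the stash list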

So if you've been scared to make the switch from SVN to git, don't be. It won't bite.

Linux tools that make you go zoom

Over the past few years I have used Linux almost exclusively for programming, and over that time I've gotten comfortable with a lot of the standard utilities. Want to find a file? Use find. Want to find a regexp in a file? Use grep. It almost seems like a given. Until a little while ago I didn't even question whether there were alternatives, since these are pretty much the standard go-to tools and have been around forever. More recently I have begun working on larger source trees that I am less familiar with, and as a result I find myself using these tools a lot more frequently. Whereas previously waiting a bit was acceptable, it quickly became a drag, and I went looking for better alternatives.

I know it seems like a bit of heresy to replace the beloved grep, but honestly ack makes it worth it. To show you the difference, I'll do a quick search for the string "foobar" in a source tree that's ~500MB and ~9000 files.
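The comparison looked roughly like this (the timings are from my run on that tree; yours will vary):

    # recursive grep over the whole tree
    time grep -r "foobar" .      # over 40 seconds

    # the same search with ack
    time ack "foobar"            # under half a second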

Yeah. Seriously. The recursive grep on the directory took over 40 seconds to make it through, while ack took under half a second. Any time you get a two-orders-of-magnitude speedup and it's more convenient (fewer characters to type), I find it hard to argue against. The nicest thing about ack is that it automatically ignores all the standard version control directories (e.g. .svn), which usually clutter up normal grep output with unwanted matches. Of course, all of that is configurable. So how do you get this magic tool? It's simple:
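There are a few common ways to install it, depending on your platform (package names vary a bit):

    # Debian/Ubuntu (packaged as ack-grep; the binary may be named ack-grep too)
    sudo apt-get install ack-grep

    # Mac with Homebrew
    brew install ack

    # or straight from CPAN, since ack is a Perl tool
    sudo cpan App::Ack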

If you're a TextMate user, you should be happy to know that there is an awesome bundle called Ack In Project which lets you find text in files really quickly. I've been using it and found that the interface is great and much faster than the grep script I was previously using. The only downside to ack is that its default colors are pretty ugly, but that can easily be changed via environment variables.

The other thing I frequently need to do is find files on the file system. It's always something like "where is the config file for X again?" Since those things are frequently in the most bizarre of places, I usually end up searching the entire file system, which takes forever. Take a look...
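Roughly, the comparison looks like this (the file name is just a made-up example; the timings are from my run):

    # brute-force walk of the whole file system with find
    time find / -name "smb.conf"     # 5+ minutes, plus a pile of "Permission denied" noise

    # the same lookup against locate's prebuilt database
    time locate smb.conf             # about a second and a half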

Yep, find takes a whole 5+ minutes to get through the file system, plus it spits out a bunch of permission denied messages which are really unnecessary in this context. On the other hand, locate speeds through and finishes in a second and a half. Nice. The secret sauce is that locate uses a database of the file system to find things much more quickly. Since it's a standard Linux utility and that database is being built anyway, why not take advantage of it? It should be noted that locate searches its database of the entire file system each time, while find can be pointed at a specific directory, so if you are searching a small folder find is usually faster while locate stays pretty constant.
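On most distributions that database is refreshed by a periodic updatedb job, so very recently created files may not show up until it runs again; you can force a refresh by hand:

    # rebuild the locate database right now (normally a daily cron job does this)
    sudo updatedb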

I'm a fan of the standard Linux utilities, but any time you can find tools that really speed up the things you do frequently, with little or no downside, it seems almost foolish not to take advantage of them. What are you waiting for?