Unknown Substitution Cipher Interview Question

You are given a file containing a list of strings (one per line). The strings are sorted and then encrypted using an unknown substitution cipher (e.g. a → c, b → r, c → d). How do you determine the mapping for the substitution cipher? The unencrypted strings can be in any language.

After a few questions the interviewer clarified and stated that the cipher was a simple substitution cipher where every letter in the alphabet is mapped to one and only one other character in the alphabet.

Given this problem, the simplest solution (not counting brute force) would be frequency analysis. If the strings contain words or phrases in a popular language, there is a reasonable shot that frequency analysis would at least get you started on figuring out the cipher. After discussing this and how it would work, the interviewer told me, "you do frequency analysis and cannot draw any conclusions, what do you do then?"

Ah, back to the (metaphorical) drawing board. The next big clue is the fact that the strings are sorted. This type of interview question is generally worded very specifically, so when what seems like an extraneous factor is "thrown in" it probably means something. In this case, the fact that the strings are sorted tells you something, in particular, the precedence of the letters. For example, given the following strings (which are sorted in ascending order)

ggcaa
ggpqr
grfzu

you can start to learn about the ordering of the encrypted letters. In the above example, we know that c < p from the first two strings and g < r from the second two strings. We can only use the first letter that differs between any pair of adjacent strings; everything after that position is inconsequential. Doing this over the entire file, you would end up with a set of relationships (e.g. c < p, g < r, q < z, z < t). The next questions are then "what can you do with this data and what is the most appropriate method for storing it?"
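The pairwise comparison described above can be sketched in a few lines of Python (the function name is mine, not from the interview):

```python
def extract_relations(lines):
    """Compare each adjacent pair of sorted strings and record the first
    position where they differ: that yields one ordering fact."""
    relations = set()
    for prev, curr in zip(lines, lines[1:]):
        for p_ch, c_ch in zip(prev, curr):
            if p_ch != c_ch:
                relations.add((p_ch, c_ch))  # p_ch sorts before c_ch
                break  # letters after the first difference tell us nothing
    return relations

# The three example strings from the post:
print(sorted(extract_relations(["ggcaa", "ggpqr", "grfzu"])))
# → [('c', 'p'), ('g', 'r')]
```

Note that only adjacent pairs need to be compared: if a < b and b < c are already recorded, comparing the first and third strings adds no new information.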

This is where things get tricky. After going through some of the common data structures it seemed like a tree, in particular a binary tree, might be the right fit. The problem with a binary tree is that you don't necessarily have all the required information to create the tree. For example, say the root of the tree was the letter Z and it had a left child Q. Now say you are trying to insert the letter G. You may know that G < Z so it needs to be in its left sub-tree, however you may not know what the relationship is between G and Q. In going through the list of strings, you are not guaranteed to have all relationships between all letters.

Without all the relationships you are forced to fall back to a graph structure. Each letter becomes a node in the graph and each less-than relationship becomes a directed edge (g < y would have an arrow pointing from node g to node y). With the graph populated with all the edges, now what do you do? The interviewer assured me that every node is part of the graph (i.e. there are no partitions).

This leads to one of the tricky parts about graphs. Since all nodes are equal, where do you start? There is no "root" node like there is in a tree. In this case you want to start with a node that has no incoming edges: this is the smallest letter. Such a node has to exist since there cannot be cycles in the graph. You can then delete that node (and its outgoing edges) from the graph and repeat the process until no nodes remain. The order in which the nodes are removed is the sorted order of the encrypted alphabet.
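The repeated remove-the-smallest-node process is a topological sort. A minimal sketch, assuming the relations have already been extracted from the file (names are mine):

```python
from collections import defaultdict

def recover_order(relations):
    """Repeatedly take a node with no incoming edges (the smallest
    remaining letter), remove it, and drop its outgoing edges."""
    out_edges = defaultdict(set)
    in_degree = defaultdict(int)
    letters = set()
    for small, big in relations:
        letters.update((small, big))
        if big not in out_edges[small]:  # ignore duplicate relations
            out_edges[small].add(big)
            in_degree[big] += 1
    order = []
    remaining = set(letters)
    while remaining:
        # find a node with no incoming edges -- an O(N) scan each pass
        node = next(n for n in remaining if in_degree[n] == 0)
        order.append(node)
        remaining.remove(node)
        for nxt in out_edges[node]:
            in_degree[nxt] -= 1
    return order

print(recover_order([("b", "a"), ("a", "c")]))  # → ['b', 'a', 'c']
```

If the relations do not pin down a total order (as in the short example set above), several nodes may have no incoming edges at the same time, and any of them is a valid next choice.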

Now for the fun part: what is the complexity of this algorithm? Finding the smallest node in the graph is N work for the first node, N-1 for the second node, and so on, which leaves you with N+(N-1)+(N-2)+...+1. This can also be written as the summation from 1 to N of i, which equals (N*(N+1))/2 (see here if you don't believe me). This means that the algorithm is O(N^2).


Enjoy.

NoSQL: The Paradox of Choice

I have been watching a lot of TED talks recently and while there are a ton of excellent talks, the one I want to talk about today is called "The Paradox of Choice" by Barry Schwartz (he has a book by the same title). While watching his talk it struck me how relevant the topic is to NoSQL databases. I want to talk about these connections and their implications.

More != Better

I want to start with an idea that many people know intuitively but is still worth calling out explicitly. Having more options is not necessarily a good thing; in fact, in many cases it's a bad thing. In his talk Schwartz cites a study which showed that for every additional 10 mutual funds offered by an employer, participation went down by 2%. As the number of options increases, the amount of time and effort required to make a decision increases significantly, to the point where you are put into a state of paralysis by the myriad of options.

For all intents and purposes, the birth of the NoSQL database came with the publication of the BigTable paper by Google and the Dynamo paper by Amazon circa 2006. Since then, the number of NoSQL databases has gone from two, to five, to 29. That's right, http://nosql-database.org/ currently lists 29 different NoSQL databases, each of which has a slightly different feature set and benefit/drawback trade-offs. Good luck picking the right one.

More Options => Higher Expectations

When many options exist, it is only natural to expect that one of them has to have the features you are looking for. With only one option it doesn't matter whether it has the features you want, since you don't have a choice. Prior to the NoSQL movement there was only one game in town and it was the relational database (MySQL/Postgres/MSSQL), so you had no choice but to grin and bear it. However, now that there are almost 30 options (there probably will be 30 by the time I'm done with this post), one of them has to have the right mix of peanut butter and chocolate to fit my tastebuds. Unfortunately, this rarely turns out to be the case.

Higher expectations apply not only to features, but also to performance. The promise of infinite scalability will draw a lot of eyes and ears, but in order to win them over you have to show users tangible benefits. I can't count how many blog posts I have read about people/companies who are using MySQL as a key-value store because it is faster than the key-value stores themselves. In order for NoSQL databases to win, users need to be able to benchmark the database and be impressed by its performance.

What have we learned from this? For all the hype it is getting, NoSQL is in a rough spot. The number of options is large and there are no clear winners in any category. A few of the databases seem to be rising to the top (Cassandra, for instance), but until enough people get behind a relatively small number of databases (I say pick 3), the knowledge, tool support, and amount of helpful information available online are going to be painfully low.