Unknown Substitution Cipher Interview Question

You are given a file containing a list of strings (one per line). The strings are sorted and then encrypted using an unknown substitution cipher (e.g. a < c, b < r, c < d). How do you determine what the mapping is for the substitution cipher? The unencrypted strings can be in any language.

After a few questions the interviewer clarified and stated that the cipher was a simple substitution cipher where every letter in the alphabet is mapped to one and only one other character in the alphabet.

Given this problem the simplest solution (not counting brute force) would be to do frequency analysis. If the strings contain words or phrases in a popular language there is a reasonable shot that frequency analysis would at least get you started figuring out the cipher. After discussing this and how it would work the interviewer told me "you do frequency analysis and cannot draw any conclusions, what do you do then?"

Ah, back to the (metaphorical) drawing board. The next big clue is the fact that the strings are sorted. This type of interview question is generally worded very specifically so when what seems like in extraneous factor is "thrown in" it probably means something. In this case the fact that the strings are sorted tell you something, in particular, precedence of the letters. For example, given the following strings (which are sorted in ascending order)

ggcaa
ggpqr
grfzu

you can start to learn about the ordering of the encrypted letters. For example, in the above example we know that c < p from the first two strings and g < r from the second two strings. We can only use the first letter that differs from any two pairs of strings, everything afterward is inconsequential. Doing this over the entire file you would get a set of relationships (e.g. c < p, g < r, q < z, z < t). The next questions are then "what can you do with this data and what is the most appropriate method for storing it?"

This is where things get tricky. After going through some of the common data structures it seemed like a tree, in particular a binary tree, might be the right fit. The problem with a binary tree is that you don't necessarily have all the required information to create the tree. For example, say the root of the tree was the letter Z and it had a left child Q. Now say you are trying to insert the letter G. You may know that G < Z so it needs to be in its left sub-tree, however you may not know what the relationship is between G and Q. In going through the list of strings, you are not guaranteed to have all relationships between all letters.

Without all the relationships you are forced to fall back to a graph structure. Each letter becomes a node in the graph and each greater than relationship becomes a directional edge in the graph (g < y would have an arrow pointing from node g to node y). With the graph populated with all the edges, now what do you do? The interviewer assured me that each node is part of the graph (i.e. there are no partitions).

This leads to one of the tricky parts about graphs. Since all nodes are equal, where do you start? There is no "root" node like there is in a tree. In this case you want to start with the node with no incoming edges, this is the smallest letter. Such a node has to exist since there cannot be cycles in the graph. You can then delete that node from the graph and then repeat this process until there are no more nodes in the graph.

Now for the fun part, what is the complexity of this algorithm? Finding the smallest node in the graph is N complexity for the first node, N-1 for the second node, and so on which leaves you with N+(N-1)+(N-2)+...+1. This is can also be written as the summation from 1 to N of i which equals (N*(N+1))/2 (see here if you don't believe me). This means that the algorithm is O(n^2).


Enjoy.

0 comments:

Post a Comment