Sunday, June 9, 2013

A little foray into induction and persistent data structures.

Part of the reason I am working on the C++ Data Structures and Algorithms project is to get back to the basics and master them.  Some recent setbacks have caused me to re-evaluate my knowledge of Computer Science fundamentals.  My two previous posts on C++11 and data structures are the start of an ongoing commitment from me in this regard.

Unfortunately, working at the hardware level has eroded a lot of my computer science fundamentals.  In the Linux kernel, for example, you almost exclusively work with linked lists.  The scheduler uses red-black trees for fairness, but I have never had to touch the scheduler.  And, surprisingly, there is nary an associative array (hash, dictionary, map, etc.) to be found unless you're working with the filesystem.  And in firmware, good luck.  You're pretty much stuck with arrays (used as stacks, queues or ring buffers) and maybe some binary search trees or a priority queue.  The only time in my 6.5-year career where I've extensively worked with more advanced data structures was in creating a dependency resolver for installing software for test automation.

While my Computer Science basics might have eroded, I see many otherwise smart people who lack curiosity.  It is my curiosity that drives me to learn new things.  I think curiosity is something you either have or don't.  Smart people might pick up a skill more quickly, but few, I think, have the drive to learn more and dig deeper.  I am constantly wanting to learn new skills, and perhaps that is my downfall, as it tends to scatter my attention.  That is why I am going to stick to getting these data structures and algorithms down.

I recently had two little confidence boosters.  The first was solving a problem while reading Python Algorithms by Magnus Lie Hetland.  At the end of a section on induction, it asked the reader to prove Euler's formula relating vertices, edges and faces by induction.  The equation is:

V - E + F = 2

V and E are the number of vertices and edges respectively, while F is the number of faces (or fields).  You can think of a field as a "space" captured by edges; the "universe" outside all the vertices and edges also counts as one.  Take for example this simple graph:

[Figure: two vertices joined by two edges, enclosing one inner field.]

It has V = 2, E = 2 and F = 2 (the "universe" field, and the field captured by the space partitioned by the edges/vertices).  And as you can see, Euler's equation holds (2 - 2 + 2 = 2).  But how do you prove that inductively?  The equation V - E + F = 2 doesn't look like your typical induction target since there are 3 variables!

Or are there?  If you increase the vertex count by one, you must also increase the edge count by one, since the new vertex has to be connected to the rest of the graph by a new edge.  Adding one vertex and one dangling edge does not increase the field count, because no new space gets enclosed.  Therefore we are adding one vertex but canceling it with a new edge:

V - E + F = (V+1) - (E+1) + F = 2



What if we add a new edge instead of a new vertex?  If a new edge is added between two existing vertices, it must close off a new field.  Again, we are subtracting an edge in the formula but adding a field:

V - E + F = V - (E+1) + (F+1) = 2


And since there is no way to add a field without adding an edge, we have all our bases covered.  (The base case is a single vertex with no edges: V = 1, E = 0, and F = 1 for the universe field, so 1 - 0 + 1 = 2.)  I was actually wracking my brain on this one for about 30 minutes and decided to let it rest.  I was cooking dinner when I just had that eureka moment and was like, "oh yeah, that's the answer".  I don't know of any engineer or scientist who has not had one of these Archimedean eureka moments.  It's kind of freaky when you think about it...the answer just comes out of nowhere.  While these "out of the blue" answers are ultimately a good thing, it sucks during job interviews when you can't think of an answer and it only comes to you later.
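
To convince myself, I threw together a little C++ sketch that mirrors the induction (the struct and its two "moves" are just my own names for the steps above, nothing standard).  Starting from the base case, every step leaves V - E + F pinned at 2:

#include <cassert>
#include <iostream>

// Counters for a connected planar graph. Base case: a single
// vertex, no edges, and the one "universe" field.
struct PlanarCount {
    int v = 1;
    int e = 0;
    int f = 1;

    int euler() const { return v - e + f; }

    // Step 1: a new vertex arrives on a new edge, so the +1 and
    // the -1 cancel and no field is created.
    void addVertexWithEdge() { ++v; ++e; }

    // Step 2: a new edge between existing vertices closes a
    // cycle, splitting one field into two.
    void addEdgeClosingField() { ++e; ++f; }
};

int main() {
    PlanarCount g;
    assert(g.euler() == 2);    // base case: 1 - 0 + 1 = 2

    g.addVertexWithEdge();     // V=2, E=1, F=1
    assert(g.euler() == 2);

    g.addEdgeClosingField();   // V=2, E=2, F=2: the example graph
    assert(g.euler() == 2);

    std::cout << "V - E + F = " << g.euler() << std::endl;
}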

My second confidence-boosting moment was when I realized one way to create a persistent tree structure.  Persistent data structures are more common than you may think.  Take, for example, a revision control system like git.  You can envision your code edits as nodes in a tree.  When you create a new branch, you aren't just creating a "branch on a tree".  What you are really doing is replicating the old tree and sticking a new branch on the new tree.  Why do that?

If you mutate your original tree, how do you go back?  What if you delete a branch far up the tree, but later want a twig that was hanging off that branch?  If you have a mutable tree, it can't be done, because the branch containing the "twig" is gone.  So how do you solve this?  You create a "new" tree, and it is this new tree that has the new branches added or pruned off.  When you do your merge, you "graft" or "prune" the branches from this "new" tree onto the original.  I had a lot of trouble grokking functional programming and immutable data structures until I read an analogy in the book The Joy of Clojure by Michael Fogus and Chris Houser.  They compared functional programming languages with immutability to the flip books that create the illusion of movement.  In immutable languages, every page is a "version" of the data.  In mutable languages, you have one piece of paper, and every time you want to change something, you have to erase the picture and start over.  There's no "versioning".

Now, that might sound like a huge performance hit, both in terms of speed and size.  Creating a duplicate tree wastes space, not to mention copying a whole bunch of stuff around.  But that's a naive approach.  If you think about it, you don't have to duplicate the entire tree.  When you insert a node into a tree, what do you first have to do?  You first have to traverse the tree to find the insertion point.  By doing this, you have already walked a branch of the tree...and that's all you need to duplicate in order to get to the new node.  In fact, you don't even have to "duplicate" the data in this branch (which is, for all intents and purposes, a linked list of nodes).  In other words, you don't need to copy the contents of the branch; you just need fresh pointers to the nodes making up the branch.

You can create a (shared) pointer for each of the nodes you have walked through.  Once you reach the insertion point, you create two new nodes: one for the insertion-point node itself, and one for the actual node to be inserted.

This requires log n new shared pointers (not actual node objects, just pointers to the nodes) and the creation of just 2 new nodes (versus mutating one in place for a mutable tree).  This is much better than duplicating an entire new tree of n nodes.  You now have 2 trees: the original, and a new tree which is exactly the same as the original except for the new node.  And now you get all the advantages of an immutable, persistent tree data structure.  Look at this graphical example:

[Figure: two tree heads after inserting 11; blue is the original, green is the new version, and the green circles are the new pointers to the shared, already-existing nodes.]

Basically, what you are "duplicating" is a linked list that leads to the newly inserted node.  The green circles represent new pointers to the already existing objects.  Now you have two "heads", one in blue and one in green.  The blue head does not have a node with a value of 11, but the green head does.  Since this was a small tree, the number of new nodes is fairly large in proportion.  On average, you will only need log n + 1 new nodes.  I have read that there is a way to implement this with O(1) new nodes, but I have not studied that implementation.
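
Here is a minimal C++11 sketch of the idea using std::shared_ptr for the structural sharing (this trick is usually called "path copying"; the names below are my own, not from any library).  Only the nodes on the walk from the root to the insertion point get fresh copies; every untouched subtree is shared between the old head and the new one:

#include <initializer_list>
#include <iostream>
#include <memory>

// An immutable binary search tree node. Children are shared, so
// many tree versions can point at the same subtrees.
struct Node {
    int value;
    std::shared_ptr<const Node> left, right;
};

using Tree = std::shared_ptr<const Node>;

// Returns the head of a brand-new tree; the original is never mutated.
Tree insert(const Tree& root, int value) {
    if (!root)   // reached the insertion point: the one genuinely new node
        return std::make_shared<const Node>(Node{value, nullptr, nullptr});
    if (value < root->value)   // copy this node, share the right subtree
        return std::make_shared<const Node>(
            Node{root->value, insert(root->left, value), root->right});
    else                       // copy this node, share the left subtree
        return std::make_shared<const Node>(
            Node{root->value, root->left, insert(root->right, value)});
}

bool contains(const Tree& root, int value) {
    if (!root) return false;
    if (value == root->value) return true;
    return contains(value < root->value ? root->left : root->right, value);
}

int main() {
    Tree blue;   // the original head
    for (int v : {8, 4, 12, 2, 6}) blue = insert(blue, v);

    Tree green = insert(blue, 11);   // the new head

    // Both versions stay fully usable: the old head never sees the 11.
    std::cout << contains(blue, 11) << " "
              << contains(green, 11) << std::endl;   // prints "0 1"
}

Each call to insert copies one node along the search path, so a balanced tree costs about log n + 1 new nodes per version, matching the count above.  And since every node is owned by a shared_ptr, a version's nodes are reclaimed automatically once no head can reach them.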
