Tuesday, June 17, 2014

Projects to work on

A few days ago I came up with a list of fun things to work on to expand my knowledge and proficiency.  I'm putting the language design on the back burner, simply because there's so much you have to know to build your own language.  That being said, I will learn more about language design via another route.  So here's a list of some things I could work on:

  • Data structures: Trie, Persistent RB Tree, B+ Tree (Julia, Clojure)
  • Build a simple filesystem (Julia, Clojure, Rust?)
  • Convert Machine Learning code (Julia, Clojure)
  • Build a graph database (Julia, Clojure)
  • Build a messaging framework (Julia, Python)
  • Build a parser combinator (Julia, Clojure)

If you'll notice, all these projects are what you would find in a typical computer science class.  As I've explained in other blog posts, one's foundation has to be strong.  IMHO, this is the difference between being a programmer or developer, and being an engineer (or scientist).

Having a good handle on some of the more "exotic" data structures could be useful.  Tries have some interesting properties (and a modified version called a hash array mapped trie is what powers Clojure's hash map type).  Red-black trees are used in Linux's Completely Fair Scheduler, and are useful wherever a balanced tree is desirable.  I started implementing an RB tree in C++ to help me learn C++11, but I'm going to do this in either Julia or Clojure instead.  I might even attempt a B+ tree, though from the little I have looked at them, they look quite complicated.
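
To make the trie idea concrete, here is a minimal sketch in Python (used here only for brevity).  The names `TrieNode`, `Trie`, `contains` and `with_prefix` are made up for illustration, and this is the plain mutable version, not the persistent hash array mapped variant.  The interesting property it shows is that every word sharing a prefix also shares the same path from the root.

```python
class TrieNode:
    """One node per character; children are keyed by the next character."""

    def __init__(self):
        self.children = {}
        self.is_word = False  # True if the path from the root spells a stored word


class Trie:
    """Illustrative (hypothetical) API: insert, contains, with_prefix."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            # Reuse the existing child when the prefix is already stored.
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_word = True

    def _find(self, prefix):
        """Return the node reached by following prefix, or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._find(word)
        return node is not None and node.is_word

    def with_prefix(self, prefix):
        """Return all stored words starting with prefix, sorted."""
        node = self._find(prefix)
        if node is None:
            return []
        results = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)
```

Lookup cost depends on the length of the key rather than on how many keys are stored, which is part of what makes tries attractive for prefix searches.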

I could use some of the data structure knowledge to help with filesystems.  Lately, I've been getting interested in Big Data.  Not just the analytics, which is cool in itself, but the implementation of Big Data.  For example, you have to store all that data somewhere.  One of the few places in the Linux kernel with complicated data structures is the file system.  Lately, there's been a lot of interest in software-defined storage systems.  It isn't just computers that are getting virtualized; you have VLANs and software storage too.  Distributed software-based file systems like HDFS, GlusterFS and Ceph have been gaining a lot of interest.  It would be interesting to see what it would take to build a file system.

While filesystems can enable Big Data, the other side of course is analyzing that data.  One of the first things that got me into Computer Science was artificial intelligence.  It's one of the few fields that is math heavy (other than 3D programming or scientific computing) and that alone makes it interesting.  Everything in AI, from support vector machines, decision trees, Bayesian classification, Markov models and neural networks to genetic programming, is fascinating.  Given that Julia was tailor-made for scientific computing, I think it would be very interesting to convert the code samples in the book Machine Learning in Action from Python to Julia.

Somewhat related to data structures and filesystems are graph databases.  A graph can simulate any other data structure (though not with the same runtime or space complexity), from arrays to hashes to trees.  Ironically, they are better at modeling many kinds of relationships than traditional SQL databases, and much better than document DBs like CouchDB or MongoDB.  A graph database would require a filesystem (for persistence), a hypergraph data structure, algorithms to walk the nodes (graph theory), and a strategy for ACID.
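
As a rough illustration of the "data structure plus algorithms to walk the nodes" part, here is a small in-memory sketch in Python.  It is an assumption-laden toy: it uses a plain directed graph with labeled edges rather than a hypergraph, it has no persistence or ACID story, and the names `Graph`, `add_edge`, `neighbors` and `bfs` are invented for this example.

```python
from collections import deque


class Graph:
    """Toy directed graph with labeled edges (hypothetical API, in memory only)."""

    def __init__(self):
        # node -> list of (label, destination) pairs, in insertion order
        self.edges = {}

    def add_edge(self, src, label, dst):
        self.edges.setdefault(src, []).append((label, dst))
        self.edges.setdefault(dst, [])  # make sure the destination is a known node

    def neighbors(self, node, label=None):
        """Nodes reachable in one hop, optionally filtered by edge label."""
        return [
            dst
            for lbl, dst in self.edges.get(node, [])
            if label is None or lbl == label
        ]

    def bfs(self, start):
        """Breadth-first walk from start; returns nodes in visit order."""
        seen = {start}
        order = []
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nxt in self.neighbors(node):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return order
```

A relationship query such as "who does alice know?" then becomes `g.neighbors("alice", "knows")`, with no join table involved.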

Messaging frameworks are a necessary component of any project of medium or greater complexity.  They are replacing more traditional RPC-style communication because they can be asynchronous and can do interesting things like pubsub or fanout communication.  With the proliferation of virtual machines and even virtual storage, data has to be shuttled across networks more than ever.  At first glance this doesn't seem very Comp. Sci related, but the interesting part is in the implementation: coroutines and asynchronous IO are the key to this problem.
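
Here is a minimal sketch of pubsub/fanout built on coroutines, using Python's standard `asyncio` module.  It runs in a single process with in-memory queues, so there is no network IO involved, and `Broker`, `consumer` and `demo` are hypothetical names chosen for the example.  Using `None` as a shutdown message is also just a convention assumed here.

```python
import asyncio


class Broker:
    """Toy in-process broker: each subscriber gets its own queue (fanout)."""

    def __init__(self):
        self.topics = {}  # topic -> list of asyncio.Queue

    def subscribe(self, topic):
        queue = asyncio.Queue()
        self.topics.setdefault(topic, []).append(queue)
        return queue

    async def publish(self, topic, message):
        # Every subscriber of the topic receives its own copy of the message.
        for queue in self.topics.get(topic, []):
            await queue.put(message)


async def consumer(name, queue, received):
    """Coroutine that waits on its queue without blocking other consumers."""
    while True:
        message = await queue.get()
        if message is None:  # assumed shutdown convention for this sketch
            break
        received.append((name, message))


async def demo():
    broker = Broker()
    received = []
    workers = [
        asyncio.create_task(consumer("a", broker.subscribe("news"), received)),
        asyncio.create_task(consumer("b", broker.subscribe("news"), received)),
    ]
    await broker.publish("news", "hello")
    await broker.publish("news", None)  # tell both consumers to stop
    await asyncio.gather(*workers)
    return sorted(received)
```

Running `asyncio.run(demo())` delivers the one published message to both subscribers.  The publisher never calls the consumers directly, which is the main difference from an RPC-style design.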

The parser combinator may seem like an odd choice, but think about what is required to build or design one.  It requires understanding lexing (scanning), and therefore building DFAs.  There's also the question of how to make the design functional, so that you can chain parsers together and use the result as a recursive descent parser.  Writing a parser combinator will, to a large degree, be a stepping stone to writing my own language.
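
To show what "chaining them together" can look like, here is a small sketch in Python.  Each parser is just a function taking `(text, pos)` and returning `(value, new_pos)` on success or `None` on failure; that convention and all the names (`char`, `digit`, `seq`, `alt`, `many1`, `fmap`) are assumptions made for this example, not any particular library's API.

```python
def char(c):
    """Parser that matches exactly the character c."""
    def parse(text, pos):
        if pos < len(text) and text[pos] == c:
            return c, pos + 1
        return None
    return parse


def digit():
    """Parser that matches a single ASCII digit."""
    def parse(text, pos):
        if pos < len(text) and text[pos] in "0123456789":
            return text[pos], pos + 1
        return None
    return parse


def seq(*parsers):
    """Run parsers one after another; fail if any of them fails."""
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse


def alt(*parsers):
    """Try each parser at the same position; the first success wins."""
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse


def many1(parser):
    """Apply parser one or more times, collecting the results."""
    def parse(text, pos):
        values = []
        while True:
            result = parser(text, pos)
            if result is None:
                break
            value, pos = result
            values.append(value)
        return (values, pos) if values else None
    return parse


def fmap(parser, fn):
    """Transform the value of a successful parse with fn."""
    def parse(text, pos):
        result = parser(text, pos)
        if result is None:
            return None
        value, new_pos = result
        return fn(value), new_pos
    return parse


# Small grammar built purely by composition:
#   number := digit+
#   expr   := number ('+' | '-') number
number = fmap(many1(digit()), lambda ds: int("".join(ds)))
operator = alt(char("+"), char("-"))
expr = fmap(
    seq(number, operator, number),
    lambda parts: parts[0] + parts[2] if parts[1] == "+" else parts[0] - parts[2],
)
```

Note that this sketch works directly on characters and skips the separate lexer/DFA stage entirely, which is a simplification.  The point is only that `seq`, `alt` and `many1` mirror the sequence, choice and repetition of a grammar, so the composed parser reads much like a recursive descent parser.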

So why will I try to implement almost everything in either Julia or Clojure?  Clojure is starting to gain momentum with companies, and while I think most Java developers would rather graduate to Scala since it's a little more familiar, I think Clojure will find its own niche.  For example, Puppet Labs recently ported some of their technology from Ruby to Clojure.  There's also a project at Red Hat called Immutant which allows you to write cloud apps in Clojure.  And jclouds supports not only Java as a first-class citizen, but Clojure too for your open cloud needs.  Julia is a very intriguing language with a lot of potential, and I will shortly write a blog post about it.  The devs actually recommend learning Julia by looking at the source code for Julia itself.  This may also give me some insight into how LLVM works, since Julia uses LLVM as a JIT compiler.