Wednesday, December 3, 2014

A critique of python. Functional python to the rescue?

Perhaps my post about the OpenStack clojure tool gave away some hints, but I am trying to distance myself from python.  I believe that in the next 5-10 years, python will have much of its thunder stolen by the new kids on the block: go, julia, and swift, and possibly even clojure or elixir.

I'm a big believer in "the right tool for the right job".  The trouble is that people think python is great for anything.  I can already hear people sharpening their swords, but let me explain.  Python can be used for just about anything.  If a site like YouTube is mostly written in python, that goes to show its power.  Also, look at OpenStack itself: a project with a million lines of code is nothing to sneeze at.  However, python has its problems, especially at scale.  And I believe a big reason that big projects are written in python is that it's perceived as an "easy" language.  Because it is "easy", there is a wealth of programmers to hire, and it's very fast to churn out code.  If code is easy to crank out, it seems logical to think that scaling should be easier as well.  Another big problem with python is that even though it is "multi-paradigm", it is basically an OOP language with mutable, imperative programming baked in.

The trouble is that the strength of python is also its weakness.  Python is good at rapidly banging out some idea to see if it works.  Due to its dynamically typed nature, you don't have to write tons of boilerplate code like you would with Java or even C++.  If you want to extend the abilities of a class, no problem: monkey-patching to the rescue.  And very likely, some other programmer(s) have written a 3rd party library for you to consume.  However, do you see the problems inherent in these powers?

Because python is dynamically typed, when you first read python code, it is not at all obvious what kinds of arguments you are supposed to pass into a function.  Docstrings, I hear you say?  Even if docstrings exist, all too often they are out of date or just plain wrong.  How many times have you had to crack open a debugger to figure out why your function wasn't working as expected, only to discover that the argument you passed in was the wrong type?  I would go so far as to say that the time saved by not having to statically type your variables is lost during debugging.  At least Python3 introduced argument annotations (which, btw, are not necessarily type hinting), perhaps because they realized duck typing isn't always enough?  Duck typing, while convenient, offers very weak guarantees about whether your object *should* be calling some function.  And monkey-patching is not the best for production-quality distributed software.  How can you be sure that some monkey-patched new field or function won't clash with what another client is doing?  Finally, while it may seem like having a ton of 3rd party libraries is great, python never solved what OSGi solved for Java (or what the upcoming project jigsaw will do for Java).

Let's say you have module foo, and it depends on importing module baz version 1.2.  Then you have module bar, and it too has a dependency on module baz, but at version 2.0+.  Sorry friends, but you are SOL in python.  This is because python has a flat PYTHONPATH much like vanilla Java does with its CLASSPATH.  OSGi solved this by essentially writing custom class loaders and adding metadata to OSGi modules (and I believe some JVM bytecode hackery too).
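To make the flat-path problem concrete, here's a small sketch (the module name baz and the version numbers are just the hypothetical ones from above) showing that whichever copy appears first on sys.path shadows every other one:

```python
# Two directories, each with its own baz.py "release"; whichever one
# comes first on the flat sys.path wins for the entire interpreter.
import os
import sys
import tempfile

root = tempfile.mkdtemp()
for version in ("1.2", "2.0"):
    pkg_dir = os.path.join(root, "baz_" + version.replace(".", "_"))
    os.mkdir(pkg_dir)
    with open(os.path.join(pkg_dir, "baz.py"), "w") as f:
        f.write("__version__ = %r\n" % version)
    sys.path.append(pkg_dir)

import baz
# The 1.2 directory was appended first, so 2.0 is simply unreachable,
# no matter which module wanted it.
print(baz.__version__)  # -> '1.2'
```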

The weaknesses above don't even factor in other problems of scale.  Take for example speed.  Unless you are running pypy, python's speed leaves something to be desired.  And not every problem is speed-insensitive.  Some may say this is a moot point and "my program is fast enough", as Guido himself seems to think.  Even Guido says the solution is to write the slow parts of your python code in C(++).  Really?  Okay, for the moment, maybe the python people are safe, because your competitors' programs, written in ruby, lua, or python as well, will probably be about as fast.  And, you think, the feature set you churn out will be far greater than those statically typed guys programming in Java, Scala, C++, Haskell, etc.  There's no way they could churn out as much functionality in the same time frame as you can in python, right?

Ok, let's buy into that last argument (even though I don't think it is necessarily true).  What if you start comparing the old-world dynamic languages (ie python, ruby, perl, etc) with the new generation of languages, say Julia, go, elixir, or clojure?  Julia is blazing fast, and clojure is no slouch either.  Now suppose your competitor has written his product in one of these languages, while yours still uses python.  That speed equivalence across the old-school languages is gone, and your product will suffer.

And speed isn't the only scaling problem python has.  Python has a model for concurrency which requires, in essence, a distributed model of computing even on a single machine.  Yes, I can hear the chorus of shouting now: "Stop blaming the GIL!!!".  And if you think your answer is multiprocessing, that is not always the easiest of solutions.  Firstly, multiprocessing on Windows is a pain due to the requirement that all arguments to the new process must be pickleable.  Secondly, multiprocessing itself doesn't save you from synchronization problems, even though you may think it does if you use Queues for serialization (to be fair, most languages don't make concurrency easy, though some new languages' claim to fame is that they make it quite a bit easier, like clojure, erlang, or scala via the Akka library).  And since the end of Moore's law is around the corner (I remember reading once that at the 12nm limit, the electron can jump out, and we're almost there), speed increases will have to come either from some form of concurrency/parallelism, or from a new computer architecture, perhaps fiber optics or quantum computers.

When you start writing huge amounts of code, especially concurrent code, python's lack of immutable variables is really head-scratching.  Yes, I understand tuples, strings, ints, etc are immutable, but there's overhead in subclassing from one of those types to create your own immutable type.  Also, persistent data types only exist in 3rd party libraries like pyrsistent.  That means you can't really be sure if your dict is the mutable or immutable kind.  A lot of people don't understand the importance of having immutable data, especially if they have never written concurrent code.  But I can't understand how python still doesn't have a simple way of creating a read-only constant.  Globals aren't necessarily evil if they are read-only.

So, why do functional programmers rave about immutable persistent data structures?  Imagine a scenario as simple as this:

sizes = [10, 20, 30]
configure_sizes(sizes)

What will sizes be after running configure_sizes()?  As a client, you don't know what that function does to the sizes variable.  Ok, I can hear a smart aleck say I should have used a tuple.  But what if I had used a dict?  If sizes = 100 (a simple non-compound type) and configure_sizes() changed it, you'd be upset, wouldn't you?  Why should compound types be treated differently?  Let's look at another evil of mutable imperative OOP.

class Foo(object):
    def __init__(self, x):
        self.x = x
    def multiplier(self, y):
        return self.x * y

def modifier(foo_obj, newval):
    foo_obj.x = newval

f = Foo(10)
result = f.multiplier(2)        # 20
modifier(f, 3)                  # silently mutates f's hidden state
final_result = f.multiplier(2)  # 6 -- same call, different answer

In this case, the output of Foo.multiplier() depends on hidden state (self.x).  This is bad, because it means that the end client can't ever be fully sure, given the input he passes in, what the result will be.  This is why in functional programming, the only thing that determines the output is the input.
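For contrast, here is a sketch of the same example in a functional style (using a namedtuple; the function names just mirror the class above), where the only thing determining the output is the input:

```python
# Functional-style rewrite of the Foo example: state is an immutable value,
# and "modification" returns a new value instead of mutating the old one.
from collections import namedtuple

Foo = namedtuple("Foo", ["x"])

def multiplier(foo, y):
    # Output depends only on the inputs -- no hidden state.
    return foo.x * y

def modifier(foo, newval):
    # Returns a *new* Foo; the original is untouched.
    return foo._replace(x=newval)

f = Foo(10)
result = multiplier(f, 2)         # 20, and always 20 for Foo(10)
f2 = modifier(f, 3)
final_result = multiplier(f2, 2)  # 6, but f is still Foo(10)
```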

Python's crippled lambda also kind of sucks.  Why should its body be limited to a single expression?
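To be precise, the restriction is one expression, not one line; anything needing a statement has to graduate to a def (the names below are just made up for illustration):

```python
# A lambda body must be a single expression -- no statements allowed.
double = lambda x: x * 2
print(double(21))  # 42

# Assignments, loops, or try blocks force you into a full def:
def double_and_log(x):
    result = x * 2
    print("doubling", x)
    return result

print(double_and_log(5))  # prints "doubling 5", then 10
```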

Is python a bad language?  Definitely not.  There are some very cool features like generators, coroutines, list and dict comprehensions, and decorators.  But the combination of no true type hinting, no standard immutable data types, no easy way to do concurrency, and an imperative-by-default style of programming leaves a lot to be desired.  However, writing python in a functional style is possible.

Toolz
toolz is a python package that expands upon the functional paradigm of using higher order functions.  I highly recommend reading through the site, as they make a good case for functional programming in python.  Toolz includes some useful functions that work with sequences.  For example, they have a function called accumulate, which is like the reduce of most other functional languages except that it yields every intermediate result (what other languages call a scan).
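The stdlib's itertools.accumulate behaves the same way, so the reduce-vs-accumulate difference can be sketched without installing toolz:

```python
# reduce collapses a sequence to one value; accumulate (like
# toolz.accumulate) yields every running total along the way.
from functools import reduce
from itertools import accumulate
from operator import add

nums = [1, 2, 3, 4, 5]

print(reduce(add, nums))            # 15
print(list(accumulate(nums, add)))  # [1, 3, 6, 10, 15]
```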


Pyrsistent
pyrsistent is a set of immutable data objects that can be used in python.  They give you immutable vectors (lists), immutable maps (dictionaries) and even immutable classes.  Immutable data is of immense help during concurrent programming.  If you have multiple threads acting on the same immutable (read-only) data structure, you can never have inconsistent results.  Even when you aren't writing concurrent code, it is very useful because you don't have to keep track of state all the time.  Functional programming highly stresses always using lexically scoped, pure functions.  Pure means that given the same inputs to a function, you will always get the same outputs.  There is no hidden state (such as globals).
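The discipline pyrsistent enforces can be approximated with the stdlib: every "update" builds a new structure and leaves the original untouched.  A rough sketch (the config example and with_port helper are made up; pyrsistent's real types are more efficient because they share structure):

```python
# Copy-on-write updates over a read-only mapping: the original is never
# mutated, so any thread holding "config" sees a consistent value forever.
from types import MappingProxyType

config = MappingProxyType({"host": "localhost", "port": 8080})

def with_port(cfg, port):
    # Build a brand-new read-only mapping instead of mutating cfg.
    updated = dict(cfg)
    updated["port"] = port
    return MappingProxyType(updated)

cfg2 = with_port(config, 9090)
print(config["port"])  # 8080 -- the original is unchanged
print(cfg2["port"])    # 9090

try:
    config["port"] = 1234
except TypeError as e:
    print("read-only:", e)
```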

hy
A lisp running on python?  Cool.  Now, it's no clojure, but it might be possible to get halfway there by using hy in combination with toolz and pyrsistent.  Although hy doesn't have literals for certain data types, it should be possible to add them, because hy has reader macros (which clojure lacks, btw).






Sunday, September 7, 2014

Wrapping my head around python asyncio

Trying to understand python's asyncio is a challenge.  First, I personally don't know what is more difficult, multi-threaded programming or event-driven programming.  Multithreaded programming has the difficulty of properly finding and eliminating race conditions, deadlocks, and livelocks.  Event-driven programming has the difficulty of a non-intuitive flow of control and many layers of abstraction and indirection.  So where do we even start?  If you just start reading the official documentation on asyncio, you probably won't get too far.  Reading PEP 3156 won't get you much farther, though I do recommend studying both.

My main motivation for learning asyncio is probably a little unusual.  I wanted to write something like pexpect, without using pexpect.  In a nutshell, I wanted to interact with a child subprocess, perhaps more than once.  Python's subprocess module doesn't let you do this exactly, even though you may think Popen.communicate() does.  The problem is that it is a "one-shot" communication.  You feed it one string and then you are done.  But what if you need to answer multiple prompts from your child process?

So where can we start?  I'm learning this too, so as I go, I'll introduce more examples.  So let's make a small example of calling a subprocess using asyncio.  I won't explain it in detail in this blog.

I will however briefly explain what coroutines are.  In a nutshell, a coroutine is a way to factor out code that uses yield.  The reason this needs special support is that yield is "contagious": the very presence of yield in a function turns that function into a generator.  So what do you do when you realize that some code using yield could be factored out into its own function?  That is what the new "yield from" syntax is for, and it is how a coroutine can be spotted.
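Before the real script, here's a tiny made-up illustration of that factoring: the helper's yields flow transparently to the outer generator's caller via "yield from":

```python
# Factoring out code that uses yield: "yield from" delegates to a
# subgenerator, so its yielded values propagate to our caller.
def countdown(n):
    # A helper generator we want to reuse from other generators.
    while n > 0:
        yield n
        n -= 1

def launch_sequence():
    yield "ignition"
    # Delegate; countdown's values appear in our output stream.
    yield from countdown(3)
    yield "liftoff"

print(list(launch_sequence()))  # ['ignition', 3, 2, 1, 'liftoff']
```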


subproc_shell.py
1    """ 
2    Took the example from the tulip project and modified it to make it more like pexpect 
3    """ 
4     
5    import asyncio 
6    import os 
7    from asyncio.subprocess import PIPE, STDOUT 
8    import re 
9     
10    
11   @asyncio.coroutine 
12   def send_input(writer, input, que, regex): 
13       """ 
14       The coroutine that will send its input to the input stream (usually stdin) 
15    
16       :param writer: The stream where input will go (usually stdin) 
17       :param input: A sequence of bytes (not strings) 
18       :param que: an asyncio.Queue used to check if we have what we need 
19       :param regex: a re.compile() object used to see if an item from the que matches 
20       :return: None 
21       """ 
22       input.reverse()  # We have to reverse because we pop() from the end 
23       try: 
24           while input: 
25               item = yield from que.get() 
26               #print("Pulled from queue:", repr(item)) 
27               if item is None: 
28                   break 
29               m = regex.match(item.decode()) 
30               if m: 
31                   line = input.pop() 
32                   #print('sending {} bytes'.format(len(line))) 
33                   writer.write(line) 
34                   d = writer.drain() 
35                   if d: 
36                       # writer.drain() returns a generator 
37                       yield from d 
38                       #print('resume writing') 
39                   writer.close() 
40       except asyncio.QueueEmpty: 
41           pass 
42       except BrokenPipeError: 
43           print('stdin: broken pipe error') 
44       except ConnectionResetError: 
45           print('stdin: connection reset error') 
46       except Exception as ex: 
47           print(ex) 
48    
49   @asyncio.coroutine 
50   def log_errors(reader): 
51       while True: 
52           line = yield from reader.read(512) 
53           if not line: 
54               break 
55           print('ERROR', repr(line)) 
56    
57    
58   @asyncio.coroutine 
59   def read_stdout(stdout, que): 
60       """ 
61       The coroutine that reads non-blocking from a reader stream
62       :param stdout: the stream we will read from
63       :param que: an asyncio.Queue object we put lines into
64       
65       """ 
66       while True: 
67           line = yield from stdout.read(512)  # use this instead of readline() so we dont pause on a newline 
68           print('Received from child:', repr(line)) 
69           que.put_nowait(line)  # put the line into the que, so it can be read by send_input() 
70           if not line: 
71               que.put_nowait(None)  # A sentinel so that when send_input() pulls this from the que, it will stop 
72               break 
73    
74    
75   @asyncio.coroutine 
76   def start(cmd, inp=None, queue=None, shell=True, wait=True, **kwargs): 
77       """ 
78       Kicks off the subprocess
79       :param cmd: str of the command to run
80       :param inp: 
81       :param kwargs: 
82       :return: 
83       """ 
84       kwargs['stdout'] = PIPE 
85       kwargs['stderr'] = STDOUT 
86       if inp is None and 'stdin' not in kwargs: 
87           kwargs['stdin'] = None 
88       else: 
89           kwargs['stdin'] = PIPE 
90    
91       fnc = asyncio.create_subprocess_shell if shell else asyncio.create_subprocess_exec 
92       proc = yield from fnc(cmd, **kwargs) 
93    
94       q = queue or asyncio.Queue()  # Stores our output from read_stdout and pops off (maybe) from send_input 
95       regex = re.compile("Reset counter") 
96    
97       tasks = [] 
98       if proc.stdout is not None: 
99           tasks.append(read_stdout(proc.stdout, q)) 
100      else: 
101          print('No stdout') 
102      if inp is not None: 
103          tasks.append(send_input(proc.stdin, inp, q, regex)) 
104      else: 
105          print('No stdin') 
106   
107      if 0: 
108          if proc.stderr is not None: 
109              tasks.append(log_errors(proc.stderr)) 
110          else: 
111              print('No stderr') 
112   
113      if tasks: 
114          # feed stdin while consuming stdout to avoid hang 
115          # when stdin pipe is full 
116          yield from asyncio.wait(tasks) 
117   
118      if wait: 
119          exitcode = yield from proc.wait() 
120          print("exit code: %s" % exitcode) 
121      else: 
122          return proc 
123   
124   
125  def main(): 
126      if os.name == 'nt': 
127          loop = asyncio.ProactorEventLoop() 
128          asyncio.set_event_loop(loop) 
129      else: 
130          loop = asyncio.get_event_loop() 
131      loop.run_until_complete(start('c:\\python34\python.exe dummy.py', inp=[str(x).encode() for x in (3, 3, 0)])) 
132      loop.close() 
133   
134   
135  if __name__ == '__main__': 
136      main()

OpenStack clojure tool

So, I'm at Red Hat now :)  I'm now a Quality Engineer working on OpenStack which is a new direction for me.  This is the first time I have been working on a 100% software project.  Well, that's not entirely true I guess.  I did spend 18 months designing an automation framework from scratch.  Nevertheless, this is new and interesting.

That being said, OpenStack is written almost entirely in Python.  Namely Python 2.7. uggh.

Why the moaning?  I used to be a big rah-rah python guy.  A former co-worker of mine even jokingly wondered if Guido Van Rossum was paying me money to try to switch our company over to Python (which, by the way, I pretty much single-handedly did).  However, over the years, I have come to find many pain points that have considerably dimmed my enjoyment of the language.  Now, hopefully I won't get any flames.  Python isn't a bad language, and it has quite a few interesting features.  I just find myself longing for some things python lacks.  And indeed, with some really interesting dynamic interpreted (or JIT'ed) systems programming languages like Go, Julia, and even Swift, I really wonder how much wind is going to get taken out of Python's sails in the next 5 years.  And that doesn't even factor in the non-systems programming languages like Clojure, Elixir, or even TypeScript or LiveScript (a haskell-like variant of javascript).

Duck Typing ain't enough
I think Python 3 came to this realization with its new argument annotations, so Python 3 suffers from this problem less than Python 2 does (the annotations are just metadata, though; nothing enforces them).  Type hinting is the way to go.  It allows the developer to rapidly prototype an idea, and then for performance or documentation reasons, type the arguments and return value later.  It would be nicer if Python were like Julia (or TypedClojure) and allowed even locals to be optionally typed.
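Here's what those Python 3 annotations look like in practice (scale is a hypothetical function; note that nothing checks the annotations at runtime):

```python
# Python 3 function annotations: stored on the function as metadata,
# visible via __annotations__, but never enforced by the interpreter.
def scale(vector: list, factor: float) -> list:
    return [x * factor for x in vector]

print(scale([1, 2, 3], 2.0))  # [2.0, 4.0, 6.0]
print(sorted(scale.__annotations__))  # ['factor', 'return', 'vector']

# No enforcement: a "wrong" type that happens to duck-type still runs.
print(scale("ab", 2))  # ['aa', 'bb']
```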

When you start getting code bases into the many tens of thousands of lines or more, you just look at a function and wonder, "ok, what kind of variable am I supposed to pass in?".  Some of you may be saying that's what a good docstring is for.  I would agree, except that we all know the first thing to bit-rot is documentation.

Moreover, duck-typing can lead to unintended problems.  Perhaps you want to pass in an object that supports the method quack().  Unfortunately, the user happens to pass in a BadDoctor class object, and your function happily calls the quack() method for you.
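A sketch of that scenario (Duck, BadDoctor, and make_it_quack are all hypothetical names):

```python
# Duck typing only checks that the method exists, not what it means.
class Duck:
    def quack(self):
        return "quack!"

class BadDoctor:
    def quack(self):  # a quack of a very different kind
        return "this miracle tonic cures everything!"

def make_it_quack(duck):
    # Happily calls quack() on anything that has one -- no error,
    # no warning, just the wrong kind of quack.
    return duck.quack()

print(make_it_quack(Duck()))
print(make_it_quack(BadDoctor()))
```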

Hard to make constants (immutables)
Basically, if you want to truly make an immutable object in python, you'll need to subclass from int, tuple, str, or some other immutable built-in type.  And it's a little odd to do so.  It's one of the few places where implementing __new__() is required.  I often tell people that python's __init__ is not the constructor; __new__ is.  It is __new__ that actually allocates the memory for the object, and __init__ initializes the allocated memory.  If you have an immutable object, you have to give it a value as soon as it is created.
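A sketch of that tuple-subclass approach (Point is a hypothetical example; note the value must be supplied in __new__, because by the time __init__ runs the tuple's contents are already fixed):

```python
# An immutable Point by subclassing tuple: __new__ supplies the value
# at allocation time, which is the only chance an immutable type gets.
class Point(tuple):
    def __new__(cls, x, y):
        return super().__new__(cls, (x, y))

    @property
    def x(self):
        return self[0]

    @property
    def y(self):
        return self[1]

p = Point(3, 4)
print(p.x, p.y)  # 3 4

try:
    p[0] = 9  # tuples don't support item assignment
except TypeError:
    print("immutable, as promised")
```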

Another way to make "read-only" objects is to use a setter property.  It's not fool-proof, but it does allow one to make a mostly read-only object.  You could also reimplement __getattr__ and __setattr__ for the class and have it look up what you are trying to access.  And lastly, you could write a C(++) module for the data structure which does have const.  But really, would you want to do that?
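The setter-property approach looks roughly like this (Settings is a hypothetical class; as noted, it's not fool-proof, since the underscored attribute is still reachable):

```python
# A "mostly read-only" object via a getter-only property: assigning to
# the property raises AttributeError, though _timeout is still exposed.
class Settings:
    def __init__(self, timeout):
        self._timeout = timeout

    @property
    def timeout(self):
        return self._timeout

s = Settings(30)
print(s.timeout)  # 30

try:
    s.timeout = 60  # no setter defined, so this fails
except AttributeError:
    print("timeout is read-only")
```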

Performance
Pypy aside, python's performance leaves something to be desired.  It also seems that Guido is unfazed by python's performance and thinks it's good enough.  I was quite startled to recently learn how well the V8 Javascript engine performs: on average, it's about as fast as PyPy.  But OpenStack requires regular CPython, mainly because of lots of dependencies on modules that use C extensions (when you install OpenStack from devstack or packstack, you'll see some source compilation going on).

The Browser as the new VM
Like it or not, the browser is kind of the new VM.  That means that javascript is becoming as important as C or Java and just as ubiquitous.  Having an application that can run virtually anywhere, including mobile devices is not to be scoffed at.  Also, I was surprised to learn the new tricks HTML 5 has up its sleeve.  This includes a File System API so that you can finally read local files (albeit to a sandboxed file system), the websocket API, WebGL, and drag and drop support just to mention a few.  Since javascript is the de facto language of the browser, that means for better or worse learning javascript.  There are quite a few python-to-javascript libraries out there, including pyjs, and brython. However, they are not developed by the core python team and so I wonder if/when support will end?  And brython only supports python3.

No persistent data structures built-in
So there is pysistence.  But being a non-standard 3rd party library and with the lack of data-typing, it means that users will not be sure when persistent or non-persistent data types are being used.  Why do we want persistent data structures by default?  This page and this one sum it up pretty well.

Lack of good concurrency
Python, thanks to the GIL, doesn't really have true concurrency.  There is multiprocessing, which fires up a new python interpreter, but it does have some limitations (like the arguments must be pickle-able on Windows) which can be a real pain.  Also, since a new python interpreter is getting fired up for each new multiprocess, python developers can't really laugh at the JVM's large consumption of memory once you start firing up 20+ processes.  Hopefully pypy will solve this problem with Software Transactional Memory.
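The pickling restriction is easy to demonstrate with the pickle module directly, without even spawning a process (everything multiprocessing sends across the process boundary goes through pickle):

```python
# Functions importable by name pickle fine (stored as a reference);
# a lambda has no importable name, so pickling it fails -- and so does
# handing it to a multiprocessing worker.
import math
import pickle

data = pickle.dumps(math.sqrt)  # fine: 'math.sqrt' is a stable reference
print(type(data))

try:
    pickle.dumps(lambda x: x * 2)
    print("pickled a lambda?!")
except (pickle.PicklingError, AttributeError, TypeError) as e:
    print("cannot pickle a lambda:", e)
```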


Tuesday, June 17, 2014

Projects to work on

So a few days ago I came up with a list of some fun things to work on to expand my knowledge and proficiency.  I'm putting the language design on the backburner just because there's so much you have to know to build your own language.  That being said, I will learn more about language design via another route.  So here's a list of some things I could work on:

  • Data structures: Trie, Persistent RB Tree, B+ Tree (julia, clojure).
  • Build a simple filesystem (julia, clojure, rust?)
  • Convert Machine Learning code (julia, clojure)
  • Build a graph database (julia, clojure)
  • Build messaging framework (julia, python)
  • Build a parser combinator (julia, clojure)

If you'll notice, all these projects are what you would find in a typical computer science class.  As I've explained in other blog posts, one's foundation has to be strong.  IMHO, this is the difference between being a programmer or developer, and being an engineer (or scientist).

Having a good handle on some of the more "exotic" data structures could be useful.  Tries have some interesting properties (and a modified version called a hash array mapped trie is what powers clojure's map type).  Red-black trees are used in Linux's Completely Fair Scheduler, and are useful wherever it is desirable to have balanced trees.  I started implementing an RB tree in C++ to help me learn C++11, but I'm going to do this in either julia or clojure instead.  I might even attempt a B+ tree, though from the little I have looked at them, they look quite complicated.

I could use some of the data structure knowledge to help with filesystems.  Lately, I've been getting interested in Big Data.  Not just the analytics, which is cool in itself, but in the implementation of Big Data.  For example, you have to store all that data somewhere.  One of the few places in the linux kernel with complicated data structures is the file system.  Lately, there's been a lot of interest in software defined storage.  It isn't just computers that are getting virtualized: you have VLANs and software storage too.  Distributed software-based file systems like HDFS, Glusterfs, and Ceph have been gaining a lot of interest.  It would be interesting to see what it would take to build a file system.

While filesystems can enable Big Data, the other side of course is in analyzing that data.  One of the first things that got me into Computer Science was artificial intelligence.  It's one of the few fields that is math heavy (other than 3d programming or scientific computing) and that alone makes it interesting.  Everything in AI from support vector machines, decision trees, bayesian classification, markov models, neural networks and genetic programming is fascinating.  Given that julia was tailor made for scientific computing, I think it would be very interesting to convert the code samples in Machine Learning in Action book from python to julia.

Somewhat related to data structures and filesystems are graph databases.  A graph can simulate any other data type (though not with the same runtime or space complexity), from arrays to hashes to trees.  Ironically, they are better at modeling many kinds of relationships than traditional SQL databases, and much better than document DB's like couchdb or mongodb.  Graph databases would require a filesystem (for persistence), a hypergraph data structure, algorithms to walk the nodes (graph theory), and a strategy for ACID.

Messaging frameworks are a necessary component of any project of medium complexity or above.  They are replacing more traditional RPC-style communication due to their ability to be asynchronous and to do interesting things like pubsub or fanout communication.  With the proliferation of virtual machines and even virtual storage, data has to be shuttled across networks more than ever.  At first glance this doesn't seem very Comp. Sci. related, but the key is in the implementation.  Using coroutines and asynchronous IO is the key to this problem.

The parser combinator may seem like an odd choice, but think about what is required to build or design one.  It requires understanding lexing and scanning, and therefore building DFAs.  There's also the aspect of how to make the design functional, so that you can chain parsers together and use them as a recursive descent parser.  Writing a parser combinator will, to a large degree, be a stepping stone to writing my own language.
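As a taste of the chaining idea, here is a toy sketch (char, seq, and many are names I made up) where a parser is just a function from text to a (value, remaining-text) pair, or None on failure:

```python
# A minimal parser combinator: small parsers compose into bigger ones,
# which is exactly how a recursive descent parser is assembled.
def char(c):
    def parse(text):
        if text.startswith(c):
            return c, text[1:]
        return None
    return parse

def seq(*parsers):
    # Run each parser in order, threading the leftover text through.
    def parse(text):
        values = []
        for p in parsers:
            result = p(text)
            if result is None:
                return None
            value, text = result
            values.append(value)
        return values, text
    return parse

def many(parser):
    # Apply parser zero or more times, collecting the results.
    def parse(text):
        values = []
        while True:
            result = parser(text)
            if result is None:
                return values, text
            value, text = result
            values.append(value)
    return parse

ab_parser = seq(char("a"), many(char("b")))
print(ab_parser("abbb!"))  # (['a', ['b', 'b', 'b']], '!')
```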

So why will I try to implement almost everything in either julia or clojure?  Clojure is starting to gain momentum with other companies and while I think most Java developers would rather graduate to Scala since it's a little more familiar, I think clojure will find its own niche.  For example, Puppet Labs recently ported some of their technology to clojure from ruby.  There's also a project at Red Hat called Immutant which allows you to write cloud apps in clojure.  And jclouds supports not only java as a first class citizen, but clojure too for your open cloud needs.  Julia is a very intriguing language with a lot of potential that I will shortly write a blog on.  The devs actually recommend learning julia by looking at the source code for julia itself.  This may also give me some insight into how LLVM works, since julia uses LLVM as a jit compiler.






Friday, January 3, 2014

How functional programming has motivated me to relearn math

Ok, I will admit it.  The first thing that attracted me to learning a lisp dialect was the coolness factor.  There's a saying about being a Smug Lisp Weenie.  To wit, we have this chart that I first saw several years ago:
http://lukewelling.com/wp-content/uploads/2006/08/programmer%20hierarchy.pdf

Actually, this chart was made by someone who only works with "pure" software programmers.  Whoever made this chart must have never run into the HDL (eg Verilog or VHDL) guys, as they would most prominently have been at the top.  Nevertheless, the point is well made that other programmers feel that Lisp programmers feel they are superior to everybody.  Just read some of the reviews on Amazon.com for the classic SICP book by Abelson and Sussman and see how some people feel about lisp (or scheme) programmers.

I felt like there must be something about Lisp to inspire such a love/hate dichotomy.  For those who hated it, was it just a case of sour grapes?  Did those who couldn't "get" lisp simply deride it as a defense mechanism?  My first experience with lisp was...well, confusing.  We had to do a little bit of lisp programming in my university's Artificial Intelligence class.  To say I had no clue how lisp worked would be an understatement.  I mean, 'car' and 'cdr'?

Regardless, a few years later (ca 2010), there was this new language Clojure that everyone was talking about.  I also recall reading around that time this hilarious article http://www.secretgeek.net/lisp_truth.asp.  I remember busting out laughing over and over, even though I hadn't yet grokked any lisp dialect.  Surely, a language that could inspire a page as funny as that had to have something going for it.  Not to mention that lisp is the 2nd oldest continuously used language around.  So I decided to dip my toes into the Clojure waters and see what all the fuss was about.

It was hard.  Real hard.  I mean, how can you have a language without assignment?  Without for loops?  Where there isn't even technically a variable (as the oxymoron "constant variable" would surely indicate)?  It quite literally took me months to wrap my head around this bizarre concept.  To be honest, I think a good deal of the difficulty of the language wasn't so much around the "lispiness"; it was around the functional aspects.  Lisp and Scheme aren't pure functional languages (technically, neither is Clojure, but it is more so than either Lisp or Scheme, as all data is immutable and most sequences are lazy).  The idea that lisp is stupid because of all the parentheses is...well, stupid.  In fact, they're what make lisps so beautiful and magical.  Lisps are written almost directly in their raw abstract syntax tree.  There's also very little syntax to learn.

But what does all this have to do with math?  Functional programs are very closely related to the idea of mathematical functions.  The Fortran- and Algol-descended languages have at their core the idea of a Von Neumann machine.  The language they model closely resembles the hardware it runs on.  This imperative approach of modifying registers or memory is basically what Turing machines are about.  If I ever make gobs of money, I will try to buy an old lisp machine.

But I digress.  Lisp was really designed as a symbolic evaluator, and rather than follow the Turing/Von Neumann model of computation, it took the lambda calculus of Alonzo Church as its basis for computation.  Although Turing machines and the lambda calculus are functionally equivalent (they have the same computational power), their expressiveness is worlds apart.

I recently tried looking at some papers and books on denotational semantics and the lambda calculus, and I realized just how much math I have forgotten.  So, as bathroom reading material (yeah yeah...too much info), I have my old discrete math and formal languages and automata theory books that I can peruse.  I'm amazed how much I have forgotten.  Things like modular arithmetic, bijections, counting (set theory), equivalence relations and so on.

This is the heart of computation.  While most software engineers are content with learning the new fad of the year (hmmm, hadoop seems popular now, or maybe I better learn objective-c so I can write a bright and shiny iPhone app), I am more fascinated by the foundation of computation itself.  But it's also hard.  In all honesty, learning some new library is really not that hard.  What interests me more than programming with hadoop...is the science behind hadoop itself.

And it all starts with math, and lots of it.  It's not easy.  I remember Discrete Math to be insanely hard.  I had to retake linear algebra and abstract algebra.  Automata Theory wasn't a walk in the park either.  And learning all this math helps a software engineer design better algorithms or come up with more novel ways to solve a problem.

Now that the holidays are over, I'm going to resume my data structures project, and maybe tinker around with some other CS fundamentals.