Sunday, October 27, 2013

What to study...real world pragmatism vs. haskell, scheme and CS fundamentals

So there's a lot of stuff to catch up on mainly because I've been very busy.  I switched jobs back in August and moved to Salt Lake City, Utah.  I went back to being an Automation/tools engineer rather than doing linux driver development.  Why?  I'll go over that in another post as it deserves its own topic.

Currently, I'm mostly doing some heavy duty and advanced python (metaclasses, lots of decorators, import machinery hooks, and an asynchronous socket based message system), and pretty soon I'll be working on some C++ code as well.  I've been working on some messaging frameworks too (MOM).  And lastly, I'm still plugging away at some computer science fundamentals.  I haven't touched my data structure project in a while due to the busyness of moving and switching jobs, but I intend to get back on it.

In fact, I've been reinspired of late by reading a couple of books.  The first is the classic SICP.  Although I downloaded a pdf version, it's only the last few days I actually started reading it.  I finally got to the part about Normal Order evaluation versus Applicative Order evaluation and I had never considered that before.  In addition to the data structures project I have in C++, I figured I'd start working through the SICP problems and solve them either in Racket or Clojure.

While learning more about Scheme, I came across an interesting tutorial/book called "Write Yourself a Scheme in 48 Hours" which is a guide on how to make a scheme in haskell.  I started reading that after finding Real World Haskell and Learn You a Haskell online.  I just started getting into Haskell a few days ago, so the syntax is a little odd to me still.  Nevertheless, I found it less confusing than when I looked at it 2 or 3 years ago.  Haskell seemed to come up a lot while I was reading different forums or blogs on Clojure as there seemed to be quite a few comparisons.  I still don't quite understand Monads, but what I have seen so far is pretty interesting.

The idea of writing a Scheme in Haskell intrigued me more and more.  It seems like there is no language with the set of features that I'd like:

  • immutable data structures by default (not in Racket)
  • "native" continuations (not in haskell or clojure)
  • lazy evaluation by default (not in clojure, possible in Racket)
  • STM implementation (AFAIK, not in Racket)
  • lisp style macros and homoiconicity (not truly available in haskell)
  • C(++) ABI compatibility (none with c++, hardest to do in clojure)

The last bullet point is, in all honesty, probably the hardest, but should be something allowable by using LLVM to build the language.  Building a Scheme in LLVM would be no small undertaking.  C++ is a hard enough language by itself, and then learning compiler theory and the LLVM libraries on top of that would be quite a challenge.  I could in theory use Boost.Spirit (which is a C++ parser combinator library) to build the lexer/parser part for the language, but there's a part of me that wants to learn how to make my own parsers.  But if I'm willing to forgo that last bullet point, I could make a Scheme variant written in Haskell, and it would have the best of both worlds (pure functional procedures and data structures, lazy evaluation, STM, and homoiconic lisp style macros).

While all of this stuff is fascinating, in the back of my mind, there's a nagging little voice that keeps saying, "but is this going to help your career?".  Even though all these topics are fascinating, I do sometimes wonder if I wouldn't be better off learning Ruby, Hadoop, some web skills or brushing up on my Java.  After all, that's what employers want.  As I've heard many managers say, they want a candidate that can "hit the ground running".  It is a sad and unfortunate failure of corporate culture to think this way.

Literally the last day that I was working at LSI (my old company), I got an email from a recruiter at Google saying he would like to do an interview with me.  I was tremendously flattered (not to mention mad at the universe for the quirk of fate that the opportunity didn't present itself a month earlier).  About a week later, I got similar offers for job interviews at both Amazon and Facebook.  Although these opportunities were amazing, even had I still been searching for a job, I knew I was not ready for any of these technical giants.

I am not a bad software engineer, but I also believe I lack some strength in fundamentals.  What do I mean by that?  Software engineering is about more than knowing the intricacies of some programming language, or being a guru at some popular framework or library.  IMHO, there's a vast gulf between being a programmer, and being an engineer.  Programmers can churn out code.  It can even be well thought out and planned code.  But it takes a different mindset to know _how_ to tackle a problem in the most efficient and maintainable manner possible.

For example, what are the run time or space constraints for some algorithm?  Given the choice of using some data structure, how will multiple concurrent threads access that data structure?  How do you deal with constantly changing inputs to a program: do you write a regex, a parser, or just rewrite your wrappers every time?

There's a reason Computer Science majors study the basics of computation.  In fact, while studying data structures and algorithm analysis is interesting and useful in the real world, there's still far more to it than that.  For example, Graph Theory is heavily used to solve many problems, from version control systems to finding paths in a complex data structure.  Perhaps it may seem a waste of time to learn about finite automata, but without them, how do you build regular expressions?

Some might argue that the hard work has already been done.  Just grab a PCRE library or some built in library or data structure in your language and do some real work.  I've even heard some managers complain of a "research project" mentality.  The essence being, if you're not coding, designing, scoping or documenting, you're not working (i.e., there's no room for research).  I've had quite a few engineers tell me I have a much different philosophy than they do about solving problems.  My philosophy is this: if I already know a way to solve some problem that would take one hour to implement, but I hear there's another, possibly more elegant or more efficient way that would take one day to learn and implement, I'd much rather take the day and do it the more efficient or elegant way.

Now, the trick is, where's that cut off in terms of practicality/pragmatism and engineering elegance?  Management would usually say it's better to implement the rough way first, and then improve it later.  After all, that's the Agile mantra, isn't it?  Get something working first, and then make small modifications over time to improve it.  The problem is, that is a false dichotomy.  The truth is, you rarely have time to ever go back and improve upon a solution.  So the old crufty but usable implementation is what stays forever more.  As my mom and grandfather used to drill in me, "if you are going to do something, do it right the first time".  But that doesn't mean you have to build some grand cathedral or nothing at all.  You can still spend the extra time to learn the better algorithm but implement it in a simple style.

As a simple example, you might be tasked with searching for a value in some data structure.  The simplest solution is to create an array and then just search element by element.  Simple right?  But why not store it in a (balanced) binary search tree instead?  Just a tiny bit more difficulty, but now instead of an O(n) search time, you have O(log n) search time.  A self-taught programmer might not have any clue about binary search trees (I know I didn't when I was teaching myself C++).  For a more real world example, if you have a relatively complicated piece of formally structured data, do you solve the problem with a regex or with some other parser?  Take the typical case of long option parsing with getopt_long in C (and its variants in other languages).  In python for example, you might have "-f 10", "--foo 10" or "--foo=10" (and it might even have a default value so it may be missing entirely).  So how do you come up with a way to verify all the command inputs are valid, or to programmatically come up with all of them?
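To make the option example concrete, here is a minimal sketch using python's standard argparse module, which is a small parser rather than a regex.  The option name foo, the prog name "demo" and the default value of 5 are assumptions made up for illustration.

```python
import argparse


def build_parser():
    """Build a parser for one hypothetical integer option, -f/--foo."""
    parser = argparse.ArgumentParser(prog="demo")
    # One declaration covers "-f 10", "--foo 10" and "--foo=10";
    # the default (an assumed value) applies when the option is missing.
    parser.add_argument("-f", "--foo", type=int, default=5)
    return parser


def parse_foo(argv):
    """Return the integer value of foo for the given argument list."""
    return build_parser().parse_args(argv).foo


# Every spelling from the paragraph above, plus the "missing" case
for argv in (["-f", "10"], ["--foo", "10"], ["--foo=10"], []):
    print(argv, "->", parse_foo(argv))
```

An invalid value such as "--foo ten" makes argparse print a usage message and raise SystemExit, so the validation comes along with the grammar instead of being bolted on with a pile of regexes.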

There is perhaps a darker side to that nagging little voice in the back of my head that tells me I should be studying some popular web framework.  On the one hand, I would eventually like to move back closer to my family in Florida.  The problem is, Florida's job market is pretty much limited to being a web guy (back or front end), a database guy, or something in the military industrial complex.  While none of those are particularly bad, I don't have the skills in web or database technologies, and the military industrial complex is very fickle from what I have heard.  The other insidious little aspect of this annoying voice is that learning frameworks and libraries is the easy part.  From my experience, it just takes time and some practice to learn some framework.  But learning the fundamentals of CS is _hard_.  In fact, I've never been as challenged in my work career as I was in school.  I've often found that an interesting conundrum.  Why isn't work as hard as school?

So despite that little voice, I will continue on training my fundamentals.  I remember many years ago, I was taking a class in Choy Lay Fut, a "modern" kung fu style that was amalgamated from several other styles during the mid 1800s.  For two months, literally, all I did was stances; horse stance, bow stance, cat stance, etc.  But I knew why the sifu was doing that.  Without a solid foundation, it is easy for your opponent to take advantage of you.  Everyone wants to get to the kicks and punches and blocks, but without a solid grounding, it's all useless.  As another martial arts analogy, I remember while taking Aikido, we were all told to breathe in, relax and in a slightly low posture, the sensei would come by and gently push on the center of our sternum.  It was easy to resist.  He did this several times, and then one time, he very lightly tapped us on the forehead, and a feather weight touch to our sternum toppled us very easily.  Lesson learned....your foundation is all important, and don't let anything distract your mind.

Saturday, October 26, 2013

Something like OSGi for python

Ok, so I have been knee deep into python again for my new job as an automation engineer at Fusion-io.  Although python is a nice language for small to medium sized projects, when you start getting into the middle tens of thousands of lines of code, python's "convention over bondage" philosophy starts to wear a little thin.

One problem I have been thinking of tackling is creating an OSGi like component based framework for python.  If you aren't familiar with OSGi for Java, it's a modularity system for Java that among other things, helps resolve CLASSPATH issues, versioning problems between jars, dynamic lifecycle of modules, and helps design more modular and reusable code.  Perhaps you are thinking that python already has modules and packages, and this isn't necessary.  But hold on, let's examine this little problem in more detail.

It's quite common in the Java world to hear the expression, "program to an interface, not an implementation". Java, lacking multiple-inheritance, has Interfaces to help resolve that problem.  But an interface's true strength is that it enforces the set of behavior that something should expose.  The idea is that the implementation should be a black box to the client programmer...he just wants the functionality exposed from the service.

You might for example have many different ways of retrieving an object over the net.  You might use FTP, or SFTP, SCP, http, or one of many ways.  Even amongst these choices there are subchoices.  Do you use python's built in ftp library?  Or do you have some ftp client utility that you want to wrap using subprocess?  Do you see the problem now?

The implementations change, but they all have a set of common functionality; the interface.  That is what you program to...the interface...not the module or class that exposes the concrete implementation of some functionality.  Being able to code against an interface allows you to swap out the underlying implementation.  Perhaps you are wondering why you would need to do that?

It is a sad fact that requirements are always changing or software breaks.  Perhaps some tool you are using changed its output and you were screen scraping, or the function definition of some remote API you were using changed.  Decoupling the interface from the implementation eases these kinds of problems.  It also allows you to create an extension or plugin based system.

I can already hear some people say that python has no need for a plugin system.  Its dynamic nature makes plugins a "built in" feature of the language.  That's true, but it's also like the wild west.  Even python itself realized this with the creation of abstract base classes.  By creating an ABC, it allows a python developer a way to specify that a subclass must override and implement a set of methods.  If you use monkey patching to "plugin" new functionality, there's no way to guarantee that whatever got monkey patched is honoring the interface's contract.
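Here is a minimal sketch of that idea using the standard abc module.  The Fetcher interface and the two subclasses are hypothetical names invented for this example (loosely following the FTP/SFTP scenario above); no real network access happens.

```python
from abc import ABC, abstractmethod


class Fetcher(ABC):
    """Hypothetical interface: anything that can retrieve an object by URL."""

    @abstractmethod
    def fetch(self, url):
        """Return the retrieved content as bytes."""


class FakeFtpFetcher(Fetcher):
    """Stand-in implementation; a real one might wrap ftplib or subprocess."""

    def fetch(self, url):
        return ("ftp contents of " + url).encode()


class BrokenFetcher(Fetcher):
    """Forgets to implement fetch(), so it stays abstract."""


def can_instantiate(cls):
    """Return True if cls can be instantiated, False if the ABC check rejects it."""
    try:
        cls()
        return True
    except TypeError:
        return False


print(can_instantiate(FakeFtpFetcher))
print(can_instantiate(BrokenFetcher))
```

The check happens when the class is instantiated, not when it is defined, and it only verifies that the method names are overridden.  It says nothing about whether the behavior honors the contract.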

Let's consider a more concrete problem.  Look at the example dependency diagram below:


So we have a module Work.py, that has a dependency on Command.py and Logger.py.  The problem is that Command.py also has a dependency on Logger.py, but of a different version.  At some point, Logger went through some kind of API breaking change, and Work uses the older version, but Command uses the newer one.  How can you load the Logger module in python?

PYTHONPATH won't work (nor with a .pth file).  Neither will locally installing Logger.py to different locations (instead of site-packages).  You can't for example, sys.path.extend(["/path/to/1.1/Logger.py", "/path/to/2.0/Logger.py"]).  Neither pip nor setuptools will help you because they will simply overwrite the module in site-packages (at least, AFAIK).  And finally, virtualenv or python3's new venv won't help either.

Why will none of these "standard" solutions work?  The problem is how __import__ works and how it loads module objects into sys.modules.  I'll be speaking here from a python3 perspective, as this is what I have done my research on.  It will be highly beneficial for readers to look at the python docs for the import system and importlib.  When a program calls import, three basic operations are done:

1. Find the module
2. Create the module object
3. Bind the object into a global module namespace

The first two are handled by the builtin __import__(), and the last is done by the import statement.  So the first thing python will do is look in the global module namespace to see if the module is already defined there.  If it finds it, it will return it.  This global module namespace can be seen by printing out sys.modules.  The key to the problem is in #3...python only has a global namespace for modules.  If you have two modules with the name Logger, the second time you try to load Logger, you will in fact get the first one.  Aliasing the module with:

    import Logger as Logger_ver2

will not work.  So is the problem solvable?  Fortunately it is.  You can use either the importlib or imp module to get around the problem, though it is a bit odd.  The key to understanding how this works is in knowing that the import system uses Finders and Loaders.  Finders determine where to look and check whether a file really is a module (or package).  If the find is successful, the Finder returns a Loader object, and this Loader object actually returns the module object so that it can be used.  Using this method, you can give the module a name other than the filename.  So you could for example load the module into sys.modules["Logger2"].

    import importlib  # find_loader() is a function in importlib, not a submodule

    # NOTE: importlib.find_loader() was deprecated in python 3.4 and removed in 3.12
    loader = importlib.find_loader("Logger", path=["/path/to/1.1"])
    loader.name = "Logger1"  ## necessary since sys.modules is a flat namespace
    logger_v1 = loader.load_module()
    logger = logger_v1.get_logger(__name__, "DEBUG")

    loader = importlib.find_loader("Logger", path=["/path/to/2.0"])
    loader.name = "Logger2"  ## change module name here too
    logger_v2 = loader.load_module()
    logger2 = logger_v2.get_logger(__name__, "DEBUG", "/path/to/logfiles")


The solution is a bit ugly, but it is possible. A more elegant solution would be to derive your own PathFinder and Loader classes.
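Since importlib.find_loader() was deprecated in python 3.4 and removed in 3.12, here is a rough sketch of the same trick using importlib.util instead.  This swaps in a different API from the one I used above.  The load_as() helper, the temporary directory and the VERSION attribute are all hypothetical stand-ins so the example can run on its own.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_as(alias, file_path):
    """Load the source file at file_path and register it in sys.modules as alias."""
    spec = importlib.util.spec_from_file_location(alias, str(file_path))
    module = importlib.util.module_from_spec(spec)
    sys.modules[alias] = module  # distinct key, so the two versions cannot collide
    spec.loader.exec_module(module)
    return module


# Hypothetical stand-ins for the two incompatible Logger.py versions
root = Path(tempfile.mkdtemp())
for version in ("1.1", "2.0"):
    (root / version).mkdir()
    (root / version / "Logger.py").write_text('VERSION = "%s"\n' % version)

logger_v1 = load_as("Logger1", root / "1.1" / "Logger.py")
logger_v2 = load_as("Logger2", root / "2.0" / "Logger.py")
print(logger_v1.VERSION, logger_v2.VERSION)
```

Both files are named Logger.py, but because each one is registered under its own key, Work could hold logger_v1 while Command holds logger_v2.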

However, this is just one small part of what a component based framework would have to provide. For example, how do you encourage programming to an interface and not an implementation? Isn't duck-typing enough? Sadly, no, it's not. Remember, an interface doesn't just specify the function signature, it also encapsulates the "what". An interface is supposed to hide the "how", but two classes that both have a run() method may not mean the same thing by it: for one class it might mean launching a new process/thread, and for another it might mean something totally different, such as telling the same process to start. In other words, a good interface doesn't just mimic the same function name, arguments and return type. It also exhibits the same behavior. How that behavior is achieved is irrelevant, but we need to make sure that everyone is saying the same thing.

Then there's the matter of dependency injection. Fortunately, this is much easier to do in a dynamically typed language like python, but it still requires some thought, including about what exactly a person means by "dependency injection". While decorators are powerful, they lack one crucial feature. Decorators can modify what happens before or after a function call. They can modify arguments to the function call. They can even replace the wrapped function entirely, or not run it at all. They can change the return type of a function. But what they cannot do is modify what happens _inside_ the function. Unless you have the luxury of developing in a lisp dialect with their uber powerful macros, this will remain out of reach. The closest you can come to achieving this is to alter the AST of the function, which requires some pretty deep knowledge of python internals. So how can you "truly" insert the dependency into a function? Fortunately, one thing decorators can do is change the arguments and even environment of a wrapped function.
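As a rough sketch of injecting a dependency by changing the arguments, a decorator can supply a keyword argument that the caller never has to pass.  The inject() decorator, ListLogger and do_work() are hypothetical names made up for this example, not part of any framework.

```python
import functools


def inject(**dependencies):
    """Decorator factory: supply the given keyword arguments to the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            merged = dict(dependencies)
            merged.update(kwargs)  # anything the caller passes explicitly wins
            return func(*args, **merged)
        return wrapper
    return decorator


class ListLogger:
    """Toy logger that just remembers messages, so the effect is easy to see."""

    def __init__(self):
        self.lines = []

    def info(self, message):
        self.lines.append(message)


shared_logger = ListLogger()


@inject(logger=shared_logger)
def do_work(task, logger):
    logger.info("working on " + task)
    return task.upper()


print(do_work("build"))
print(shared_logger.lines)
```

The function body only ever talks to whatever "logger" it was handed, so swapping the implementation is a one line change at the decorator (or a keyword argument in a test).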

Another aspect of many component based frameworks is the ability to dynamically add or remove components during the lifetime of a program. Any plugin system worthy of the name must be able to insert a plugin and provide the features the plugin exposes. A more difficult matter is how to remove a plugin, as other plugins may now have a dependency on it. So a graceful removal of all dependent plugins must be considered. Here again though the module system of python leaves a few things to be desired. sys.modules acts as a cache, and while it is possible to reload a module, entirely removing a module from the namespace is more tricky than just deleting the entry (after all, sys.modules is just a dict, and other code may still hold references to the module object).
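A small sketch of why deleting the sys.modules entry is not enough.  The demo_plugin module here is a throwaway created in memory purely for illustration.

```python
import sys
import types

# Build a throwaway module in memory and register it, as an import would
plugin = types.ModuleType("demo_plugin")
plugin.greet = lambda: "hello from plugin"
sys.modules["demo_plugin"] = plugin

import demo_plugin  # found in the sys.modules cache, no file lookup needed

# Some other component grabs a reference to the plugin's function
held_reference = demo_plugin.greet

# "Unload" the plugin by deleting the cache entry
del sys.modules["demo_plugin"]

still_cached = "demo_plugin" in sys.modules
print(still_cached)
print(held_reference())  # the old code is still alive and callable
```

Deleting the entry only means the next import will look for the module again.  Every component that already holds a reference keeps the old objects alive, which is why a real framework needs some kind of lifecycle tracking on top.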

Frankly I'm a little surprised at the lack of these "enterprise" features in python. While it's possible to do all the above, it requires some work. Indeed, there have been attempts by other python projects to provide the basis for these kinds of frameworks. Perhaps over time, I can contribute myself.