Saturday, October 26, 2013

Something like OSGi for python

Ok, so I have been knee deep in python again for my new job as an automation engineer at Fusion-io.  Although python is a nice language for small to medium sized projects, once you get into the mid tens of thousands of lines of code, python's "convention over bondage" philosophy starts to wear a little thin.

One problem I have been thinking of tackling is creating an OSGi-like component based framework for python.  If you aren't familiar with OSGi, it's a modularity system for Java that, among other things, helps resolve CLASSPATH issues and versioning conflicts between jars, manages the dynamic lifecycle of modules, and encourages more modular and reusable code.  Perhaps you are thinking that python already has modules and packages, and that this isn't necessary.  But hold on, let's examine this little problem in more detail.

It's quite common in the Java world to hear the expression, "program to an interface, not an implementation".  Java, which lacks multiple inheritance, has interfaces partly to compensate for that.  But an interface's true strength is that it specifies the set of behavior that something must expose.  The idea is that the implementation should be a black box to the client programmer...he just wants the functionality exposed by the service.

You might, for example, have many different ways of retrieving an object over the net.  You might use FTP, SFTP, SCP, HTTP, or one of many other ways.  Even amongst these choices there are subchoices.  Do you use python's built-in ftp library?  Or do you have some ftp client utility that you want to wrap with subprocess?  Do you see the problem now?

The implementations change, but they all share a set of common functionality: the interface.  That is what you program to...the interface...not the module or class that exposes the concrete implementation of some functionality.  Being able to code against an interface allows you to swap out the underlying implementation.  Perhaps you are wondering why you would ever need to do that?

It is a sad fact that requirements are always changing and software breaks.  Perhaps some tool you were screen scraping changed its output, or the function definition of some remote API you were using changed.  Decoupling the interface from the implementation eases these kinds of problems.  It also allows you to create an extension or plugin based system.

I can already hear some people say that python has no need for a plugin system.  Its dynamic nature makes plugins a "built-in" feature of the language.  That's true, but it's also like the wild west.  Even python itself recognized this with the creation of abstract base classes.  An ABC gives a python developer a way to specify the set of methods that a subclass must override and implement.  If you use monkey patching to "plug in" new functionality, there's no way to guarantee that whatever got monkey patched is honoring the interface's contract.
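
To make that concrete, here is a rough sketch of how an ABC could pin down the interface for the file-retrieval example above (the Fetcher class and its method are my own hypothetical names, not from any existing library):

from abc import ABCMeta, abstractmethod

class Fetcher(metaclass=ABCMeta):
    """The interface: retrieve a remote file, however the implementation likes."""

    @abstractmethod
    def fetch(self, url, dest):
        """Download url to the local path dest and return dest."""

class FtpFetcher(Fetcher):
    """One "how": python's built-in ftp library."""
    def fetch(self, url, dest):
        ...  # talk to the server with ftplib here

class SubprocessFetcher(Fetcher):
    """Another "how": wrap an external client utility with subprocess."""
    def fetch(self, url, dest):
        ...  # shell out to an scp/ftp client here

# Instantiating Fetcher itself, or a subclass that forgets to implement fetch(),
# raises a TypeError.  Clients only ever program against Fetcher.fetch(), so the
# implementations can be swapped without touching the calling code.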

Let's consider a more concrete problem.  Look at the example dependency diagram below:
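
    Work.py ---> Logger.py (version 1.1)
    Work.py ---> Command.py ---> Logger.py (version 2.0)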


So we have a module Work.py that has a dependency on Command.py and Logger.py.  The problem is that Command.py also has a dependency on Logger.py, but on a different version.  At some point, Logger went through some kind of API-breaking change, and Work uses the older version while Command uses the newer one.  How can you load both versions of Logger in python?

PYTHONPATH won't work (nor will a .pth file).  Neither will locally installing Logger.py to different locations (instead of site-packages).  You can't, for example, sys.path.extend(["/path/to/1.1/Logger.py", "/path/to/2.0/Logger.py"]) and expect both to load.  Neither pip nor setuptools will help you, because they will simply overwrite the module in site-packages (at least, AFAIK).  And finally, virtualenv and python3's new venv won't help either, because any single environment still holds only one Logger.

Why will none of these "standard" solutions work?  The problem is how __import__ works and how it loads module objects into sys.modules.  I'll be speaking here from a python3 perspective, as this is what I have done my research on.  It will be highly beneficial for readers to look at the python docs for the import system and importlib.  When a program imports a module, three basic operations are done:

1. Find the module
2. Create the module object
3. Bind the object into a global module namespace

The first two are handled by the builtin __import__(), and the last is done by the import statement.  The first thing python does is check whether the name is already bound in the global module namespace; if it is, that module object is simply returned.  This global module namespace can be seen by printing out sys.modules.  The key to the problem is in #3...python only has a single, global namespace for modules.  If you have two different modules named Logger, the second time you try to load Logger you will in fact get the first one.  Aliasing the module with:

    import Logger as Logger_ver2

Will not work.  So is the problem solvable?  Fortunately it is.  You can use either the importlib or imp module to get around it, though it is a bit odd.  The key to understanding how this works is knowing that the import system uses Finders and Loaders.  A Finder does the job of determining where to look and checking whether a file really is a module (or package).  If the find is successful, the Finder returns a Loader object, and the Loader actually creates the module object so that it can be used.  Using this mechanism, you can register the module under a name other than its filename.  So you could, for example, load the module into sys.modules["Logger2"].

import importlib

# Load version 1.1 of Logger under the name "Logger1"
loader = importlib.find_loader("Logger", path=["/path/to/1.1"])
loader.name = "Logger1"  ## necessary since sys.modules is a flat namespace
logger_v1 = loader.load_module()
logger = logger_v1.get_logger(__name__, "DEBUG")

# Load version 2.0 of Logger under the name "Logger2"
loader = importlib.find_loader("Logger", path=["/path/to/2.0"])
loader.name = "Logger2"  ## change the module name here too
logger_v2 = loader.load_module()
logger2 = logger_v2.get_logger(__name__, "DEBUG", "/path/to/logfiles")


The solution is a bit ugly, but it works. A more elegant approach would be to derive your own PathFinder and Loader classes.
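
As a rough sketch of that idea (VersionedFinder and its alias map are my own hypothetical names, and this uses the spec-based finder API that importlib gained in later python3 releases), a meta path finder can map an aliased name to a specific source file so that both versions live in sys.modules at once:

import importlib.abc
import importlib.util
import sys

class VersionedFinder(importlib.abc.MetaPathFinder):
    """Maps alias names (e.g. "Logger1") to specific source files."""

    def __init__(self, versions):
        self._versions = versions  # alias -> path to a .py file

    def find_spec(self, fullname, path=None, target=None):
        if fullname in self._versions:
            # The spec carries the alias, so the module is cached in
            # sys.modules under "Logger1"/"Logger2" rather than "Logger".
            return importlib.util.spec_from_file_location(
                fullname, self._versions[fullname])
        return None  # defer to the normal import machinery

sys.meta_path.insert(0, VersionedFinder({
    "Logger1": "/path/to/1.1/Logger.py",
    "Logger2": "/path/to/2.0/Logger.py",
}))

import Logger1   # resolves to version 1.1
import Logger2   # resolves to version 2.0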

However, this is just one small part of what a component based framework would have to provide. For example, how do you encourage programming to an interface and not an implementation? Isn't duck-typing enough? Sadly, no, it's not. Remember, an interface doesn't just specify a function signature, it also encapsulates the "what". An interface is supposed to hide the "how", but just because two classes both have a run() method doesn't mean they mean the same thing by it: for one class it might mean launching a new process or thread, while for another it might mean something totally different, like telling the same process to start. In other words, a good interface doesn't just mimic the same function name, arguments and return type. It also exhibits the same behavior. How that behavior is achieved is irrelevant, but we need to make sure that everyone is saying the same thing.
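
A tiny illustration of the point (both class names here are hypothetical): each class passes a duck-typed check for run(), yet they promise very different things.

import subprocess

class ProcessRunner:
    def run(self, cmd):
        # "run" here means: spawn a separate process and block until it exits
        return subprocess.call(cmd)

class FlagRunner:
    def __init__(self):
        self.started = False

    def run(self, cmd):
        # "run" here means: merely mark this object as started; nothing executes
        self.started = True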

Then there's the matter of dependency injection. Fortunately, this is much easier to do in a dynamically typed language like python, but it still requires some thought, and it depends on what exactly a person means by "dependency injection". While decorators are powerful, they lack one crucial feature. A decorator can modify what happens before or after a function call. It can modify the arguments to the call. It can even replace the wrapped function entirely, or not run it at all. It can change the return value of a function. But what it can not do is modify what happens _inside_ the function. Unless you have the luxury of developing in a lisp dialect with its uber-powerful macros, this will remain out of reach. The closest you can come is altering the AST of the function, which requires some pretty deep knowledge of python internals. So how can you "truly" insert the dependency into a function? Fortunately, one thing decorators can do is change the arguments, and even the environment, of a wrapped function.
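
As a sketch of that last idea (inject and do_work are my own hypothetical names), a decorator can supply a dependency through a keyword argument so the function never has to construct it itself:

import functools
import logging

def inject(name, factory):
    """Supply factory() as keyword argument `name` unless the caller passes one."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            kwargs.setdefault(name, factory())
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject("logger", lambda: logging.getLogger("work"))
def do_work(task, logger=None):
    # The dependency arrives as an argument; swapping implementations only
    # means changing the factory given to @inject, not this function's body.
    logger.info("starting %s", task)

do_work("backup")                                    # uses the injected default
do_work("backup", logger=logging.getLogger("test"))  # caller overrides it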

Another aspect of many component based frameworks is the ability to dynamically add or remove components during the lifetime of a program. Any plugin system worthy of the name must be able to insert a plugin and provide the features the plugin exposes. A more difficult matter is removing a plugin, since other plugins may by then depend on it, so a graceful removal order must be considered. Here again, though, the module system of python leaves a few things to be desired. sys.modules acts as a cache, and while it is possible to reload a module, entirely removing a module from the namespace is trickier than just deleting the entry (after all, sys.modules is just a dict, and other modules may still hold references to what was deleted).
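
A small sketch of why deleting the cache entry isn't enough (my_plugin is a hypothetical module name):

import importlib
import sys

plugin = importlib.import_module("my_plugin")    # hypothetical plugin module
held_elsewhere = plugin                          # some other component keeps a reference

del sys.modules["my_plugin"]                     # removes it from the cache only

reloaded = importlib.import_module("my_plugin")  # builds a brand new module object
assert reloaded is not held_elsewhere            # the old object is still alive and in use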

Frankly, I'm a little surprised at the lack of these "enterprise" features in python. While it's possible to do all of the above, it requires some work. Indeed, there have been attempts by other python projects to provide the basis for these kinds of frameworks. Perhaps over time, I can contribute myself.

1 comment:

  1. Thanks much for your post, your examples are helpful to convince the doubtful (me included) about the benefits of a component based framework.
    Did you have a look at iPOPO (a component model for Python): https://ipopo.coderxpress.net/wiki/doku.php ?
