Under A Boddhi Tree: 2011

Saturday, October 29, 2011

Using emacs and leiningen on Windows for clojure pt. 2

Setting up leiningen

So in the last post, I described how to set up emacs and leiningen for Windows. We're actually not quite done with some of the tools we'll need. In the last post, we installed a package for emacs called the slime-repl. Now it's time to install a plugin for leiningen that lets SLIME talk to the clojure REPL.

There's a nifty plugin called swank-clojure that allows the SLIME protocol to talk to a running clojure process and interact with it. I can't go into too much detail on all the commands of SLIME itself since I'm still learning it as well, but I will show you how to install it and run a REPL with it.

The first step is installing swank-clojure. You can even do this from within emacs itself. To do this, type in M-x eshell in your minibuffer. You will get an emacs version of the command console. It's not exactly the cmd console, and if you've ever used MSYS or cygwin shells, it's kind of similar. For example, instead of using the command dir to list directories, you use ls. Also, to tab complete a directory, you use posix style forward slashes instead of back slashes (RANT ON: I have spent weeks at my job discovering and working around asinine Microsoft pathing issues. Whoever thought that having spaces in directory paths, or using backslashes...which are also escape characters...as path separators, needs to be forced to listen to Justin Bieber and Rebecca Black songs continuously for 24hrs with no breaks...RANT OFF).

In the eshell, you can run the command to install a leiningen plugin like this:

lein plugin install swank-clojure 1.3.3

Assuming you have no proxies (or your proxy for leiningen is correctly set up), and leiningen was successfully installed previously (did you remember to install curl or wget in your PATH?), then the command above will install the swank-clojure plugin for you.

Creating a leiningen project

So now we get to learning what leiningen provides for you. If you are a Java programmer, odds are that you are familiar with Maven (or maybe Ivy). Leiningen is for clojure what Maven is for Java. Indeed, you could use Maven for your clojure projects, but leiningen does many clojure specific things for you. It will also drastically reduce the time required to set up a clojure project.

The first step is to create your project. Leiningen has a command called new which sets up a base project for you. For example, if I am in directory C:\Users\stoner (my name is Sean Toner, thank goodness it's not Frank Ucker :) ), and I use the command:

lein new MyFirstCljProject

then leiningen will create a directory named MyFirstCljProject, and you will see a bunch of files and directories inside of it.

Looking into NoSQL

I've been looking into Graph Databases lately, mainly because they just seem interesting to me. My SQL skills are very basic. I've designed some rudimentary tables that have some basic constraints, and the foreign key to primary key relations between tables have usually only involved 3-4 columns at most. I can do your basic queries, including simple (inner) joins between 2 or 3 tables. And I can do basic inserts and updates. But if you start asking me how transactions work, what the best connection pool policy is, or anything more advanced than that, I will just give you a blank stare. Because I use SQL so infrequently, even simple things like updating a record in a table requires me to look up the command.

My team lead at work has said that people he's trained either get relational DB or they don't. Apparently, I "get" it, but it still seems a bit awkward to me. Especially the syntax of SQL. I mean, a case-insensitive language (yeah yeah, LISP is case-insensitive too)? Also, plain SQL queries don't feel like a full programming language to me. You don't get the usual general-purpose branches or loops, for example. I also had trouble modeling a tree type structure in a classic SQL database.

So out of curiosity, I started looking at NoSQL. NoSQL is a family of non-RDBMS style database technologies, including key-value stores (Redis), document databases (CouchDB and MongoDB), column family stores (HBase, BigTable), and graph databases (Neo4j, OrientDB) among others. The graph databases seemed to pique my interest for some reason, so I decided to investigate them.

As a consequence, I've been looking at Neo4j and OrientDB. Essentially, graph databases are, well....graphs :) You have Nodes (Vertexes), Edges (Relationships), and properties. As I understand it, these graphs are all directed, but the direction can go either way. What interested me about this style of modeling data is that the relationships are inherent in the graph itself. In other words, the traversal of the graph to get from one Node to another Node is your relationship. I immediately started seeing the advantages of this.

For example, imagine installing software. This is a graph structure, and in fact, it is a directed acyclic graph (DAG). Let's say you need to install a python script that requires sqlalchemy. What would this look like?

The edges are semantic notation, but notice how they are a relation? By traversing the graph, we can see how to install script.py. If the relation is either REQUIRES or INSTALLS_BY, then these are dependencies. I could have set a property inside either the edge or the node as to an installation rule. For example, in the pymysql node, I could have a property like:

install_rule: 'pip install pymysql'

And for sqlalchemy, it would be the same:

install_rule: 'pip install sqlalchemy'

But what about python itself? Or pip? Either the graph could be extended so that those nodes get their own INSTALLS_BY edge, or the logic of the traversal could say, "if this is an end Node, look up the property 'install_rule', and use that for installation."
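To make the thought experiment a bit more concrete, here is a minimal sketch in plain Python (Python 3 syntax, no graph database involved). The exact nodes and edges are my own assumption of what such a graph could look like, and install_order is just a hypothetical helper name. It walks the graph depth-first so that every dependency shows up before the thing that needs it.

```python
# Hypothetical dependency graph: node -> list of (relation, other node).
# The node names, edge labels and rules are illustrative assumptions only.
GRAPH = {
    "script.py": [("REQUIRES", "sqlalchemy"), ("REQUIRES", "pymysql")],
    "sqlalchemy": [("INSTALLS_BY", "pip")],
    "pymysql": [("INSTALLS_BY", "pip")],
    "pip": [("REQUIRES", "python")],
    "python": [],
}

# The install_rule property, stored per node.
INSTALL_RULES = {
    "sqlalchemy": "pip install sqlalchemy",
    "pymysql": "pip install pymysql",
}


def install_order(graph, target):
    """Depth-first walk that lists every dependency before its dependents."""
    order = []
    seen = set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for _relation, dependency in graph[node]:
            visit(dependency)
        order.append(node)

    visit(target)
    return order


order = install_order(GRAPH, "script.py")
for node in order:
    print(node, "->", INSTALL_RULES.get(node, "(no install_rule property)"))
```

Because the graph is a DAG, the walk always terminates, and the seen set keeps a shared dependency like pip from being listed twice.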

Well, that was just a thought experiment, but it looks way easier to do that than implement the equivalent in a SQL Table (or two). Not to mention that a graph database is reportedly much faster for this kind of traversal (from what I've read, it's on the order of 1000x to 1500x faster).

Neo4j seems to have a cleaner architecture and is better documented, but it is GPLv3 only. OrientDB seems to have many more features (it is both a document db and a graph db, and it also has a subset extension of SQL to look up relations), and it uses a more permissive Apache license. Another advantage I noticed in Neo4j is that it is thread-safe...multiple threads can access the same graphdb object (probably because neo4j enforces that all mutation must occur inside a transaction).

Sadly, all the different Open Source licenses really have me confused. I'm not even sure whether my project can mix some parts that are under EPL, some parts that are under BSD, and some parts under GPLv3.

Tuesday, October 18, 2011

Using emacs and leiningen on windows for clojure: pt. 1

I have three books on clojure currently, Stuart Halloway's Programming Clojure (1st edition), Amit Rathore's Clojure in Action, and Michael Fogus and Chris Houser's The Joy of Clojure. Unfortunately, none of these books really tell you how to get started. They talk about the language of course, but setting up your development environment is not really covered in any of these books.

There is a pretty good reference site on getting started with any of the major IDEs available for clojure. I decided I would learn emacs, since it's something I've been meaning to do for a long time. I have used Eclipse's counterclockwise plugin, but for some reason, I just didn't really feel like using it. Likewise, I didn't like how Enclojure forced you to use maven for your clojure project. I didn't play with IDEA's La Clojure plugin, so it may be worth investigating. And for quick down and dirty experimenting, the clojure shell for jEdit really isn't that bad.

But as I mentioned, I wanted to use emacs. Because I have a brand new motherboard that uses UEFI, I have had problems finding a linux distro I can install on it (I tried Mint and Sabayon, and neither of them liked my new hardware). So I decided to run emacs on Windows. Also, I might be able to help people who are also using emacs on windows, and run into trouble when the documentation clearly only gives instructions for linux users.

Before I go too much further into emacs, let's install a different piece of software for your clojure development which we will use with emacs: leiningen.

Setting up leiningen

Leiningen is a project management tool for clojure, much like Maven is for java (at least until polyglot maven comes out). Indeed, leiningen uses some maven for dependency management behind the scenes. But instead of using XML as the declarative language for your project, you use clojure itself! Leiningen offers many nifty features, including but not limited to:

Dependency Management
Can perform AOT compiling if indicated
Can launch a REPL
Can start a swank session

For now though, let's just install leiningen. It's a little easier for linux users since every distribution I know of comes with either wget or curl. For windows, you'll need to grab one of them. I used curl myself. Both are just .exe files, and they don't have or require installers. Just make sure you put the .exe file somewhere in the system's %PATH%.

Once you have wget or curl in %PATH%, you can install leiningen. Grab the stable leiningen for windows, and put the lein.bat file somewhere in %PATH% again. Once this is done, run the following command in a console:

lein self-install

If you are behind a proxy, make sure you read the directions on the leiningen site about how to specify your proxy. Otherwise, we are done for now.

Python and Clojure comparisons: functional programming

Doing these comparisons actually helps me learn clojure since I'm more familiar with python. I figure that this is probably true for a lot of other people, so hopefully this will help others learn clojure as well. OTOH, what I will cover here is more of a functional style of python programming that people might not be used to. So I hope this will also help people figure out the functional style of programming along with me. And as a warning, I am no functional guru. I'm a guy whose first language was C++, then C, then python, then Java...and I forgot the order of everything else. I've only recently begun programming in a functional style. But I hope that since I am new at this, I'll be in the same boat as others trying to learn this way of programming as well.

So in this entry, I'll cover some functional style programming between the two languages. But before I go too deep, let me begin with a brief introduction on my own understanding of functional programming. As the name suggests, functional programming puts functions as first class citizens. This means that functions can be passed as arguments to functions, or they can be returned as values. Functions can also "close over" data that was in scope when the function was defined, and this is what the term closure means (and where the play on words, clojure, comes from). Functional programming also takes after math. In math, functions take parameters belonging to some domain, and map an element from the domain to a range. Therefore, functions in functional programming languages seem to frown upon taking no arguments and returning void (returning nothing). Imagine a mathematical function:

nil = f()

What would that even mean? But now I'll go over some common ground between the two languages.
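As a quick illustration of the "first class citizen" and closure ideas above, here is a small sketch in Python (Python 3 syntax). The names make_adder and apply_twice are made up for this example.

```python
def make_adder(n):
    """Return a new function that remembers n (it closes over n)."""
    def add(x):
        return x + n
    return add


def apply_twice(f, x):
    """Take a function as an argument and call it twice."""
    return f(f(x))


add_five = make_adder(5)          # a function returned as a value
print(add_five(10))               # 15
print(apply_twice(add_five, 1))   # 11
```

Here make_adder returns a function, apply_twice receives one, and add keeps using n even after make_adder has returned.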

In clojure, it is often idiomatic to use the functions map, apply, reduce, or filter to solve problems. When I first was learning clojure, I had some difficulty in understanding the difference between the first three of these functions.

The map function takes as parameters, a function, and one or more collections. The number of collections passed in depends on the function that is passed in to map. If the function takes 2 arguments, then you pass in 2 collections. If the function takes 3 args, then you pass in 3 collections, etc etc. The map function will take the 1st element in each collection, and pass them as arguments to the function we are using. Here's a trivial example in clojure:

(map #(+ % %2 %3) [ 1 2 3 ] [4 5 6] [10 20 30])

Notice the #(+ % %2 %3) anonymous function. This is idiomatic clojure's way of defining an anonymous function. However, I could have also done this:

(fn [ f s t ] (+ f s t))

Anonymous functions in clojure are similar to lambdas in python but more powerful. Lambdas in python can only contain a single expression, thus limiting their power. The equivalent lambda in python would look like this:

lambda x, y, z: x + y + z

But getting back to clojure's map function, the return of map is a lazy sequence containing all the mappings of collection values passed to the function inside of the map. One key to remember here is that we return a lazy sequence, and as such, sometimes it might be necessary to iterate through all values (using a doall).

So how about python's map? Python's map is similar except that it doesn't return a lazy sequence (a generator in python terms). The equivalent python code of the above would be this:

map(lambda x,y,z: x + y + z, [1, 2, 3], [ 4, 5, 6], [10, 20,30])

In python, you might be tempted to do this instead and it would return the same result:

[ x+y+z for x,y,z in zip([1,2,3], [4,5,6], [10,20,30]) ]

However, this has a disadvantage compared to the clojure way. The disadvantage is that clojure's map returns a lazy sequence, but neither python's map, nor its list comprehension is lazy. This means that for huge collections, you may run out of memory. However, there IS a way to create a lazy sequence in python, but don't use map() and don't use the list comprehension above either. Imagine if I had done this:

def increment(start=0, inc=1):
    i = start
    while True:
        yield i
        i += inc

def get_item(gen, i):
    res = gen.next()
    x = 0
    while x != i:
        res = gen.next()
        x += 1   # without this the loop would never end
    return res

needed_value = get_item(increment(), 100000) ## if you have a huge sequence

By calling get_item(), we only get the value of gen at some iteration. The advantage to this style is that since it is using a generator, it is not holding the entire list in memory. If I had instead built the values from something like xrange and used list() to realize them all, a huge sequence could eat up all my memory, so the generator can be a lifesaver. This same concept applies to clojure's lazy sequences. Rather than hold the entire sequence in memory, a lazy sequence only holds as much as you require. Sometimes, when you need to iterate over the whole sequence, you have to force the evaluation using doall.

Update:
So python also has generator expressions as well as list comprehensions. Although the code above shows how a generator is manually created, there's a simpler syntax.

for term in (x+y+z for x,y,z in zip([1,2,3], [4,5,6], [10,20,30])):
    print term

Notice the use of () parens instead of [] that list comprehensions use. The above code would create a generator and the for term in would operate on this generator. This way you get the advantage of a lazy evaluator.
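To see that a generator expression really is lazy, here is a small sketch (Python 3 syntax, so next(gen) instead of gen.next()). The noisy helper is made up for this example and simply records each value at the moment it is actually produced.

```python
def noisy(values, log):
    """Yield each value, recording it in log when it is actually produced."""
    for v in values:
        log.append(v)
        yield v


log = []
gen = (x * 2 for x in noisy([1, 2, 3], log))

before = list(log)   # nothing has been pulled through yet
first = next(gen)    # pulls exactly one value
after = list(log)    # only the first input has been touched

print(before, first, after)
```

Creating the generator does no work at all. Each next() call pulls just one value through, which is the same on-demand behavior you get from a lazy sequence in clojure.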

So that covers map...what about reduce? Perhaps you've heard of the map/reduce algorithm used by projects from Google or Hadoop. We've just seen above what map does. It applies values from one or more collections, and passes them to a function, and the return values are put inside of a sequence. The reduce method is similar, but it takes a specific kind of function. This function must take 2 arguments. Also, reduce only takes one collection. So you will often see the result of a map, applied to a reduce since the map call returns a collection, and reduce takes only one collection. The result of a reduce is a scalar, not a collection of some sort.

As its name suggests, reduce reduces a collection one by one in order to arrive at a scalar value. The classic example of reduce is a summation. For example, you can sum the series of numbers like this:

(reduce #(+ % %2) (take 10 (iterate inc 1)))

That would sum the numbers 1 - 10. If you're not familiar enough with clojure yet, I'll break it down. The inner (iterate inc 1) returns a lazy (infinite) sequence. The inc function adds 1 to the argument passed in. So (iterate inc 1) is saying, "create an infinite sequence starting at 1, and each succeeding element is 1 + the last element". In other words (1 2 3 4 5 ...). The take function takes a number of elements to grab from a sequence. So (take 10 (iterate inc 1)) says, "give me the first 10 elements of the natural numbers, and return them as a sequence". This is good since reduce requires a sequence as its second argument. The #(+ % %2) is an anonymous function that simply adds the two arguments together. The % and %2 are placeholders for arguments.

When reduce is called the first time, it will take the first 2 elements from the sequence, and pass these to % and %2 respectively. The result of #(+ % %2) is used as the incoming argument to % on all succeeding calls, and the next element in the sequence is used for the %2 argument. So for example, the first several calls would look like this:

(+ 1 2) ;; returns 3
(+ 3 3) ;; returns 6
(+ 6 4) ;; returns 10
(+ 10 5)
...
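For comparison, here is roughly the same summation in Python (Python 3 syntax, where reduce lives in functools). I'm using itertools.accumulate only to show the running totals that reduce passes along from call to call.

```python
from functools import reduce
from itertools import accumulate, count, islice
from operator import add

# Roughly (take 10 (iterate inc 1)): the first ten natural numbers.
first_ten = list(islice(count(1), 10))

# Roughly (reduce #(+ % %2) ...): fold the list down to a single value.
total = reduce(add, first_ten)

# The intermediate results, one per step of the reduction.
running = list(accumulate(first_ten, add))

print(total)     # 55
print(running)   # [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
```

Apart from the leading 1, each entry in running is the return value of one of the (+ ...) calls listed above.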

So hopefully that explains map and reduce, but what about apply? The closest thing in python is argument unpacking with f(*args) (python 2 also has a deprecated apply() built-in), but it doesn't really get used the way clojure uses apply. Since the comparison isn't a clean one, I'll move on to something else.

A common scenario is creating a vector composed of the same index element in all the other vectors. For example, given these arrays:

a = [ 1, 2, 3]
b = ['a', 'b', 'c']
c = ['first', 'second', 'third']

I want to create a set of vectors like this:

d= [[1, 'a', 'first'], [2, 'b', 'second'], [3, 'c', 'third']]

Python has a built in function called zip() which does this for you.

d= zip(a, b, c)

The only difference is that python will return a list of tuples rather than a list of lists. For clojure, you may think the equivalent is zipmap, but it's not. Clojure's zipmap takes exactly two sequences (keys and values) and builds a map out of them. So how do you pass in an arbitrary number of sequences as we did with python's zip() function? One solution is a map.

(map vector (iterate inc 1) ["a" "b" "c"] [:first :second :third] )

Try passing that into the REPL, and notice what you get. You actually get a sequence, not another vector. This is something to watch out for in Clojure. Test it out by wrapping the call above inside of (class ).
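For what it's worth, the clojure call above has a fairly direct Python counterpart. This is a sketch in Python 3 syntax, where itertools.count(1) plays the role of (iterate inc 1).

```python
from itertools import count

letters = ["a", "b", "c"]
ordinals = ["first", "second", "third"]

# zip stops at the shortest input, so the infinite counter is safe here.
rows = [list(group) for group in zip(count(1), letters, ordinals)]

print(rows)   # [[1, 'a', 'first'], [2, 'b', 'second'], [3, 'c', 'third']]
```

Just like clojure's map, zip stops as soon as the shortest input runs out, which is why pairing an infinite counter with two short lists works.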

So how about iterating through things? Is it true that in Clojure (and other lisps) you always have to do recursion? Well, not exactly. But since this is a rather complex topic, I'll save that for a later blog :)

Saturday, August 27, 2011

Python and Clojure comparisons

Since I've slowly been trying to move to a functional style of programming (despite most companies in the industry still wanting all kinds of esoteric knowledge about OOP, design patterns and other OO madness like UML), I thought it might be interesting to contrast some python code with clojure code. The python code I'll show here is definitely not the norm, but still valid.

Since I just covered clojure destructuring, it might be helpful to see the equivalent usage in python. Take for example a common use in python for passing in arguments to functions like these:

def get_positional_args(*args):
    for arg in args:
        print arg

def get_positional_arg1(farg, *args):
    print "first argument is", farg
    for i, arg in enumerate(args):
        print "argument", i + 2, "=", arg

def get_keyword_args(**kwargs):
    for name in kwargs:
        print name, "=", kwargs[name]

def get_pos_and_kw_args(farg, sarg, *args, **kwargs):
    print "first argument is", farg
    print "second argument is", sarg
    for i, arg in enumerate(args):
        print "argument", i + 3, "=", arg   # extra args start at position 3
    for name in kwargs:
        print name, "=", kwargs[name]

And these are the equivalent functions in clojure:

(defn get-positional-args [ & more ]
  (println more))

(defn get-positional-arg1 [ x & more ]
  (println x)
  (println more))

;; takes a map; each entry can be indexed like a [key value] vector
(defn get-keyword-args [ m ]
  (doseq [ kv m ]
    (println "key =" (kv 0) "value =" (kv 1))))

(defn get-pos-and-kw-args [ f s m & more ]
  (println "first arg is" f)
  (println "second arg is" s)
  (doseq
    [ arg (map vector (iterate inc 3) more) ]
    (println "Argument" (arg 0) "is" (arg 1)))
  (doseq [ kv m ]
    (println "key =" (kv 0) "value =" (kv 1))))

Now, I don't know about you, but I think that the argument that 'lisp' has too many parentheses isn't really all that true for clojure. Sure, it has way more than python, and python is perhaps the cleanest looking language I've ever seen, but the other lisps/schemes I've seen aren't in the same league as clojure when it comes to reader friendliness.

So what are some other things that python and clojure have in common? Believe it or not, python has a lazy language feature as well. Clojure prefers using lazy sequences whenever possible (though it's not a lazy language by default like Haskell). Still, through the use of Clojure macros or functions like delay and force, clojure can (explicitly) be made very lazy. Python isn't nearly as lazy, but it does have one nice lazy feature....generators.

Generators and generator expressions are a way to explicitly run through a (possibly infinite) sequence. The key to generators is the 'yield' statement, which "freezes" the function until the generator object's next() function is called. For example, we can create an infinite series of even numbers:

def gen_even():
    i = 0
    while True:
        yield i
        i += 2


def iterate_range(gen, i):
    res = None
    for x in range(i):
        res = gen.next()
    return res   # the last value pulled from the generator

However, the implementation above would not create a sequence as most people think (though technically, the generator itself has a next() method ). A generator is really just an object with a next() method, and when you call that next() method, it 'yields' a value. The next next() call generates another value. But a generator by itself does not generate a sequence. To generate a list, let's do this:

def create_map(gen, i):
    return [ gen.next() for x in xrange(i) ]

What would happen if we call this (note, use xrange rather than range, since range has to generate a list consuming more memory)?

g = gen_even()
create_map(g, 10)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

This is very similar to lazy sequences in clojure. For example, the equivalent in clojure to the above would be this:

(def v (iterate #(+ 2 %) 0))

I could perform the following actions on the lazy sequence v

(take 10 v)
(nth v 9)

The first line would yield the same as python's create_map() above. The second function would return the 10th item in the sequence. The advantage of both python's generator and clojure lazy sequence is that the entire sequence is NOT stored in memory. Only whatever is required for the calculation is needed.
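If you want something closer to clojure's take and nth on the Python side, itertools gets you most of the way there. This is a sketch in Python 3 syntax, and the helper names evens, take and nth are my own.

```python
from itertools import count, islice


def evens():
    """0, 2, 4, ... roughly (iterate #(+ 2 %) 0)."""
    return count(0, 2)


def take(n, iterable):
    """Return the first n items as a list, like clojure's take."""
    return list(islice(iterable, n))


def nth(iterable, n):
    """Return the item at zero-based index n, like clojure's nth."""
    return next(islice(iterable, n, None))


print(take(10, evens()))   # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
print(nth(evens(), 9))     # 18
```

One difference to keep in mind is that a Python generator or iterator is single pass, so each call here starts from a fresh evens(). A clojure lazy sequence like v can be walked again from the start.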

So why would you use generators or lazy sequences? One possibility is to eliminate recursive calls. Recursion in both python and clojure consumes stack space, and thus might blow out when the recursion depth is very large. Another possibility is to model infinite sequences such as those found in math. For example, the sum of 1/2^i as i goes from 1 to infinity is 1.

def half_gen():
  i = 2
  while True:
    yield 1.0/i
    i = i * 2
    
h = half_gen()
sum([ h.next() for x in xrange(1000) ])   # stay around 1000 terms: 1.0/i overflows once i gets past the float range
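Here is a variation on the same series as a sketch in Python 3 syntax. Halving a float on each step (instead of dividing by an ever growing integer) is just one way to keep every term inside floating point range, and islice takes the first 60 terms without building an intermediate list.

```python
from itertools import islice


def half_gen():
    """Yield 1/2, 1/4, 1/8, ... forever."""
    term = 0.5
    while True:
        yield term
        term /= 2


# Add up the first 60 terms lazily.
partial = sum(islice(half_gen(), 60))
print(partial)
```

The partial sum closes in on 1 very quickly, and after a few dozen terms a float can no longer tell the difference.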

Wednesday, August 24, 2011

Multiprocess TaskServer is working

I was just about to write a blog describing how I took one step forward and took one step back. I spent a good chunk of Saturday and today working on getting teleproc to be able to run multiple processes. The old TaskServer class is now the Task class, and the TaskServer class is now a container of Tasks (yes, I know, I am breaking the API, but since this isn't even an alpha, and thus the API is NOT stable, I don't have a problem with that).

Basically the TaskServer receives commands from a client, and directs those commands to a Task object. So the client has to now specify which Task object he is interested in. To avoid intermingling the stdout of all the processes, I am prepending each line of output with a local ID. I will rewrite the GUI client to filter each line based on the ID and strip it out. It will then append the given line of text (from a given process) to an output area (probably a TabbedPane or something in swing or SWT).
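The ID-prefix idea can be sketched in a few lines of Python (Python 3 syntax). To be clear, this is not the teleproc code, which is Java. The "id:text" line format and the names tag_line and demultiplex are assumptions made purely for illustration.

```python
from collections import defaultdict


def tag_line(task_id, line):
    """Server side: prepend the (assumed) local ID to a line of output."""
    return "{}:{}".format(task_id, line)


def demultiplex(tagged_lines):
    """Client side: strip the ID and group the lines per task."""
    panes = defaultdict(list)
    for tagged in tagged_lines:
        task_id, _, text = tagged.partition(":")
        panes[task_id].append(text)
    return dict(panes)


# Interleaved output from two hypothetical tasks.
stream = [
    tag_line("1", "compiling"),
    tag_line("2", "running tests"),
    tag_line("1", "done"),
]
panes = demultiplex(stream)
print(panes)   # {'1': ['compiling', 'done'], '2': ['running tests']}
```

Each key in panes would correspond to one output area in the GUI, so the interleaved stream ends up separated per process again.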

The trick to reconnecting was that when the GUI client disconnected, the Channel object disconnected with it. When the GUI client reconnected, a new Channel object is created by Netty. I had to save this Channel object and set all of my Task object's readOut and readErr ThreadReader objects to use this new channel. So I had to add some logic to do this in my ThreadReader class.

I still haven't implemented the broadcast feature yet though. I want all clients who are connected to be able to get the same output. I also am stoked to convert this into a clojure project. Now that I am a little more familiar with how Netty works, I think I can use this as the basis for the networking engine for my game. I also believe that there are a couple of areas that clojure will shine in. For example, my parseCommand function is pretty ugly in java, and I think there's a more elegant solution I can use in clojure.

I also need to make the actual Eclipse Public License documents since that is the Open Source license I am going to use. But the directions are really spotty. I did find one webpage with a checklist of documents, so I will do that soon.

Monday, August 22, 2011

Something new to learn...emacs

I've mostly just been using jEdit as my REPL of choice for clojure, with Eclipse's counterclockwise plugin to actually edit clojure files. However, I don't like how ccw doesn't let me fire up a standalone REPL, and I didn't like how jedit's clojure repl didn't have a paredit feature. I also don't like the enclojure plugin for Netbeans which forces you to use Maven style projects for clojure (why do that when there's the excellent leiningen?). Since Clojurebox is no longer maintained, I figured I may as well FINALLY learn emacs. I used to joke that you can't really call yourself a linux hacker unless you know either vim or emacs really well.

There are several advantages to using emacs. Firstly, it seems like most clojurians tend to use it because of the lisp heritage of emacs. Using swank-clojure and clojure-mode, I can edit clojure files and have a repl too straight from leiningen. Also, emacs with swank-cdt appears to be one of the only ways of debugging clojure code (allowing you to set breakpoints and such).

On the downside, emacs is....painful. It is more than an editor, it is basically its own little ecosystem. I tried learning vim before, but the whole 'command-mode' vs. 'edit-mode' screwed me up. I am hoping that emacs 'buffers' don't do the same to me. Then there's also the complexities of SLIME itself.

As it turns out, getting emacs up and running for Windows was a teensy bit trickier than I had hoped for. Although the description for getting emacs up and running at the official clojure dev site was helpful, it was also geared towards *nix users. For example, it tells you to create if necessary, an ~/.emacs.d folder. So I assumed that in Windows, the ~ (the linux HOME directory) would be the Windows equivalent of C:\Users\<userdir>. That's not the case. It's actually C:\Users\<userdir>\AppData\roaming.

But other than that little trick, things turned out fairly well. So I am now on my way to understanding the lovely world of working with and in emacs :)

UPDATE:
So far, emacs doesn't seem nearly as bad as VIM was when I tried to learn it. I do need to get Marmalade set up, though that requires an .emacs file which might interfere with the Starter Kit settings. Nevertheless, emacs is pretty cool. I like the regex search feature and being able to move forward word by word instead of just by beginning or end of line (like with the home or end keys). The clojure mode syntax highlighting is nice, though I still need to figure out how to use swank-clojure inside of emacs.

Also, it appears that along with swank-cdt, there is also a project called ritz (forked from swank-clojure) which allows for setting breakpoints in clojure code.

Saturday, August 20, 2011

Clojure learning assignment: destructuring

I've decided to do at least a once-weekly foray into topics in clojure to help both myself and others learn this fascinating language. I do not really have a lisp background (I do not count the few hundred lines of code I wrote in an AI class a long time ago), and thus clojure has been a bit of a mind-warping stretch for me. Here are some of the topics I will eventually cover:

1. Destructuring
2. Coverage of the "cheat sheet" functions with examples
3. Lazy sequences vs. recursion
4. Macros (once I've figured them out myself)
5. A multithreaded merge sort using refs
6. A multithreaded tree traversal
7. Examples of gen-class and proxy
8. Examples of using defrecords

And anything else I can think of. Mind you, I'm still learning this language in many ways as I go. What I hope that I can provide that the clojurian masters may not be able to, is the perspective of a newbie to the lisp and functional programming world. I personally learn through real-world examples, and while the books The Joy of Clojure, Programming Clojure, and Clojure in Action have all been very helpful, sometimes I wish there had been a little bit more attention paid to the small things. Just try and do a (doc ->>) and you will know what I mean. But anyhow....off to my first lesson, destructuring.

I often hear that Clojure (and lisps) have very little syntax, and they tout this as a defining feature of the language. While to a degree that's true compared to many other languages, there ARE more syntax rules than may appear at first glance. Take for example code like this

(defn get-parts
  [ [x y z & others ] ]
    (do
      (println "First three are:" x y z)
      (println "Rest is:" others))
    others)

Normally, the first thing you'll see after the function name is the argument list (or possibly a docstring or metadata). But that's a kind of strange looking argument list. What am I supposed to pass in there? It kind of looks like I pass in an array of symbols...but what's that ampersand doing there?

We can run it like this:

user=> (get-parts [ 1 2 3 4 ] )
First three are: 1 2 3
Rest is: (4)
(4)

This is one of clojure's destructuring forms which is loosely akin to pattern matching found in other languages. The above code takes in a sequence of some form, splits out the first three values into x, y, and z respectively, and then stuffs the remainder of the sequence into others. It would be code equivalent to this:

(defn get-parts-no-dest
  [ s ]
  (let [ x (nth s 0)    ; nth is zero-indexed
         y (nth s 1)
         z (nth s 2)
         others (drop 3 s) ]
    (do
      (println "First three are:" x y z)
      (println "Rest is:" others))
    others))

As you can see, the destructuring above did cut down on some lines of code...if at the price of some readability in my opinion. Unfortunately, using destructuring seems to be the preferred idiomatic clojure style.
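For anyone coming from Python, the closest analog to this kind of sequence destructuring is unpacking with a starred name. This is a sketch in Python 3 syntax (the starred form does not exist in Python 2), and get_parts here is just a stand-in with the same name as the clojure function.

```python
def get_parts(seq):
    """Split off the first three items and keep the rest, like [x y z & others]."""
    x, y, z, *others = seq
    print("First three are:", x, y, z)
    print("Rest is:", others)
    return others


rest = get_parts([1, 2, 3, 4])   # rest is [4]
```

Unlike the clojure version, Python raises a ValueError when the sequence has fewer than three items instead of binding the missing names to nil.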

So the above example works well for a vector as well as a list or sequence. It will not however work on a map of any sort. If we try, we will get this:

user=> (get-parts { :1 1 :2 2 :3 3 :4 4} )
java.lang.UnsupportedOperationException: nth not supported on this type: PersistentArrayMap (NO_SOURCE_FILE:0)

So are there destructuring forms for maps? Of course. Here's an example where we take a map containing the keys fname, address and city, print them and return a vector of the values of those keys:

(defn get-parts-map
  "Takes a map with keys fname, address and city and prints them"
  [{:keys [fname address city]}]   ; binds fname to (:fname m), and so on
  (println "Name:" fname)
  (println "Address:" address)
  (println "City:" city)
  [fname address city])

If we called it with a map like { :fname "John Doe" :address "1234 Cherry Lane" :city "Timbuktu" }, we would see this:

user=> (def john_doe { :fname "John Doe" :address "1234 Cherry Lane" :city "Timbuktu" } )
#'user/john_doe
user=> (get-parts-map john_doe)
Name: John Doe
Address: 1234 Cherry Lane
City: Timbuktu
["John Doe" "1234 Cherry Lane" "Timbuktu"]

Notice how we used the :keys keyword and followed it with a vector of symbols, not keywords. Keep that in mind when destructuring maps. You can use these destructuring features in let forms as well. For example, I could have written the code above like this:

(defn get-parts-map-w-let
  "Takes a map with keys fname, address and city and prints them"
  [m]
  (let [{:keys [fname address city]} m]   ; same pattern, bound in a let
    (println "Name:" fname)
    (println "Address:" address)
    (println "City:" city)
    [fname address city]))

And the output would be exactly the same as above. Besides the :keys directive, you may use :syms if the keys are symbols (instead of keywords), or :strs if the keys are strings.
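Here's a small sketch of those two directives. The function names describe-strs and describe-syms are made up for this example; only :strs and :syms themselves come from clojure.

```clojure
;; :strs pulls values out of a map whose keys are strings.
(defn describe-strs
  [{:strs [fname city]}]
  (str fname " lives in " city))

;; :syms does the same for a map whose keys are symbols.
(defn describe-syms
  [{:syms [fname city]}]
  (str fname " lives in " city))

(describe-strs {"fname" "John Doe" "city" "Timbuktu"})
;; => "John Doe lives in Timbuktu"

(describe-syms {'fname "John Doe" 'city "Timbuktu"})
;; => "John Doe lives in Timbuktu"
```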

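The other useful destructuring form is to use a map pattern to pick elements out of a vector by index (this works because a vector can be looked up by index the same way a map is looked up by key). For example, you could do something like this: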

(let [{dog 0 cat 1} ["husky" "persian" "pug" "siamese"]]   ; dog = index 0, cat = index 1
  (println "Dog is a" dog "and cat is a" cat))

This would print out "Dog is a husky and cat is a persian".

Monday, August 15, 2011

Training others

I've taken on the responsibility of teaching some new, and some not-so-new, people how to program. Right now, I am teaching them the basics of the language we are (mostly) using, but I am also trying to teach them some of the finer points of software engineering that I had to learn from experience. Some of the people I am training don't have Computer Science or Software Engineering degrees, but do have Electrical or Computer Engineering degrees. So I'm trying to impart just some general guidelines on writing decent code.

Working as a team-
Many of the other points I will cover below have this as a root element to consider. When I went to school, I had a grand total of two group projects, only one of which actually had any code to it. That's totally unrealistic in the real world. The fact is, your code and your work do not live in isolation. Your code should be readable by others, they should know where to obtain your code, you should not write an entirely new library that duplicates one someone else has already built (though you can make enhancements or improvements to theirs), and you should document your code so that others know how to install and use what you created.

Revision Control-
I have found it amusing that at 2 different workplaces, the Electrical Engineers were somewhat up in arms over having to learn a revision control system, and yet the CS people were more fascinated by it. Unfortunately, when I went to school, they weren't teaching anything about revision control systems, much less why you would need one. And sometimes you do have to explain to someone why you would need one. But without revision control, how do you experiment with your code? How do you tag your code so that you can replicate an issue a customer is seeing? How do you distribute your code so that others can see it and possibly make enhancements to it? Many engineers are frightened when they first attempt to use a revision control system, because they are afraid they will jack up someone's code base. Also, some revision control systems are easier to learn than others (I personally am finding git far harder to learn than mercurial). But these are small drawbacks compared to what a revision control system provides.

Code Reviews-
Many engineers are scared of code reviews when they first start. Throughout school, it is ingrained in you not to share your code with others, and as a consequence you don't have to worry about what your code looks like. But once a new engineer accepts the fact that many eyes will be seeing his code, this alone changes how he writes (or at least will change once several comments come in). Code reviews are necessary because they are the next line of defense for finding bugs after your unit tests. I also remind reviewers that they should actually try to understand the code, rather than look for just superficial things like coding standards. This takes more time, but I believe that it improves your own coding as well as helping the one being reviewed.

Unit testing-
Usually in the madhouse rush to get something working, testing falls by the wayside. I am NOT a proponent of TDD, where you write your tests before you actually write your feature code, but one should eventually write tests for their code. When using a dynamic language, it is often necessary to check that the type of arguments passed in is correct. Make sure you write lots of negative tests too, because it's rather embarrassing to discover that invalid inputs make your function return a supposedly valid result.
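As a rough sketch of what I mean by negative tests, here is how it might look with clojure.test. The function safe-div is a made-up example of mine, not from any library; the point is only that the tests cover bad input as well as good input.

```clojure
(require '[clojure.test :refer [deftest is run-tests]])

(defn safe-div
  "Hypothetical example: divides a by b, rejecting input that makes no sense."
  [a b]
  ;; In a dynamic language nobody stops a caller from passing a string,
  ;; so check the argument types ourselves.
  (when-not (and (number? a) (number? b))
    (throw (IllegalArgumentException. "both arguments must be numbers")))
  (when (zero? b)
    (throw (IllegalArgumentException. "divisor must not be zero")))
  (/ a b))

;; Positive test: valid input gives the expected answer.
(deftest safe-div-valid-input
  (is (= 2 (safe-div 4 2))))

;; Negative tests: invalid input must fail loudly, not return a "valid" result.
(deftest safe-div-invalid-input
  (is (thrown? IllegalArgumentException (safe-div 4 0)))
  (is (thrown? IllegalArgumentException (safe-div "4" 2))))
```

Calling (run-tests) at the REPL runs both tests in the current namespace.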

Reusability-
I usually give an anecdote for this. Imagine that you write a script that performs some functionality for a test you have. Later, you are tasked with a very similar problem, and so you write a 2nd, albeit slightly different script. And then you do so later for a 3rd and 4th script. But then, some functionality in a library your scripts use changes. Perhaps a new product is made which requires a different parameter to be passed in. Now, you have 4 scripts, and you have to go in and change all 4 of them. Always try to isolate code that could possibly vary and keep it in a library, class or module of some sort. I try to stress writing functions over writing scripts so that I have only one or two scripts, whose behavior changes depending on the arguments that get passed in.

DON'T copy and paste-
This rule is also known as DRY (don't repeat yourself). Copying and pasting code is BAD. Why is it bad? Because when you copy and paste code, you copy and paste bugs. And when you need to make an enhancement to your code, every place you copied and pasted now has to be fixed as well. As obvious as this one sounds, I am amazed in code reviews how many people simply copy and paste functions or worse, parts of functions into other functions.

Keep it simple stupid-
General Patton once said, "Don't give great orders. Give orders that can be understood." Code is read more than it is written. If your code gets too fancy, consider simplifying it so that others can understand it. Of course this has its limits. If the most efficient code is complex, don't be afraid to write it that way, just comment the heck out of what your code is doing.

Avoid functions that return void-
This sort of goes along with unit testing, or perhaps testing in general. Functions that return void are usually either mutating the state of some object (either of itself, if this is a method, or of some argument that is passed in), or they are impure, and exist only for some side effect (for example, updating a database, or printing to a log file). The trouble is, how do you test this? If the function is a method of an object, and it mutates some field in the object, now you need a second function that has to be called to make sure the field in that object is correct. But what if this is a multi-threaded program? It is entirely possible that another function can change the state of the object before your test function gets a chance to run. Now you have to write some locks to make sure this is correct. All of this can be avoided if you simply return some values and check those (the data might be stale by then, but it was valid at the time of the original function call).
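To make the contrast concrete, here is a minimal sketch in clojure. The account atom and the deposit! and deposit functions are invented for illustration. The first version can only be checked by going back and reading shared state afterwards; the second can be checked by looking at what it returns.

```clojure
;; "void" style: mutates shared state and returns nothing useful.
;; To test it you have to inspect the atom afterwards, and another
;; thread may have changed the atom in the meantime.
(def account (atom {:balance 100}))

(defn deposit!
  [amount]
  (swap! account update :balance + amount)
  nil)

;; Value-returning style: takes the account as an argument and
;; returns the updated account, leaving the original untouched.
(defn deposit
  [acct amount]
  (update acct :balance + amount))

(deposit {:balance 100} 25)
;; => {:balance 125}
```

A test for deposit is a single comparison of the return value against the expected map, with no shared state and no locks involved.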

Document, document, document-
One of the reasons that python has such elegant syntax is that Guido van Rossum had the insight that code is read far more often than it is written. You should make it even easier for people to understand your code by writing copious documentation. Now, one shouldn't comment the obvious, but if anything might be even remotely unclear, it's a good idea to comment for others (and yourself!!) on what your code is trying to accomplish. Also, learn the markup tool of choice for your language (doxygen, sphinx, doxia, javadoc, etc), as being able to publish a pdf or to have the documentation in html format is really really nice.

Use the debugger as a last resort-
This is where a lot of people may disagree with me, but to me, debuggers are the big guns of troubleshooting. Prefer loggers to debuggers when possible. For example, in C or C++, debugging macros or templates is very difficult. Loggers on the other hand can expand the macro for you, and you can also print out any genericized object. An exception to this is when you are learning someone else's code, and you want to figure out what is going on. Very complex code almost requires this.

Optimize AFTER your code works-
Unless you know a good algorithm right from the beginning, make something work first then make it faster. However, do keep in mind the following:

1. Nested for loops are almost always a bad sign (roughly n^k running time, where k is the number of nested loops over n items)
2. Sorting a data structure once (n log n) and then searching it is usually more efficient than repeatedly scanning it for items in random order
3. When using recursion, watch out for potentially huge values being passed in (which will blow your call stack)
4. When using recursion, watch out for a function calling itself more than once per invocation (e.g., naive fibonacci, which runs in exponential time, roughly O(2^n))
5. Don't be afraid of recursion. Yes, it pushes a new frame on the call stack and thus is slower, but often, recursive solutions are easier to understand than an equivalent for or while loop.
6. Be wary of cyclic data structures or potential ones (e.g., a linked list where one node points to a previous node). Your code might work on a non-cyclic data structure, but a cyclic one might make you spin forever or blow your call stack away.
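The recursion points above can be sketched with fibonacci in clojure. Both function names here are my own. The first version calls itself twice per invocation and slows down very quickly as n grows; the second uses loop/recur, which reuses the same stack frame, so a huge n neither blows the stack nor repeats work.

```clojure
;; Naive version: two recursive calls per invocation, so the number of
;; calls grows exponentially with n.
(defn fib-naive
  [n]
  (if (< n 2)
    n
    (+ (fib-naive (- n 1)) (fib-naive (- n 2)))))

;; Iterative version: loop/recur does not grow the call stack, and each
;; step is a single addition. +' promotes to big integers instead of
;; overflowing when the numbers get large.
(defn fib-iter
  [n]
  (loop [a 0, b 1, i n]
    (if (zero? i)
      a
      (recur b (+' a b) (dec i)))))

(fib-naive 20)
;; => 6765
(fib-iter 20)
;; => 6765
```

The trade-off from point 5 still shows here: the naive version reads almost exactly like the mathematical definition, while the loop version needs a moment of thought to see why it is correct.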