Under A Boddhi Tree: October 2011

Saturday, October 29, 2011

Using emacs and leiningen on Windows for clojure pt. 2

Setting up leiningen

So in the last post, I described how to set up emacs and leiningen for Windows. We're actually not quite done with some of the tools we'll need. In the last post, we installed a package for emacs called the slime-repl. Now it's time to install a plugin for leiningen that lets SLIME talk to the clojure REPL.

There's a nifty plugin called swank-clojure that allows the SLIME protocol to talk to a running clojure process and interact with it. I can't go into too much detail about all the commands of SLIME itself since I'm still learning it as well, but I will show you how to install the plugin and run a REPL with it.

The first step will be actually installing swank-clojure. You can even do this from within emacs. To do this, type M-x eshell in your minibuffer. You will get an emacs version of the command console. It's not exactly the cmd console; if you've ever used the MSYS or cygwin shells, it's kind of similar. For example, instead of using the command dir to list directories, you use ls. Also, to tab complete a directory, you use posix style forward slashes instead of back slashes (RANT ON: I have spent weeks at my job discovering and working around asinine Microsoft pathing issues. Whoever thought that having spaces in directory paths, or using backslashes...which are also escape characters...as path separators, needs to be forced to listen to Justin Bieber and Rebecca Black songs continuously for 24hrs with no breaks...RANT OFF).

In the eshell, you can run the command to install a leiningen plugin like this:

lein plugin install swank-clojure 1.3.3

Assuming you have no proxies (or your proxy for leiningen is correctly set up), and leiningen was successfully installed previously (did you remember to install curl or wget in your PATH?), then the command above will install the swank-clojure plugin for you.

Creating a leiningen project

So now we get to learning what leiningen provides for you. If you are a Java programmer, odds are that you are familiar with Maven (or maybe Ivy). Leiningen is for clojure what Maven is for Java. Indeed, you could use Maven for your clojure projects, but leiningen does many clojure specific things for you. It will also drastically reduce the time required to set up a clojure project.

The first step is to create your project. Leiningen has a command called new which sets up a base project for you. For example, if I am in directory C:\Users\stoner (my name is Sean Toner , thank goodness it's not Frank Ucker :) ), and I use the command:

lein new MyFirstCljProject

then leiningen will create a directory named MyFirstCljProject, and you will see a bunch of files and directories inside of it.

Looking into NoSQL

I've been looking into Graph Databases lately, mainly because they just seem interesting to me. My SQL skills are very basic. I've designed some rudimentary tables that have some basic constraints, and the foreign key/primary key relations between tables have usually only been 3-4 columns at most. I can do your basic queries, including simple (inner) joins between 2 or 3 tables. And I can do basic inserts and updates. But if you start asking me how transactions work, what the best connection pool policy is, or anything more advanced than that, I will just give you a blank stare. Because I use SQL so infrequently, even simple things like updating a record in a table (or tables) require me to look up the command.

My team lead at work has said that people he's trained either get relational DB or they don't. Apparently, I "get" it, but it still seems a bit awkward to me. Especially the syntax of SQL. I mean a non-case sensitive language (yeah yeah, LISP is case in-sensitive)? Also, SQL isn't Turing complete. You don't have branches or conditionals for example. I also had trouble modeling a tree type structure in a classic SQL database.

So out of curiosity, I started looking at NoSQL. NoSQL is a family of non-RDBMS style database technologies, including key-value stores (Redis), document databases (CouchDB and MongoDB), column family stores (Hadoop's HBase, BigTable), and graph databases (Neo4j, OrientDB) among others. The graph databases seemed to pique my interest for some reason, so I decided to investigate them.

As a consequence, I've been looking at Neo4j and OrientDB. Essentially, graph databases are, well....graphs :) You have Nodes (Vertexes), Edges (Relationships), and properties. As I understand it, these graphs are all directed, but the direction can go either way. What interested me about this style of modeling data is that the relationships are inherent in the graph itself. In other words, the traversal of the graph to get from one Node to another Node is your relationship. I immediately started seeing the advantages of this.

For example, imagine installing software. This is a graph structure, and in fact, it is a directed acyclic graph (DAG). Let's say you need to install a python script that requires sqlalchemy. What would this look like?

The edges are semantic notation, but notice how they are a relation? By traversing the graph, we can see how to install script.py. If the relation is either REQUIRES or INSTALLS_BY, then these are dependencies. I could have set a property inside either the edge or the node as to an installation rule. For example, in the pymysql node, I could have a property like:

install_rule: 'pip install pymysql'

And for sqlalchemy, it would be the same:

install_rule: 'pip install sqlalchemy'

But what about python itself? Or pip? Either the graph could be extended so that each of them has its own INSTALLS_BY edge, or the logic of the traversal could say, "if this is an end Node, look up the property 'install_rule', and use that for installation."
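To make the thought experiment a bit more concrete, here is a rough sketch in plain python, with no graph database involved. The node names come from the example above, but the exact edges, the dictionary layout, and the install_order function are my own assumptions for illustration. This is not how Neo4j or OrientDB actually store or traverse anything.

```python
# Hypothetical in-memory graph: each node has properties and outgoing edges.
# The edge layout below is an assumption made purely for illustration.
graph = {
    "script.py":  {"props": {},
                   "edges": [("REQUIRES", "sqlalchemy"), ("REQUIRES", "pymysql")]},
    "sqlalchemy": {"props": {"install_rule": "pip install sqlalchemy"},
                   "edges": [("INSTALLS_BY", "pip")]},
    "pymysql":    {"props": {"install_rule": "pip install pymysql"},
                   "edges": [("INSTALLS_BY", "pip")]},
    "pip":        {"props": {}, "edges": [("REQUIRES", "python")]},
    "python":     {"props": {}, "edges": []},
}

# Both relation types are treated as dependencies, as described above.
DEPENDENCY_EDGES = ("REQUIRES", "INSTALLS_BY")

def install_order(graph, start):
    """Depth-first walk that lists dependencies before whatever needs them."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for relation, target in graph[name]["edges"]:
            if relation in DEPENDENCY_EDGES:
                visit(target)
        order.append(name)  # appended only after all of its dependencies

    visit(start)
    return order

for name in install_order(graph, "script.py"):
    rule = graph[name]["props"].get("install_rule", "(no install_rule set)")
    print("%s: %s" % (name, rule))
```

With these made-up edges the walk visits python, pip, sqlalchemy, pymysql, and finally script.py, so the relationships themselves tell you what to install first.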

Well, that was just a thought experiment, but it looks way easier to do that than to implement the equivalent in a SQL table (or two). Not to mention that a graph database is reportedly orders of magnitude faster for this kind of lookup (from what I've read, it's on the order of 1000x to 1500x faster).

Neo4j seems to have a cleaner architecture and is better documented, but it is GPLv3 only. OrientDB seems to have many more features (it is both a document db and a graph db, and it also has a subset extension of SQL to look up relations), and it uses a more permissive Apache license. Another advantage I noticed in Neo4j is that it is thread-safe...multiple threads can access the same graphdb object (probably because neo4j enforces that all mutation must occur inside a transaction).

Sadly, all the different Open Source licenses really have me confused. I'm not sure what happens if my project uses some parts that are under EPL, some parts that are under BSD, and some parts under GPLv3, or whether that is even possible.

Tuesday, October 18, 2011

Using emacs and leiningen on windows for clojure: pt. 1

I have three books on clojure currently: Stuart Halloway's Programming Clojure (1st edition), Amit Rathore's Clojure in Action, and Michael Fogus and Chris Houser's The Joy of Clojure. Unfortunately, none of these books really tells you how to get started. They talk about the language of course, but setting up your development environment is not really covered in any of them.

There is a pretty good reference site on getting started with any of the major IDEs available for clojure. I decided I would learn emacs, since it's something I've been meaning to do for a long time. I have used the Counterclockwise plugin for Eclipse, but for some reason, I just didn't really feel like using it. Likewise, I didn't like how Enclojure forced you to use maven for your clojure project. I didn't play with IDEA's La Clojure plugin, so it may be worth investigating. And for quick down and dirty experimenting, the clojure shell for jEdit really isn't that bad.

But as I mentioned, I wanted to use emacs. Because I have a brand new motherboard that uses UEFI, I have had problems finding a linux distro I can install on it (I tried Mint and Sabayon, and neither of them liked my new hardware). So I decided to run emacs on Windows. Also, I might be able to help people who are also using emacs on windows, and run into trouble when the documentation clearly only gives instructions for linux users.

Before I go too much further into emacs, let's install a different piece of software for your clojure development which we will use with emacs: leiningen.

Setting up leiningen

Leiningen is a project management tool for clojure, much like Maven is for java (at least until polyglot maven comes out). Indeed, leiningen uses some maven for dependency management behind the scenes. But instead of using XML as the declarative language for your project, you use clojure itself! Leiningen offers many nifty features, including but not limited to:

Dependency Management
Can perform AOT compiling if indicated
Can launch a REPL
Can start a swank session

For now though, let's just install leiningen. It's a little easier for linux users since every distribution I know of comes with either wget or curl. For windows, you'll need to grab one of them. I used curl myself. Both are just .exe files, and they don't have or require installers. Just make sure you put the .exe file somewhere in the system's %PATH%.

Once you have wget or curl in %PATH%, you can install leiningen. Grab the stable leiningen for windows, and put the lein.bat file somewhere in %PATH% again. Once this is done, run the following command in a console:

lein self-install

If you are behind a proxy, make sure you read the directions on the leiningen site about how to specify your proxy. Otherwise, we are done for now.

Python and Clojure comparisons: functional programming

Doing these comparisons actually helps me learn clojure since I'm more familiar with python. I figure that this is probably true for a lot of other people, so hopefully this will help others learn clojure as well. OTOH, what I will cover here is more of a functional style of python programming that people might not be used to. So I hope this will also help people figure out the functional style of programming along with me. And as a warning, I am no functional guru. I'm a guy whose first language was C++, then C, then python, then Java...and I forgot the order of everything else. I've only recently begun programming in a functional style. But I hope that since I am new at this, I'll be in the same boat as others trying to learn this way of programming as well.

So in this entry, I'll cover some functional style programming in the two languages. But before I go too deep, let me begin with a brief introduction on my own understanding of functional programming. As the name suggests, functional programming treats functions as first class citizens. This means that functions can be passed as arguments to functions, or they can be returned as values from functions. Functions can also "close over" data that was in scope where the function was defined, and this is what the term closure means (and where the play on words, clojure, comes from). Functional programming also takes after math. In math, functions take parameters belonging to some domain, and map each element of the domain to an element of a range. Therefore, functional programming languages seem to frown upon functions that take no arguments and return void (returning nothing). Imagine a mathematical function:

nil = f()

What would that even mean? But now I'll go over some common ground between the two languages.
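Before moving on, the first class function and closure ideas above can be sketched in a few lines of python. The names make_adder, apply_twice, and add_five are made up for this illustration.

```python
def make_adder(n):
    # add "closes over" n: it still remembers n after make_adder has returned
    def add(x):
        return x + n
    return add  # a function returned as a value

def apply_twice(func, value):
    # a function passed in as an argument
    return func(func(value))

add_five = make_adder(5)
print(add_five(10))              # 15
print(apply_twice(add_five, 1))  # 11
```

Note that both functions take arguments and return something, which fits the mathematical view described above.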

In clojure, it is often idiomatic to use the functions map, apply, reduce, or filter to solve problems. When I first was learning clojure, I had some difficulty in understanding the difference between the first three of these functions.

The map function takes as parameters a function and one or more collections. The number of collections passed in depends on the function that is passed in to map. If the function takes 2 arguments, then you pass in 2 collections. If the function takes 3 args, then you pass in 3 collections, and so on. The map function will take the 1st element of each collection and pass them as arguments to the function we are using, then do the same with the 2nd elements, and so on. Here's a trivial example in clojure:

(map #(+ % %2 %3) [ 1 2 3 ] [4 5 6] [10 20 30])

Notice the #(+ % %2 %3) anonymous function. This is idiomatic clojure's way of defining an anonymous function. However, I could have also done this:

(fn [ f s t ] (+ f s t))

Anonymous functions in clojure are similar to lambdas in python but more powerful. Lambdas in python can only contain a single expression, thus limiting their power. The equivalent lambda in python would look like this:

lambda x, y, z: x + y + z

But getting back to clojure's map function, the return of map is a lazy sequence containing all the mappings of collection values passed to the function inside of the map. One key to remember here is that we return a lazy sequence, and as such, sometimes it might be necessary to iterate through all values (using a doall).

So how about python's map? Python's map is similar, except that in Python 2 it returns a fully built list rather than a lazy sequence (a generator in python terms). Python 3 changed map to return a lazy iterator. The equivalent python code of the above would be this:

map(lambda x,y,z: x + y + z, [1, 2, 3], [ 4, 5, 6], [10, 20,30])

In python, you might be tempted to do this instead and it would return the same result:

[x+y+z for x,y,z in zip([1,2,3], [4,5,6], [10,20,30])]

However, this has a disadvantage compared to the clojure way. The disadvantage is that clojure's map returns a lazy sequence, but neither Python 2's map nor a list comprehension is lazy. This means that for huge collections, you may run out of memory. However, there IS a way to create a lazy sequence in python, but don't use map() and don't use the list comprehension above either. Imagine if I had done this:

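def increment(start=0, inc=1):
    # Infinite generator: yields start, start+inc, start+2*inc, ...
    i = start
    while True:
        yield i
        i += inc

def get_item(gen, i):
    # Advance the generator and return the item at index i (0-based)
    res = next(gen)
    x = 0
    while x != i:
        res = next(gen)
        x += 1
    return res

needed_value = get_item(increment(), 100000) ## if you have a huge sequence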

Calling get_item() returns only the value the generator produces at iteration i. The advantage of this style is that, since it uses a generator, it never holds the entire sequence in memory. If I had instead built the whole thing up front, for example by calling list() on an xrange to realize all the values, a huge sequence could exhaust memory, so this can be a lifesaver. The same concept applies to clojure's lazy sequences. Rather than hold the entire sequence in memory, a lazy sequence only holds as much as you require. Sometimes, when you need to iterate over the whole sequence, you have to force the evaluation using doall.
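As far as I can tell, python's standard itertools module already covers what the hand-written increment() and get_item() do: itertools.count is an infinite counter, and itertools.islice skips ahead lazily. Here is a minimal sketch, where nth is just a helper name I picked:

```python
from itertools import count, islice

def nth(iterable, n):
    # Lazily skip the first n items, then return the next one (0-based index)
    return next(islice(iterable, n, None))

print(nth(count(0, 1), 100000))  # 100000
```

Nothing is ever collected into a list here, so memory use stays flat no matter how far into the sequence you reach.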

Update:
So python also has generator expressions as well as list comprehensions. Although the code above shows how a generator is manually created, there's a simpler syntax.

for term in (x+y+z for x,y,z in zip([1,2,3], [4,5,6], [10,20,30])):
    print(term)

Notice the use of () parens instead of [] that list comprehensions use. The above code would create a generator and the for term in would operate on this generator. This way you get the advantage of a lazy evaluator.

So that covers map...what about reduce? Perhaps you've heard of the map/reduce algorithm used by projects from Google or Hadoop. We've just seen above what map does. It takes values from one or more collections, passes them to a function, and puts the return values inside of a sequence. The reduce function is similar, but it takes a specific kind of function: one that takes 2 arguments. Also, reduce only takes one collection. So you will often see the result of a map passed to a reduce, since the map call returns a collection, and reduce takes only one collection. The result of a reduce is typically a scalar, not a collection of some sort.

As its name suggests, reduce reduces a collection one by one in order to arrive at a scalar value. The classic example of reduce is a summation. For example, you can sum the series of numbers like this:

(reduce #(+ % %2) (take 10 (iterate inc 1)))

That would sum the numbers 1 - 10. If you're not familiar enough with clojure yet, I'll break it down. The inner (iterate inc 1) returns a lazy (infinite) sequence. The inc function adds 1 to the argument passed in. So (iterate inc 1) is saying, "create an infinite sequence starting at 1, and each succeeding element is 1 + the last element". In other words (1 2 3 4 5 ...). The take function takes a number of elements to grab from a sequence. So (take 10 (iterate inc 1)) says, "give me the first 10 elements of the natural numbers, and return them as a sequence". This is good since reduce requires a sequence as its second argument. The #(+ % %2) is an anonymous function that simply adds the two arguments together. The % and %2 are placeholders for arguments.

When reduce is called the first time, it will take the first 2 elements from the sequence, and pass these to % and %2 respectively. The result of #(+ % %2) is used as the incoming argument to % on all succeeding calls, and the next element in the sequence is used for the %2 argument. So for example, the first several calls would look like this:

(+ 1 2) ;; returns 3
(+ 3 3) ;; returns 6
(+ 6 4) ;; returns 10
(+ 10 5)
...
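For comparison, here is a rough python version of the same summation. Python's reduce lives in functools (it is also a builtin in Python 2), and I'm using itertools.count and islice as stand-ins for iterate and take. The traced_add helper and the steps list are made up just to record each call.

```python
from functools import reduce  # required in Python 3, also available in 2.6+
from itertools import count, islice

# Roughly (take 10 (iterate inc 1)): lazily yields 1, 2, ..., 10
first_ten = islice(count(1), 10)
total = reduce(lambda acc, x: acc + x, first_ten)
print(total)  # 55

# Record the pair of arguments reduce passes in on each call
steps = []
def traced_add(acc, x):
    steps.append((acc, x))
    return acc + x

reduce(traced_add, [1, 2, 3, 4, 5])
print(steps)  # [(1, 2), (3, 3), (6, 4), (10, 5)]
```

The recorded pairs line up with the clojure calls listed above: the running result goes in as the first argument, and the next element of the sequence goes in as the second.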

So hopefully that explains map and reduce, but what about apply? Actually, python doesn't really have a version of apply...at least not what clojure considers an apply. Since I can't really compare, I'll move on to something else.

A common scenario is creating a vector composed of the same index element in all the other vectors. For example, given these arrays:

a = [ 1, 2, 3]
b = ['a', 'b', 'c']
c = ['first', 'second', 'third']

I want to create a set of vectors like this:

d= [[1, 'a', 'first'], [2, 'b', 'second'], [3, 'c', 'third']]

Python has a built in function called zip() which does this for you.

d= zip(a, b, c)

The only difference is that python returns a list of tuples rather than a list of lists. For clojure, you may think the equivalent is zipmap, but it's not. Clojure's zipmap takes exactly two sequences (keys and values) and builds a map from them; it doesn't take more (or fewer) than that. So how do you pass in an arbitrary number of sequences as we did with python's zip() function? One solution is a map.

(map vector (iterate inc 1) ["a" "b" "c"] [:first :second :third] )

Try passing that into the REPL, and notice what you get. You actually get a sequence, not another vector. This is something to watch out for in Clojure. Test it out by wrapping the call above inside of (class ).
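Python has a similar gotcha on its side: zip() hands back tuples, and in Python 3 it is a lazy iterator rather than a list. If you want exactly the list of lists shown for d earlier, a small sketch (the variable name group is my own) would be:

```python
a = [1, 2, 3]
b = ['a', 'b', 'c']
c = ['first', 'second', 'third']

# Convert each tuple produced by zip() into a list
d = [list(group) for group in zip(a, b, c)]
print(d)  # [[1, 'a', 'first'], [2, 'b', 'second'], [3, 'c', 'third']]
```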

So how about iterating through things? Is it true that in Clojure (and other Lisps) you always have to use recursion? Well, not exactly. But since this is a rather complex topic, I'll save that for a later blog :)