Sunday, June 30, 2013

So, how do you say "kowtow" in Mandarin?

On Thursday my order for Fluenz Mandarin 1+2+3 arrived, and I started it yesterday.  So far it's pretty interesting, though I'm only on lesson 3.  This is the first long (2 hr) lesson; the first two were relatively short.  It's a bit expensive at $358, but I purchased it thanks to a rather large reward I got at work.  I had to do a bit of deciding between it and Rosetta Stone, but the reviews were hands down in Fluenz's favor.

One rather big difference between Fluenz's approach and Rosetta Stone's is that Rosetta Stone teaches by the immersion method, which seems to work well for children.  Fluenz, however, believes (based on indications from neuroscience) that adults lose the "plasticity" of the brain that makes children such rapid language learners.  Unfortunately, when I was growing up, the prevailing attitude was that teaching a child multiple languages was harmful because it might confuse them.  This is entirely backwards; the best time to learn a language is as a child.  So Fluenz took a different approach, which for lack of a better term is the "explanation" approach.  Rather than relying on heavy memorization or showing lots of pictures (the immersion method), things are basically explained to you as you go along.  So far, I've found their approach fairly easy.

The biggest criticism of the Fluenz program is its lack of instruction in Chinese characters.  I remember when I took Japanese in college, by far the hardest part was learning the kanji (the Japanese adaptation of Chinese characters).  Learning the stroke order and radicals was no fun and could literally take up 2+ hours a day in homework.  In Fluenz, everything is in Pinyin, which is a romanization of the characters' pronunciation.  In other words, it teaches you to speak and understand, but not really to read or write.  But I think that's OK for now.

I probably should have learned Japanese or even Tagalog, but I think Mandarin is going to be a very important language in the next few decades.  And who knows, maybe I can find a job in Hong Kong or Singapore.  You never know, maybe I could even work in Shanghai or somewhere in Taiwan (yeah yeah, I know, Taiwanese mostly speak Hokkien, and most people in HK speak Cantonese).  Regardless, knowing Chinese will open up some opportunities in my career.

It may even open up some possibilities in my understanding of Buddhism.  Many Buddhist sutras were preserved only in Chinese after Buddhism was essentially chased out of India.  It's unfortunate that many people don't realize how much the meaning of their religious scriptures has shifted through elements being lost in translation.  This is especially true of the Bible, which went from Hebrew and Aramaic to Greek to Latin to English.  You almost have to wonder how much of the original is left.  For Buddhism, I'd like to re-read some of the sutras in their Chinese form and see what they're like.  Of course, that will require reading comprehension, which this program doesn't teach.  Still, it'd be cool if I could read Buddhist sutras in Chinese (and from what I understand, even though Cantonese and Mandarin are pronounced differently, speakers of both read and write the same characters).


I consider myself pretty good at learning new computer languages, so let's see if that carries over to human languages.  The one thing that worries me is that I don't consider myself all that good at memorization.  In fact, I have a pretty horrible (short-term) memory.  I've always said I was extremely grateful that almost all the classes in my Computer Science curriculum were open note.  I think I'm pretty decent at learning new concepts, but my memory is not so hot.  And of course, learning new vocabulary is essentially 100% memorization.  The thing with human languages is that you can't think the way you do in a programming language.  In a human language, you don't translate from one language to another.  I always used to say in French, "Je dois traduire les mots français dans les mots anglais. Je ne pense pas vraiment en français" ("I have to translate the French words into English words. I don't really think in French").  Basically, you shouldn't translate words in your head between English and the other language.  It has to come naturally, without an internal translation.  That requires more than memorization; it requires making the language a part of yourself... and for me, that's difficult.

The other thing I'm a little fearful of is pronunciation.  Mandarin is a tonal language with four tones, which means the meaning of a word changes with how it is pronounced.  I've been complimented on my French pronunciation before, so I'm hoping I have a good ear for the tones.  Producing the proper tones myself, however, might be a little harder.

Sunday, June 9, 2013

A little foray into induction and persistent data structures.

Part of the reason I am working on the C++ Data Structures and Algorithms project is to get back to the basics and master them.  Some recent setbacks have caused me to re-evaluate my knowledge of Computer Science fundamentals.  My two previous posts on C++11 and data structures are the start of an ongoing commitment from me in this regard.

Unfortunately, working at the hardware level has eroded a lot of my computer science fundamentals.  In the Linux kernel, for example, you almost exclusively work with linked lists.  The scheduler uses red-black trees for fairness, but I have never had to touch the scheduler.  And somewhat surprisingly, there is nary an associative array (hash, dictionary, map, etc.) to be found unless you're working with the filesystem.  And in firmware, good luck.  You're pretty much stuck with arrays (used as stacks, queues, or ring buffers) and maybe some binary search trees or a priority queue.  The only time in my 6.5-year career when I've worked extensively with more advanced data structures was in creating a dependency resolver for installing software for test automation.

While my basics in Computer Science might have eroded, I see many otherwise smart people who lack curiosity.  It is my curiosity that drives me to learn new things.  I think curiosity is something you either have or you don't.  Smart people might pick up a skill more quickly, but few, I think, have the drive to learn more and dig deeper.  I am constantly wanting to learn new skills, and perhaps that is my downfall, as it tends to scatter my attention.  That is why I am going to stick with getting these data structures and algorithms down.

I recently had two little confidence boosters.  The first was solving a problem while reading Python Algorithms by Magnus Lie Hetland.  At the end of a section on induction, it asks you to prove Euler's formula relating vertices, edges, and faces using induction.  The equation is:

V - E + F = 2

V and E are vertices and edges respectively, while F is a face (or field).  You can think of a field as "space" enclosed by edges, plus the "universe" outside all the vertices and edges.  Take for example this simple graph:



It has V = 2, E = 2, and F = 2 (the "universe" field, and the field captured by the space partitioned by the edges/vertices).  And as you can see, Euler's formula holds (2 - 2 + 2 = 2).  But how do you prove that inductively?  The equation V - E + F = 2 doesn't look like your typical induction target, since there are 3 variables!

Or are there?  If you increase the vertex count by one, you must also increase the edge count by one (the new vertex has to connect to the existing graph).  Adding one vertex and one edge does not increase the field count, because no new field gets enclosed.  Therefore we are adding one vertex but canceling it with a new edge:

V - E + F = (V+1) - (E+1) + F = 2



What if we add a new edge instead of a new vertex?  A new edge between existing vertices closes off a new field.  So again the terms cancel: we subtract an edge but add a field:

V - E + F = V - (E+1) + (F+1)


And since there is no way to add a field without adding an edge, we have all our bases covered.  I was racking my brain on this one for about 30 minutes and decided to let it rest.  I was cooking dinner when I had that eureka moment and thought, "oh yeah, that's the answer".  I don't know of any engineer or scientist who has not had one of these Archimedean eureka moments.  It's kind of freaky when you think about it... the answer just comes out of nowhere.  While these out-of-the-blue answers are ultimately a good thing, it sucks during job interviews when you can't think of an answer and it only comes to you later.
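To convince myself, I wrote a quick Python sanity check of the two induction steps.  This is just my illustration of the argument above, not anything from the book:

```python
def euler(V, E, F):
    """Euler's formula: should always equal 2 for a connected planar graph."""
    return V - E + F

# Base case: a single vertex, no edges, and only the "universe" field.
V, E, F = 1, 0, 1
assert euler(V, E, F) == 2

# Step 1: add a vertex plus the edge connecting it (no new field is enclosed).
V, E = V + 1, E + 1
assert euler(V, E, F) == 2   # matches the simple graph above: V=2, E=2... almost

# Step 2: add an edge between existing vertices, which closes off a new field.
E, F = E + 1, F + 1
assert euler(V, E, F) == 2   # now V=2, E=2, F=2, just like the example graph
```

After both steps we end up at exactly the V = 2, E = 2, F = 2 example graph, and the invariant held the whole way.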

My second confidence-boosting moment was when I realized one way to create a persistent tree structure.  Persistent data structures are more common than you may think.  Take for example a revision control system like git.  You can envision your code edits as nodes in a tree.  When you create a new branch, you aren't just creating a "branch on a tree".  What you are really doing, in effect, is replicating the old tree and sticking a new branch on the new tree.  Why do that?

If you mutate your original tree, how do you go back?  What if you delete a branch far up the "arm" of the tree, but later you want a twig that was hanging off that branch?  With a mutable tree, it can't be done, because the branch containing the "twig" is gone.  So how do you solve this?  You create a "new" tree, and it is this new tree that has branches added or pruned.  When you do your merge, you "graft" or "prune" the branches from this new tree onto the original.  I had a lot of trouble grokking functional programming and immutable data structures until I read an analogy in The Joy of Clojure by Michael Fogus and Chris Houser.  They compare functional languages with immutability to the flip books that create the illusion of movement.  In immutable languages, every page is a "version" of the data.  In mutable languages, you have one piece of paper, and every time you want to change something, you have to erase the picture and start over.  There's no "versioning".

Now, that might sound like a huge performance hit in terms of both speed and size.  Creating a duplicate tree wastes space, not to mention copying a whole bunch of stuff around.  But that's a naive approach.  If you think about it, you don't have to duplicate the entire tree.  When you insert a node into a tree, what do you have to do first?  You have to traverse the tree to find the insertion point.  By doing this, you have already walked a branch of the tree... and that's all you need to duplicate (in order to get to the new node to insert).  In fact, you don't even have to "duplicate" this branch (which is, for all intents and purposes, a linked list of nodes).  In other words, you don't need to copy the branch; you just need new pointers to the nodes making up the branch.

You can create a (shared) pointer for each of the nodes you have walked through.  Once you get to the node where you will insert the new node, you create two new nodes.  You create a new node for the insertion point, and a new node for the actual node to be inserted.

This requires log n new shared pointers (not actual node objects, just pointers to the nodes), and creation of just 2 new nodes (versus mutating one in a mutable tree).  This is much better than duplicating an entire tree of n nodes.  You now have 2 trees: the original, and a new tree which is exactly the same as the original except for the new node.  And now you get all the advantages of an immutable, persistent tree data structure.  Look at this graphical example:




Basically, what you are "duplicating" is a linked list that leads to the newly inserted node.  The green circles represent new pointers to already existing objects.  Now you have two "heads", one in blue and one in green.  The blue head does not have a node with a value of 11, but the green head does.  Since this was a small tree, the number of new nodes is fairly large in proportion.  On average, you will only need log n + 1 new nodes.  I have read that there is a way to implement this with O(1) new nodes, but I have not studied that implementation.
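Here is a little Python sketch of the path-copying idea.  My eventual version will be in C++ with shared pointers; this sketch simply recreates the nodes along the search path instead, but the important part, sharing the untouched subtrees between the old and new roots, is the same:

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def insert(node, value):
    """Return the root of a NEW tree containing value.

    The original tree is untouched: only the nodes along the search
    path are recreated (O(log n) of them); every subtree off that path
    is shared between the old root and the new root.
    """
    if node is None:
        return Node(value)
    if value < node.value:
        return Node(node.value, insert(node.left, value), node.right)
    return Node(node.value, node.left, insert(node.right, value))


# Build the tree from the diagram: 8 at the root, 4/12 below, then 2/6.
old = None
for v in [8, 4, 12, 2, 6]:
    old = insert(old, v)

new = insert(old, 11)               # "green head": a tree that contains 11

assert new.right.left.value == 11   # the new tree has 11 under 12
assert old.right.left is None       # the "blue head" is completely unchanged
assert new.left is old.left         # the untouched subtree is shared, not copied
```

The last assertion is the whole point: the subtree rooted at 4 exists exactly once in memory, yet belongs to both versions of the tree.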

C++11 multi-threaded data structures and algorithms using waf: Part 2, waf build system

This is part 2 of my C++11 data structures and algorithms blog series.  You can see the first part here.  I have also finally uploaded the code to my Bitbucket account, which you can see here:

https://bitbucket.org/Dauntless/shi

So as I mentioned in the previous post, I have been writing a small C++ library of data structures and algorithms.  The only C++ I've done recently was a little ioctl library for one of the drivers at work, and that wasn't much code.  Also, I really want to get into the habit of doing Test-Driven Development.  I actually went out and bought two books on TDD, both from the Pragmatic Programmers:

Modern C++ Programming with Test-Driven Development: Code Better, Sleep Better by Jeff Langr
Test Driven Development for Embedded C by James W. Grenning



One of the first things I've been tackling is the build system.  I think one of the most confusing parts of building native apps is the build system.  Makefiles aren't taught all that well in school, nor for that matter is the idea of logically breaking a large project into modules.  No matter the build system, they are either ugly and super hard to debug when you get them wrong, or a mysterious black box whose workings you have no clue about.  And there are a ton of build solutions out there, so which one should you use anyway?  Do you care about cross-platform capability?  Ease of use?  Ability to debug?

I considered learning CMake, as it seems to be pretty popular nowadays, but I decided to learn waf instead.  Why waf?  Because waf is just a set of Python libraries, and since it's Python, your build scripts have a full, Turing-complete programming language at their disposal.  Having a build system that is also a library offers several advantages.

For example, my project requires the Google Mock framework.  I am writing the waf build script so that if it doesn't find the Google Mock library on your system, it can download and install it for you.  Try doing that with a Makefile, Visual Studio solution, Eclipse project, or CMake.  I could also programmatically run the GMock tests when a build finishes.  In my opinion, it's probably the closest a C(++) build system can get to a Java Maven build.  Admittedly with more elbow grease required, but at least it's possible.

The waf build system is actually relatively easy to pick up... for the basics, at least.  Like most build systems, it breaks a software project into different phases.  For example, there's a configuration phase where you set up compiler options and dependency checks, a build phase to actually generate the binaries, and an install phase to install the binaries onto the user's system.  You can also create your own commands, and that is what I will do to hook running the GMock tests into the build.

The key concept to understand is that waf uses a wscript file as the build script.  This is a Python script, but without a .py extension.  Because the script is invoked by waf, you don't even have to import anything.

My actual project directory looks like this:

/home/sean/Projects
  /shi
      /src
          /gui
          /algos
          wscript
      /include
      /templates
      wscript


Wait! Why are there two wscripts?  Normally you only want one entry point for your build, but it may not make sense to have to change into a particular directory to run the build script.  For example, the configuration doesn't need to live in a source directory.  Notice that the top-level wscript shown below has a function called build(), and this function calls bld.recurse().  That's where the second wscript comes in.  When you call recurse(), the parameter is a directory, and waf will call the same function in that directory's wscript as the one from which recurse() was called.  Since recurse() is being called from the build() function, it will look for

./src/wscript

and call the build() method defined in _that_ wscript.  So with that said, let's look at the wscript in the top-level project directory.

'''
This is the waf build script for shi
'''
import os
import urllib2   # will eventually be used to download gmock


top = '.'
out = 'build'


def notImpl(fn):
    '''Decorator that stubs out functions I haven't written yet'''
    def wrapper(*args, **kwargs):
        print "TODO: {0} is not yet implemented".format(fn.__name__)
    return wrapper


def find_gmock(ctx):
    '''
    Find the gmock/gtest library
    '''
    print "trying to find Google gmock"
    if "GMOCK_HOME" in ctx.env:
        print "GMOCK_HOME is in env"
        return True

    has_gmock = False

    if ctx.options.gmock and \
       os.path.exists(ctx.options.gmock):
        has_gmock = True
        ctx.env['GMOCK_HOME'] = ctx.options.gmock
    else:
        print "ctx.options.gmock is", ctx.options.gmock

    if not has_gmock:
        getGmock()
        ctx.fatal("Could not find gmock/gmock.h")

    return has_gmock


@notImpl
def getGmock(version=None):
    '''
    Will retrieve the google gmock source
    '''
    pass


@notImpl
def find_boost(ctx):
    '''
    Searches for the boost library
    '''
    pass


def configure(ctx):
    has_gmock = find_gmock(ctx)
    if has_gmock:
        ctx.env.GMOCK_INC = ctx.env.GMOCK_HOME + "/include"
        ctx.env.GMOCK_LIB = ctx.env.GMOCK_HOME + "/lib"
        ctx.env.GTEST_HOME = ctx.env.GMOCK_HOME + "/gtest"
        ctx.env.GTEST_INC = ctx.env.GTEST_HOME + "/include"
        ctx.env.GTEST_LIB = ctx.env.GTEST_HOME + "/lib"

    ctx.find_program(ctx.options.compiler, var="CLANG", mandatory=True)
    ctx.env['CXX'] = [ctx.options.compiler]
    ctx.load("compiler_cxx")


def options(opt):
    opt.add_option("--compiler",
                   action="store",
                   default="clang++",
                   help="compiler to use (clang++ or g++)")
    opt.add_option("--gmock",
                   type="string",
                   dest="gmock",
                   help="location of google gmock library or GMOCK_HOME env var")

    opt.load("compiler_cxx")


def build(bld):
    bld.recurse('src')  # calls build() in ./src/wscript


Hopefully all of the above seems reasonable.  But if not, in essence: the waf system needs to configure the environment in which to do the build (including checking for dependencies), and then it has to actually perform the build.  The configuration of your project is done in the configure() function.  When you run this phase, you call it like this:

    ./waf configure

Or, if you do not have gmock in a standard location, you can run the configure stage like this:

    ./waf configure --gmock=/home/sean/Downloads/gmock-1.6.0 --compiler=g++

Currently, my build doesn't download and install gmock if you don't have it, but that's one of the nice things about waf: if you wanted to, you could.  There are some other things it doesn't check for, like CMake (needed to build gmock), or actual verification of the directory passed in via --gmock.  Creating a dependency system is no small task (trust me, I've done it before), but it's an interesting one involving some fun computer science skills (graph traversal, transactions, and cycle detection, for example).

The configuration is very similar to running ./configure in a typical autotools C(++) program.  It makes sure that your build environment has all the required development tools, and sets up any environment variables required for building.

Once your system has been configured, you'll obviously want to do a build.  This is the step that actually generates your binaries.  Technically waf can differentiate between debug builds and release builds, but I'm doing a release build here.  In order to do a build, you just run the command like this:

    ./waf -v build

The -v is just a verbose flag, but it's useful if a compile fails.  The actual binary gets generated in a folder that you specify in your wscript via a variable named out.  waf creates sub-folders matching the locations of the recursively called wscripts.  For example, the top-level wscript calls ./src/wscript, so the actual binary gets put into ./build/src.  If your second wscript were in ./source, the binary would end up in ./build/source.

So speaking of the second wscript, what does it look like?


top = '.'

import os


def build(bld):
    '''
    Generates the test and gui executables
    '''
    PRJ_DIR = os.getcwd()
    INCLUDES = [".", "..",
                PRJ_DIR + "/includes",
                PRJ_DIR + "/templates",
                bld.env.GMOCK_INC,
                bld.env.GTEST_INC]

    LIBPATH = ["/usr/local/lib",
               bld.env.GMOCK_LIB,
               bld.env.GTEST_LIB]

    bld.program(
        source = 'gtests/main.cpp',
        target = 'node_test',
        includes = INCLUDES,
        lib = ['pthread'],
        libpath = LIBPATH,
        stlib = ['gmock', 'gtest'],
        stlibpath = ['/usr/local/lib'],
        cxxflags = ['-g', '-Wall', '-std=c++11'])

    bld.program(
        source = 'gtests/bintree_test.cpp',
        target = 'bintree_test',
        includes = INCLUDES,
        lib = ['pthread'],
        libpath = LIBPATH,
        stlib = ['gmock', 'gtest'],
        stlibpath = ['/usr/local/lib'],
        cxxflags = ['-g', '-Wall', '-std=c++11'])

    bld.program(
        source = 'gui/firstSFML.cpp',
        target = 'ogl-gui',
        includes = [PRJ_DIR,
                    "/usr/local/include"],
        lib = ["sfml-graphics", "sfml-window", "sfml-system"],
        libpath = ["/usr/local/lib"],
        cxxflags = ['-g', '-Wall', '-std=c++11'])


Hopefully that doesn't look too bad if you're familiar with makefiles.  Currently, none of the programs require other object files, mainly because I am using templates.  Because of the templates, I have to include the source rather than generate an object file that another object links against.  It's just one of the limitations of templates, and perhaps somewhat unintuitively, I have made my templates .hpp files rather than .cpp files (since they are included into other source rather than compiled and linked as separate objects).  If you do have objects to compile, you can use bld.objects() instead of bld.program(), and other task generators that need them declare a use keyword argument whose value is the objects' target name.  For example:

bld.objects(
    source = "some_file.cpp",
    target = "some_obj",  # other task generators refer to these objects by this name
    includes = INCLUDES,
    lib = ['m'],
    cxxflags = ['-Wall', '-std=c++11'])

bld.objects(
    source = "another_file.cpp",
    target = "another_obj",
    use = ["some_obj"],  # pulls in some_obj's objects and propagated flags
    cxxflags = ['-Wall', '-std=c++11'])


Just as with make, sometimes you want to clean everything.  In waf, you just run:

    ./waf clean

And it will recursively delete the contents of your build folder.  There's also distclean, which additionally forces you to run configure again, because waf caches configuration data to avoid rebuilding everything.

So that's it for now on waf.  As I make improvements to my build scripts, I'll cover them in a future blog post.