Friday, May 11, 2012

EBNF and lexer time

So, what's the first step in making a language?  Honestly, I don't know :)  I'm kind of winging this as I go.  But the starting point seems to be:

1) Decide on your features
2) Decide on your grammar
3) Make an EBNF to describe the language
4) Create a lexer
5) Create a syntax parser
6) Create a compiler/interpreter/VM

I have a rough feature set in place, so I think the next logical step is to create the CFG and/or EBNF that describes the grammar.  What's a CFG and an EBNF?  A CFG is a context-free grammar.  I can't go into too much detail here, but if you pick up a book on formal languages, it should describe what these are.  EBNF is Extended Backus-Naur Form, which is a recursive notation for describing the structure of a language.  Given an EBNF, parser generators like ANTLR or Bison can spit out a parser for you.
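To make this concrete, here's a hypothetical fragment of what an EBNF for a small scheme-like language might look like.  This is my own sketch, not the actual grammar for Shi:

```ebnf
program    = { expression } ;
expression = atom | list ;
list       = "(" , { expression } , ")" ;
atom       = number | symbol ;
number     = digit , { digit } ;
symbol     = initial , { subsequent } ;
initial    = letter | "+" | "-" | "*" | "/" | "<" | ">" | "=" ;
subsequent = initial | digit ;
```

Notice how recursive the definition is: a list contains expressions, and an expression can itself be a list.  That recursion is exactly why a regular expression alone can't parse it, and why you need a context-free grammar.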

But let me step back a moment.  What is the difference between a lexer and a parser?  And what is a lexer anyway?  One of the first things that a compiler or interpreter must do is recognize the tokens in a string.  In fact, "lexer" and "tokenizer" are basically synonymous: a token is a discrete lexical unit (see why the two are the same?).  Take for example this sentence.  The words "Take", "for", "example", "this", and "sentence" would all be tokens.  Does white space always delineate tokens?  Not necessarily, and often things other than white space can delineate tokens.  Take this example:

int i = 2+3;

What are the tokens? [ int, i, =, 2, +, 3, ; ].  Notice that no white space separates 2+3, and yet those are three separate tokens.  This is why you need a lexer, and ideally a lexer generator like lex or flex (YACC and Bison, which often get mentioned in the same breath, actually generate parsers, not lexers).  But how do you make a lexer?  If you have used regular expressions before, tokenizing looks like a regex problem, doesn't it?  But how do regexes work?  Fundamentally, a regular expression defines a "regular" language (a language is regular precisely when some regular expression can be written to recognize all the strings that language produces).  In turn, regular expressions can be converted into NFAs (non-deterministic finite automata), which in turn can be converted into DFAs (deterministic finite automata).

If you are curious about regular languages, regular expressions, NFAs, DFAs, pushdown automata and context-free grammars, then I recommend you pick up a good book on Automata Theory.  People familiar with finite state machines and graph theory really won't have any problem picking it up.  To extremely oversimplify things, an NFA has states, one of which is the start state, and one or more of which (possibly including the start state) are accepting states.  You transition between states by "consuming" an element of the string, or possibly without consuming anything at all (the epsilon transition).  The "consumption" of the characters in the string leads you from state to state, and once the string is fully consumed, either you are in an accepting state or you are not (or possibly a character in the string had no transition, in which case the string is not recognized).  They are called non-deterministic because a state can have more than one possible transition on the same input.  A DFA, on the other hand, has exactly one transition per state for each input symbol.  So from a graph point of view, a DFA is a directed graph where each vertex has only one outgoing edge per input symbol (but possibly many incoming edges).

Ok, so that's all well and good...but how do you actually PROGRAM one?  Well, that's what I intend to do :)  In the next few blogs, I'll put up some of the code that actually performs the lexing for a scheme-like grammar.  Originally, I had thought about writing this in scheme, but since LLVM is written in C++, I might do it in C++ instead.  But I'll probably eventually do this in scheme anyway, just to get some practice with it.  I'll also intersperse this with the beginnings of an EBNF for Shi.  It seems to me that the EBNF is really the first thing I should be doing, as it will define the valid tokens that the lexer has to recognize.

I'm a developer again

It's official: in the next few weeks, I'll be doing Linux driver development at my company.  I hope I can bring some of my testing experience into this position.  Having now been on both sides, I can definitely say that all developers should have a testing background, and all testers should have a development background.

While reading some scheme paper (I can't recall which one), there was a quote by Richard Feynman: "What I cannot create, I do not understand".  I think this really captures something important.  It is in fact why I studied Computer Science and not Electrical Engineering.  A long time ago I read a saying describing, in a nutshell, the difference between scientists and engineers: "Scientists build in order to learn.  Engineers learn in order to build".  By that criterion, I am most definitely a scientist.  I want to build things so that I understand them.  My end goal is not actually whatever I built, but what I learned.

So getting back into driver development will help me understand better how operating systems work.  Creating my own language will help me better understand the theory of computation.  Implementation is, in my eyes, a necessary evil; a means to an end, but not the end itself.  Without rolling up your sleeves and getting your hands dirty, you won't really understand something.

This is also the key to Buddhism.  People are often surprised when I tell them Buddhism is not a religion or a philosophy.  To the western mind, this doesn't seem possible.  So people ask me: if Buddhism eschews beliefs (and so is not a religion) and also distrusts concepts (and so is not a philosophy), what could Buddhism possibly be?  When I say that a true Buddhist is a mystic, most are truly confused.  What is a mystic, you ask?  A mystic trusts only his experience and awareness.

A Buddhist doesn't ruminate or contemplate.  The only way to "know" is to be aware.  It is the act of "being", of simply being conscious of this moment.  The only way to "know" life is to be 100% fully in it.  One doesn't "get" life by simply regurgitating what prior masters said; that is no different from claiming to "know" math by reading a book about it.  The only way to know life is to roll up your sleeves and live it.  It is not to be gained through mental fortitude, nor steadfast belief.

So although I am nervous about dealing with customers again (that's definitely one nice thing about being in Test, we don't deal with clients directly), this is really something I needed to do.  I am looking forward to digging deeper, and being able to say, "what I have built, I understand".

Saturday, May 5, 2012

How do you make a language? Good question...

So here I am, trying to figure out how to make my own scheme-like language, but I am not really sure where to start.  I actually never took compiler theory in school, but Automata Theory was a required class.  But even though I did take automata theory, that was many moons ago, and I have since forgotten a lot about it.  I mean, what DOES it take to create a language anyway?

How much do I need to know about lexers and tokenizers?  Does my language have to be understood by a LALR parser?  An LL parser?  And how do I make a parser anyway?  Do I have to use something like YACC and Bison, or maybe ANTLR?  What about my EBNF forms, how do I know they are complete?  Does the language have to be a context free grammar?  What does context free mean anyway?  Is a context free language different from a regular language?

And these are really more just grammar production and syntax questions.  What about creating control forms, concurrency support, tail-call optimization, etc etc that are features of the language?  Where does one even begin when trying to design a language?

That's why I've decided to look at the R6RS scheme reference as a starting point.  This language (which I am thinking of calling Shi, the Chinese transliteration of the Sanskrit word Vijnana, which very loosely translates to "mind") won't actually be a scheme per se, but it will be scheme-like, just as Clojure isn't exactly a lisp or scheme (although the more I learn about scheme, the more Clojure seems like a scheme, since it is a lisp-1 and more functional in nature).

However, that will only get me so far.  For example, what other features in the language need to be implemented?  What exactly is the goal of this language?  So I decided to list down some of the things I wanted to implement:


  1. Persistent data structures
  2. Lazy evaluation by default
  3. Dynamically typed by default, but with type hints
  4. Tail call optimized
  5. JIT'ed
  6. Support for continuation passing style
  7. Support for C FFI
  8. Some kind of concurrency support (debating between STM and message passing)
  9. Garbage collected
  10. lisp1 style lexically scoped with one namespace
  11. Hygienic macros only

Some features I'd eventually like to implement (but probably in libraries)

And that's for starters.  The design decisions above will impact the implementation.  From what I've read about LLVM so far, it looks like the LLVM IR will give me support for #5, #7 and #9 above.  It will also provide #4, but only on x86(_64) and PowerPC (but not ARM...dammit).  #3 will be interesting; I'll have to think about how to do it (I'll probably peek at the Clojure source code to see how they do it, as I think it's a pretty cool feature).  The persistent data structures will have to be provided at a somewhat higher level.  Although LLVM provides primitives for immutability, that is different from actually implementing persistent data structures.  Red-black trees are often used to implement associative arrays, for example, but I still have to figure out how to make such a structure persistent.

Even just figuring out what goes into making a programming language has been pretty fascinating so far, and I have only scratched the surface.  Ultimately, what fascinates me is the theory of computation itself, and I hope that creating my own language based on scheme will give me a greater insight into the lambda calculus and computation itself.

How does scheme do tail call optimization?

Last night, I was curious how to implement tail-call optimization for Shi (the language I am going to work on), and how current scheme implementations do it.  Since many schemes are implemented in C, how do you do tail call optimization if C itself doesn't guarantee it?

But first, what does TCO really do anyway?  And why do so many C(++) programmers lambast functional style recursion?  I have to laugh when FW engineers at my company poo-poo recursion.  Unfortunately, engineers who are not familiar with other languages aren't even aware that it's not recursion that is at fault, but the lack of sophistication of the C(++) compiler.

The unenlightened think that all languages suffer stack overflows from recursion.  This is not true, however.  In brief, when a function call is made, a stack frame is allocated on the call stack, and the call stack has a limited amount of memory.  One of the duties of a stack frame is to hold a return address, so that as one function call completes, the frame is popped off and the program can return to where execution left off.  So in recursion without TCO, a new stack frame is allocated for every function call, and this is why you can "blow the stack".  This is one reason C(++) programmers claim recursion is so bad.  In truth, I think most imperative-style programmers simply don't want to wrap their brains around recursion (or, if you think lazily, induction).  The other mythical reason imperative-style programmers claim recursion is bad is that procedure calls are expensive (because you have to push a new frame onto the call stack).  This too is erroneous.

They are myths in general, even if they largely hold from a C(++) point of view.  But don't make the mistake of thinking that recursion or function calls themselves are bad; that mistake may dissuade open-minded programmers from attempting functional-style programming if they only listen to their unenlightened peers.  For example, the D programming language DOES do tail call optimization.  But how did these myths come about in the first place?

In one of the important "Lambda Papers" by the legendary Guy Steele, titled somewhat verbosely "Debunking the 'Expensive Procedure Call' Myth, or, Procedure Call Implementations Considered Harmful, or, Lambda: The Ultimate GOTO", Steele explains why function calls are not as expensive as believed.  As the title somewhat indicates, the paper gives a historical account of how function calls got a bad rap in the first place, along with an interesting perspective on the attitudes towards GOTO from way back in the day.  I recommend people read it.  Germane to this discussion, Steele debunks the notion that function calls are "expensive" by pointing out three things:

1) GOTO statements are "universal" control flow statements
2) GOTOs are cheap, because in machine code they are just a branch or a jump (as opposed to, say, a switch or case, which becomes many machine code ops)
3) Procedure calls are in essence GOTOs that can pass arguments

Given the above three, function calls are therefore "cheap" and become control flow in their own right.  Interestingly, the paper mentions that stack space need not be consumed when using this "GOTO" method for tail recursion when lexical scoping is used (as opposed to the dynamic scoping of older lisps).  I presume this is because when variables are lexically scoped, the stack frame itself carries the reference(s) to the variables, as opposed to having variables passed around dynamically.  I could be mistaken on this point though.

So, TCO makes it possible to not consume a stack frame on every function call.  And although the paper hints at how to do this from an assembly point of view, that still doesn't explain how schemes that use C as the intermediate representation perform TCO, since C itself doesn't guarantee it.  At first, I thought maybe they used setjmp/longjmp to save off the stack context, and then on the recursive call use longjmp to go back.  The problem is the unwinding of the stack frames (which may point to no-longer-valid frames).  Still, it seems at least possible to do it this way.

I then came across something called a "trampoline", which I recognized somewhat from Clojure.  A trampoline in Clojure can be used for mutual recursion, but the trampoline described here is a function which "jumps" to other functions.  After reading this, I also came across an abstract discussing how to use the heap to implement tail calls.

This is all pretty fascinating, and I wish I had read the lambda papers before.  I am even starting to read the legendary SICP book so that I can better understand Scheme.  I guess Clojure was kind of a gateway drug into functional programming and lisps :)  It's even made me look a little at Haskell...but first, I want to get Shi rolling.