Under A Boddhi Tree: EBNF and lexer time

So, what's the first step in making a language? Honestly, I don't know :) I'm kind of winging this as I go. But the starting point seems to be:

1) Decide on your features
2) Decide on your grammar
3) Make an EBNF to describe the language
4) Create a lexer
5) Create a syntax parser
6) Create a compiler/interpreter/VM

I have a rough feature set in place, so I think the next logical step will be to create the CFG and/or EBNF that describes the grammar. What's a CFG and EBNF? A CFG is a context free grammar. I can't go into too much detail, but if you pick up a book on formal languages, it should decribe what these are. The EBNF is the Enhanced Backus Naur Form which is a recursive description of the structure of a language. Given an EBNF, there are parser generators like ANTLR or BISON which can spit out parser generators for you.

But let me step back a moment. What is the difference between lexer and a parser? And what is a lexer anyway? One of the first things that a compiler or interpreter must do is recognize the tokens in a string. In fact, lexers and tokenizers are synonymous basically. A token is a discrete lexical unit (see why the two are the same?). Take for example this sentence. The words "Take", "for", "example", "this", and "sentence" would all be tokens. Does white space always delineate tokens? Not necessarily. And often, things other than white space can delineate tokens. Take this example:

int i = 2+3;

What are the tokens? [ int, i, =, 2, +, 3, ;]. But notice that no white space separates 2+3, and yet those are 3 separate tokens. This is why you need a lexer, and ideally you need a lexer generator like YACC. But how do you make a lexer? If you have used regular expressions before, this kind of looks like a regex doesn't it? But how do regexes work? Fundamentally, a regular expression defines a "regular" language (a language is regular because a regular expression can be written to recognize all strings producible by that language). In turn, regular expressions can be converted into NFAs (non-deterministic finite automata) which in turn can be converted into DFAs (deterministic finite automata).

If you are curious about regular languages, regular expressions, NFAs, DFAs, pushdown finite automatas and context free grammars, then I recommend you pick up a good book on Automata Theory. People familiar with finite state machines and graph theory really won't have any problem picking up on automata theory. To extremely oversimplify things, an NFA has states, one of which is a starting state, and one or more states (including possibly the start state) is an accepting state. States can be transitioned to by "consuming" an element in the string, or possibly not consuming anything at all (the epsilon transition). The "consumption" of the characters in the string leads you to a state, and once the string is fully consumed, either you are in an accepting state or not (or possibly a character in the string has no transition, in which case it's an unrecognized string). They are called non-deterministic, because each vertex (or state) can have more than one possible transition. A DFA on the other hand can only have one possible transition per state. So from a graph point of view, in a DFA, you have a directed graph with only one outgoing edge per vertex (but possibly many incoming edges).

Ok, so that's all well and good...but how do you actually PROGRAM one. Well, that's what I intend on doing :) In the next few blogs, I'll put up some of the code that actually performs the lexing for a scheme like grammar. Originally, I had thought about writing this in scheme, but since LLVM is written in C++, I might do it in C++ instead. But, I'll probably eventually do this in scheme anyway, just to get some practice with it. I'll also intersperse this with the beginnings of an EBNF for Shi. It seems to me like the EBNF is really the first thing I should be doing, as it will contain valid tokens that the lexer has to recognize.

Under A Boddhi Tree

Friday, May 11, 2012

EBNF and lexer time

No comments:

Post a Comment