This weekend I added variables and arguments to the LOGO interpreter for mobile phones I’m working on. This was much more difficult than I was anticipating.
Originally, I wrote the interpreter parser using the CUP LALR parser generator for Java. When I was evaluating whether to write the parser by hand, or use a parser generator, I decided having the ability to quickly and easily mess with the grammar was a big plus.
If you’re not familiar with parser generators like CUP or yacc, they turn compiler development into something fun. Instead of juggling finite automata and parser tables in your head, you just specify a language grammar, along with a set of semantic actions, and out comes a parser. In the case of my interpreter, the parser is outputting an abstract syntax tree (AST), which is a tree representation of the LOGO code. The interpreter then evaluates the program by walking the tree.
For example, consider the following CUP grammar rule for a REPEAT command:
loop ::= REPEAT INTEGER:n list:l
{: RESULT = new LogoRepeat((LogoCmds)l, n.intValue()); :}
;
This tells the parser that when it sees a REPEAT command in the token stream, it should emit a new AST node (LogoRepeat). LALR parsers are bottom up parsers. The AST is built up from the leaves to the root node.
So far so good. Using CUP I was able to quickly prototype the interpreter, and focus on the interface and device porting work. Then I tried adding variables and procedure arguments.
LOGO is a functional language based on LISP, except without the parentheses. LISP, aka Lots of Irritating Superfluous Parentheses. Turns out those parentheses aren’t that superfluous.
Consider the following procedure call in LOGO:
SUM SIN 90 3
This is fairly easy to parse, SIN takes one argument, SUM takes two, so the parse should look like this:
(SUM (SIN 90) 3)
The problem comes when we let users define their own procedures. How do we parse the following:
foo bar 1 2
There are three possible parses, depending on the number of arguments each procedure takes, assuming it’s a valid statement in the first place:
(foo (bar 1 2))
(foo (bar 1) 2)
(foo (bar) 1 2)
So, the correct parse depends on the number of parameters, or arity, of the function. Unfortunately, that information is not available at the level of the syntactic parser, it’s gained through semantic analysis. The LOGO language is not context-free (this is equivalent to example 4.12 in the first edition of the dragon book. Languages like LISP and C have those wonderful parentheses indicating the begin and end of procedure calls.
So, how do we fix this? One solution is to feed semantic information into the syntactic parsing stage. Procedures and variables are kept in a symbol table, which can be referenced in semantic actions while parsing to disambiguate the grammar. This complicates the simple mapping of the grammar to the AST, reducing the readability of it. I didn’t find a way of dynamically modifying the parser with CUP or JFlex, so I started exploring some other Java parser generators. Unfortunately, ANTLR uses several Java foundation classes that are not included in the MIDP 2.0 standard.
So, it’s back to writing the parser by hand. Which is actually a lot more fun than I first thought. Steve Yegge is right, if you don’t know compilers, you don’t know computers. For that matter, you don’t really know programming languages either. Compiler design touches every aspect of computer science. It also changes the way you think. Programs stop being static objects you write, and turn into dynamic evolving structures.
This is exciting! I’m writing a Logo interpreter in JavaScript to run in web browsers, and I’m running into the same problem. Or I was before I got sidetracked addressing the separate problem of long scripts hanging the browser. My Logo project was my reason for writing this:
http://www.tumuski.com/code/clumpy/overview/
I too had tried the LALR route using JSCC (http://jscc.jmksf.com/), but quickly realized the same thing you did. My next thought, since there are so many flavors of Logo out there, was to try to invent a new context-free Logo syntax.
One idea I had was to limit all functions to one argument, but let that argument be a list where necessary. So you’d write:
SUM (SIN 90 3)
SUM takes one argument: a list with two elements: SIN 90 and 3. I thought about also adding some syntactic sugar to allow user functions to treat the list elements as separate arguments.
E-mail me if you’d like to collaborate on this. I plan to either license my interpreter under the MIT license, if that makes a difference.
Pingback: Logo in JavaScript — Tumuski