Building a simple Parser and Lexer in PHP


Compiler design is a complex endeavor, but also one of the most satisfying projects you can undertake. Lately I’ve been interested in compiler and parser design; my interest piqued by Debasish Ghosh’s wonderful book, DSLs in Action. Web development in general provides a far less opportunity to work in the domain of compiler or interpreter design. So the idea of building DSL’s was the perfect excuse for learning parser design.

To design a good DSL it is necessary to have some compiler design knowledge under your belt. The classic Dragon book is usually the first choice of reference to pickup on compiler design, but it is too theory oriented, and what I needed was something to get me started with writing code quickly, rather than mull over automata theory, syntax trees and other compiler theory concepts. My first choice was Ronald Maks Writing Compilers and Interpreters. Although it is an incredible book, it is rather elaborate in its presentation. What I instead wanted was a cookbook style presentation, using which I could quickly design some working code, and only later refer to the above books for more in-depth understanding.

After looking around for a while I settled for Terence Parr’s Language Implementation Patterns. This is exactly what I needed – bit sized patterns on compiler and parser design with working code. The book provides a recipe style approach, gradually moving from simple to complex compiler/parser design issues. As I primarily work with PHP, I thought of porting some code to PHP to see how it works.

A simple Lexer

Below are some of the few recipes from the initial chapter of the book, on designing a LL(1) lexer and parser. The language we want to recognize is an extremely simple one – a list of names like [a,b,c,d,e]. A LL(1) lexer for the same is shown below.

test_lexer.php

<?php
 
require_once('ListLexer.php');
require_once('Token.php');
 
$lexer = new ListLexer($argv[1]);
$token = $lexer->nextToken();
 
while($token->type != 1) {
    echo $token . "\n";
    $token = $lexer->nextToken();
}
 
?>

Testing the lexer from the command line is given below with the language string passed as a argument. The lexer just outputs the tokens of the language, throwing an exception if a character outside the language is encountered.

D:\localhost\test\lexer>php test_lexer.php [a,b,c,d]
<'[',LBRACK>
<'a',NAME>
<',',COMMA>
<'b',NAME>
<',',COMMA>
<'c',NAME>
<',',COMMA>
<'d',NAME>
<']',RBRACK>

ListLexer.php

<?php
 
require_once('lexer.php');
 
class ListLexer extends Lexer {
    const NAME      = 2;
    const COMMA     = 3;
    const LBRACK    = 4;
    const RBRACK    = 5;
    static $tokenNames = array("n/a", "<EOF>",
                               "NAME", "COMMA",
                               "LBRACK", "RBRACK" );
 
    public function getTokenName($x) {
        return ListLexer::$tokenNames[$x];
    }
 
    public function ListLexer($input) {
        parent::__construct($input);
    }
 
    public function isLETTER() {
        return $this->c >= 'a' &&
               $this->c <= 'z' ||
               $this->c >= 'A' &&
               $this->c <= 'Z';
    }
 
    public function nextToken() {
        while ( $this->c != self::EOF ) {
            switch ( $this->c ) {
                case ' ' :  case '\t': case '\n': case '\r': $this->WS();
                           continue;
                case ',' : $this->consume();
                           return new Token(self::COMMA, ",");
                case '[' : $this->consume();
                           return new Token(self::LBRACK, "[");
                case ']' : $this->consume();
                           return new Token(self::RBRACK, "]");
                default:
                    if ($this->isLETTER() ) return $this->NAME();
                    throw new Exception("invalid character: " + $this->c);
            }
        }
        return new Token(self::EOF_TYPE,"<EOF>");
    }
 
    /** NAME : ('a'..'z'|'A'..'Z')+; // NAME is sequence of >=1 letter */
    public function NAME() {
        $buf = '';
        do {
            $buf .= $this->c;
            $this->consume();
        } while ($this->isLETTER());
 
        return new Token(self::NAME, $buf);
    }
 
    /** WS : (' '|'\t'|'\n'|'\r')* ; // ignore any whitespace */
    public function WS() {
        while(ctype_space($this->c)) {
            $this->consume();
        }
    }
}
 
?>

The main lexer file.
Lexer.php

<?php
 
abstract class Lexer {
 
    const EOF       = -1; // represent end of file char
    const EOF_TYPE  = 1;  // represent EOF token type
    protected $input;     // input string
    protected $p = 0;     // index into input of current character
    protected $c;         // current character
 
    public function Lexer($input) {
        $this->input = $input;
        // prime lookahead
        $this->c = substr($input, $this->p, 1);
    }
 
    /** Move one character; detect "end of file" */
    public function consume() {
        $this->p++;
        if ($this->p >= strlen($this->input)) {
            $this->c = Lexer::EOF;
        }
        else {
            $this->c = substr($this->input, $this->p, 1);
        }
    }
 
    public abstract function nextToken();
    public abstract function getTokenName($tokenType);
}
 
?>

Token.php

<?php
 
class Token {
    public $type;
    public $text;
 
    public function Token($type, $text) {
        $this->type = $type;
        $this->text = $text;
    }
 
    public function __toString() {
        $tname = ListLexer::$tokenNames[$this->type];
        return "<'" . $this->text . "'," . $tname . ">";
    }
}
 
?>

A simple parser

A parser for our simple language is given below along with the test script. This uses the Token and Lexer classes we defined earlier.

test_parser.php

<?php
 
require_once('ListLexer.php');
require_once('Token.php');
require_once('ListParser.php');
 
 
$lexer = new ListLexer($argv[1]);
$parser = new ListParser($lexer);
$parser->rlist(); // begin parsing at rule list
 
?>

The parser takes our input language as a argument from the command-line. Passing a correct string to the parser will return nothing, as we have not implemented any language processing code. Passing an incorrect string will throw an exception.

D:\localhost\parser>php test_parser.php [a,,c,d]
PHP Fatal error:  Uncaught exception 'Exception' with message 
'Expecting name or list : Found <',',COMMA>'

ListParser.php

<?php
 
require_once('Parser.php');
 
class ListParser extends Parser {
    public function ListParser(Lexer $input) {
        parent::__construct($input);
    }
 
    /** list : '[' elements ']' ; // match bracketed list */
    public function rlist() {
        $this->match(ListLexer::LBRACK);
        $this->elements();
        $this->match(ListLexer::RBRACK);
    }
    /** elements : element (',' element)* ; */
    function elements() {
        $this->element();
        while ($this->lookahead->type == ListLexer::COMMA ) {
            $this->match(ListLexer::COMMA);
            $this->element();
        }
    }
    /** element : name | list ; // element is name or nested list */
    function element() {
        if ($this->lookahead->type == ListLexer::NAME ) {
            $this->match(ListLexer::NAME);
        }
        else if ($this->lookahead->type == ListLexer::LBRACK) {
            $this->rlist();
        }
        else {
            throw new Exception("Expecting name or list : Found "  . 
                                 $this->lookahead);
        }
    }
}
 
?>

Parser.php

<?php
 
abstract class Parser {
    public $input;     // from where do we get tokens?
    public $lookahead; // the current lookahead token
 
    public function Parser(ListLexer $input) {
        $this->input = $input;
        $this->consume();
    }
 
    /** If lookahead token type matches x, consume & return else error */
    public function match($x) {
        if ($this->lookahead->type == $x ) {
            $this->consume();
        } else {
            throw new Exception("Expecting token " .
                                $this->input->getTokenName($x) .
                                ":Found " . $this->lookahead);
        }
    }
    public function consume() {
        $this->lookahead = $this->input->nextToken();
    }
}
 
?>

After working through the code, you may realize that PHP is not a very good candidate for developing programs of such kind. Even though you can accomplish the task in hand, developing a much more complex parser design in PHP can be a challenging task.

If you want to play with the code, you can download it from below.

Download code files
Downloads : 2874 / File size : 2.8 kB

This site is a digital habitat of Sameer Borate, a freelance web developer working in PHP, MySQL and WordPress. I also provide web scraping services, website design and development and integration of various Open Source API's. Contact me at metapix[at]gmail.com for any new project requirements and price quotes.

4 Responses

1

Kyle

November 17th, 2011 at 4:13 pm

I wasn’t going to leave a message, until I realized that your CAPTCHA is displayed as plain text and asks the user to solve an arithmetic problem.

You could write an arithmetic lexer and parser to analyze the natural language and return the answer to the problem. Natural words could be handled as inline comments, whereas numbers and operators are treated as they should be; thus rendering it an ineffective CAPTCHA. I thought that was pretty ironic.

Thanks for the article. I had been toying around with the idea of writing a DSL for a project. But every time I started down that path, I backed away due to scope creep. Hopefully I can get around to it during spare time.

Take care.

2

koubel

November 21st, 2011 at 4:30 am

This (maybe a little) modified approach can be (and some times in good projects is) used for the template implementation. For example Twig -templates uses this approach – https://github.com/fabpot/Twig

3

informatico

November 23rd, 2011 at 5:01 am

Why the projects software fail

The article reflects very well 50 % of the reasons or origins of the failures of a project software. But other one is absent 50 % about which one never speaks: the economy. Without an economic previous serious study on having begun a project it has 50 % of probabilities of which it fails independently of the arguments exposed in his article.

4

Kely

September 12th, 2012 at 4:38 am

Very goog article, A PHP parser is slightly slower but is a real time saver for the building.

Thank you!

Your thoughts

Sign up for fresh content in your email