Introduction
This is a quick look at parsing in clojure. First using instaparse and then writing the lexer and parser by hand. The comparison should illustrate how great instaparse is but also show that writing a simple lexer & parser is not as complex as some would think.
BTW this is my first clojure project so I may have got some of the idioms in the code incorrect. I'll update the code samples based on feedback here and on the project repo in github :)
The demo project
To demonstrate instaparse I'll be implementing a simple external DSL. The DSL should have the following characteristics
- Expressions written as sexprs
- External DSL - I'm not interested in using the clojure reader to read the sexpr for this demo
- Constrained - functions can only be defined in clojure not in the DSL itself. The functions available to the DSL must be strictly controlled.
Instaparse
Using instaparse
Instaparse (https://github.com/Engelberg/instaparse) is a clojure library for generating a parser (and lexer) from a EBNF/ABNF. It is one of the easiest parser generators I've used, I highly recommend giving it a try.The grammar
The instaparse page has a nice introduction to the grammar syntax. Start there if you are not familiar with EBNF.Here is the grammar that I'll be parsing
S = (expression
expression = list | vector | atom
list = <'('>
vector = <'['> (expression
atom = number | string | name
number = #'\d+'
string = <'"'> #'[^\"]+' <'"'>
name = #'[a-zA-Z\+-]([0-9a-zA-Z\+-]*)'
ws = #'\s+'
This is pretty standard EBNF. Some things to note
- Wrap an element in angle brackets to remove it from the output e.g.
- Match literal characters with single quotes. e.g. '('
- Regular expressions using #'regex'
- Remember to escape regex characters correctly. See the code example for the correct escaping
Again the instaparse page has a nice introduction that covers all of this.
The output parse tree
The output parse tree from instaparse can be in hiccup or enliven format. I'll be using the default hiccup format.
As an example here is the output for "(+ 1 2 3) 4"
[:S
[:expression
[:list
[:expression [:atom [:name "+"]]]
[:expression [:atom [:number "1"]]]
[:expression [:atom [:number "2"]]]
[:expression [:atom [:number "3"]]]]]
[:expression [:atom [:number "4"]]]]
Instaparse can visualise a parse tree using graphviz and rhizome (see https://github.com/Engelberg/instaparse#visualizing-the-tree). E.g. for the parse tree above you get this
Interpreting the parse tree
There are several ways to interpret the output from instaparse, e.g. using zippers or using the built in instaparse transformation function. However I chose to use simple recursive functions since it is so simple.(defmulti run (fn [s] (nth s 0))) (defmethod run :S [[s & es]] (last (doall (map run es)))) (defmethod run :expression [[e t]] (run t)) (defmethod run :atom [[a t]] (run t)) (defmethod run :number [[n val]] (read-string val)) (defmethod run :string [[s val]] val) (defmethod run :vector [[v & vs]] (vec (map run vs))) (defmethod run :name [[n & nn]] (first nn)) (defmethod run :list [[l n & ls]] (let [args (map run ls)] (apply (methods (run n)) args)))
The multimethod's dispatch function gets the first item from each vector. Look at the parse tree above, you'll see that this will always be the type of the current element (:S or :expression or :number etc)
Each method then is responsible for destructuring its element type. E.g. the :number method must parse the number and return a string. The :vector method must return a vector. Each method calls the run multimethod recursively to get the lowest level atom
Notice that the :S method calls last on doall, which is called to force evaluation of the whole lazy seq. last is called to get the last value. I.e. the parser will return the last value evaluated just as clojure would.
The :list method is where the interpreter actually "runs" functions called by the DSL.
(defmethod run :list [[l n & ls]] (let [args (map run ls)] (apply (methods (run n)) args)))
The parameters [ [l n & ls] ] destructure the incoming element into
- l = the :list
- n = the name of the function as a :name element
- s = the method arguments
Remember that a list is executed by treating the first expression as the function and the rest as the arguments to that function.
Once we have the arguments they must be evaluated by calling run for each argument
(map run ls)
We get the name of the function to run
(run n)
We look up the actual function to call in the methods map. It is this map that lets us control exactly which functions can be called. All together it looks like this
(let [args (map run ls) (apply (methods (run n)) args)))
Full sample code
Here is the full code for the DSL parser and interpreter using instaparse(ns cljsexp-instaparse.core (:require [instaparse.core :as insta])) (def parse (insta/parser "S = (expression)* expression = list | vector | atom list = <'('> (expression )* <')'> vector = <'['> (expression )* <']'> atom = number | string | name number = #'\\d+' string = <'\"'> #'[^\\\"]+' <'\"'> name = #'[a-zA-Z\\+-]([0-9a-zA-Z\\+-]*)' ws = #'\\s+'")) (def methods {"+" + "-" - "*" * "/" / "++" inc "--" dec "prn" println}) (defmulti run (fn [s] (nth s 0))) (defmethod run :S [[s & es]] (last (doall (map run es)))) (defmethod run :expression [[e t]] (run t)) (defmethod run :atom [[a t]] (run t)) (defmethod run :number [[n val]] (read-string val)) (defmethod run :string [[s val]] val) (defmethod run :vector [[v & vs]] (vec (map run vs))) (defmethod run :name [[n & nn]] (first nn)) (defmethod run :list [[l n & ls]] (let [args (map run ls)] (apply (methods (run n)) args)))
Conclusion - instaparse
Instaparse is amazing. It makes writing a parser very easy indeed. A simple sexp parser and interpreter in less that 40 lines of clojure is a great result.A simple recursive descent parser
Writing the lexer and parser by hand is an interesting exercise as it shows that its not too hard to do. However in my opinion it also shows how much simpler instaparse makes things even for simple projects.
For what it is worth note that there is no mutable state in this code. All the functions are pure. This made testing very easy.
Lexing
Lexing or tokenising a string is the process of converting the characters from the source code into higher level tokens (equivalent to taking individual letters and making words).
E.g. taking this character stream
| | | | | | | | | | | | |
| ( | i | f | | ( | a | n | d | ( | a | b | c |
| | | | | | | | | | | | |
And creating these tokens
left-paren, if, left-paren, and, left-paren, abc
Each token has meta-data associated with it. Such as the line and column in the source file and the type of token (string vs name vs paren etc).
Tokenising the input means that the parser does not need to deal with individual characters but rather can work with higher level tokens. This greatly simplifies the design as the concerns of lexing the input and parsing the resulting tokens can be separated. In a recursive descent parser you could lex the next token on demand rather than lex everything first as I have here.
NB remember that the output of the tokeniser is a flat list of tokens. No meaning has yet been inferred from the source code
In the code above each token has the following clojure structure
{:type :xxx, :val xxx, :line xxx, :col xxx, :expressions []}
Each token has a
- Type (e.g. name/string/list)
- Value (e.g. the numeric or string value of the text)
- The line and column number that the token started in the source file
- A place holder for nested expressions
Matching the next token
(def tokenMap {:byChar { \( :lparen \) :rparen, \[ :lbracket, \] :rbracket} :byRegex { #"'" parseString #"\d+" parseNumber #"[a-zA-Z\+\-\*\\\/\?_\$\<\>=]" parseName #";" parseComment }})
Here there are two maps. The first identifies single character tokens such as brackets or parentheses. The second uses a regular expression to match the first letter of a token and defines the function that gets called to tokenise it.
For example if the tokeniser gets a semi-colon it calls the parseComment function which calls the parseRegex helper function. Below you can see these two methods. When a semi-colon is found the regex will match to the end of the line and the current position will be moved (moveRight) by the number of matched characters.
(defn parseRegex [state, typeName, token, re] (let [s (subs (currentLine state) (:col state)) val (re-find re s)] ;Does the remainder of the line match the regex - it should! (if val (assoc (moveRight state (count val)) :token token :val val) (throw (Exception. (str "Failed to parse " typeName)))))) (defn parseComment [state] (parseRegex state "comment" :comment #";.*"))
Moving in the input stream
Below is the moveRight function which moves right in the input stream. Notice that this takes the current position in a state argument and returns a new state as a result. I.e. nothing is mutated.(defn moveRight [state by] "Move current position 1 char to the right, roll over to next line if required" (let [updated (assoc state :col (+ (:col state) by) )] (let [line (currentLine state)] (if (< (:col updated) (count line)) ;Still space on current line, return it updated ;Move to next line (assoc state :col 0 :line (inc (:line state)))))))
Running the tokeniser
Finally here are the two functions that control the tokenising
(defn- nextToken [state] "Gets the next token" (let [c (currentChar state)] (cond (nil? c) (clearToken state) ;Ignore white space (Character/isSpaceChar c) (recur (moveRight state 1)) ;Check if a token can be found in the token map by character :else (if-let [token ((:byChar tokenMap) c)] (assoc (moveRight state 1) :token token :val c) ;Nothing found so now search by regex ; Get the function associated with the first regex that matches and call that (if-let [r (first (filter #(re-matches (% 0) (str c)) (:byRegex tokenMap)))] ((r 1) state) (throw (Exception. (str "dont understand next token - " c state)))))))) (defn- tokenise [state] (loop [nextState (nextToken state), tokens []] (if (= :none (:token nextState)) tokens (recur (nextToken nextState) (conj tokens {:line (:line nextState), :col (:col nextState), :val (:val nextState),:type (:token nextState)})))))
nextToken gets 1 next token
tokenise repeatedly calls nextToken until the whole input stream has been tokenised
Parsing
At this point the lexer has lexed the entire file and the parser can now parse the token stream.
The function that runs the parser is parseAll
(defn- parseAll [allTokens] (loop [expressions [], tokens allTokens] (let [r (parseExpression (first tokens) (rest tokens))] (if (= 0 (count (:expr r))) expressions (recur (conj expressions (:expr r)) (:tokens r))))))
But all the work is actually done in parseExpression. This is quite a long function that is just a large case statement. Not pretty but reasonably clear, hopefully.
(defn- parseExpression [token tokens] (case (:type token) (nil '()) [nil tokens] :name {:expr {:type :name, :val (:val token), :line (:line token), :col (:col token), :expressions []} :tokens tokens} :string {:expr {:type :string, :val (:val token), :line (:line token), :col (:col token), :expressions []} :tokens tokens} :number {:expr {:type :number, :val (read-string (:val token)), :line (:line token), :col (:col token), :expressions []} :tokens tokens} (:lparen :lbracket) (let [grp (if (= :lparen (:type token)) {:start :lparen, :end :rparen, :type :list} {:start :lbracket, :end :rbracket, :type :vector})] (loop [expressions [] [loopToken & loopTokens] tokens] (let [type (:type loopToken)] (cond (or (nil? token) (= '() token)) (throw (Exception. (str "EOF waiting for :rparen"))) (= (:end grp) type) {:expr {:type (:type grp) :val (:type grp) :line (:line token) :col (:col token) :expressions expressions} :tokens loopTokens} :else (let [r (parseExpression loopToken loopTokens)] (recur (conj expressions (:expr r)) (:tokens r)))))))))
This function is switching on the first token and returning the matched token and the remaining tokens. For example when it gets a :name it returns an :expr of type :name and returns the rest of the tokens.
When parseExpression gets lparen or lbracket it will recursively loop through the tokens until the end of the list or vector, returning the tokens that have not been consumed.
Interpreting the syntax tree
Evaluating the syntax tree is similar to the code in the instaparse Version(declare eval) (defmulti run (fn [x] (:type x))) (defmethod run :string [e] (:val e)) (defmethod run :number [e] (:val e)) (defmethod run :name [e] (:val e)) (defmethod run :vector [e] (vec (map run (:expressions e)))) (defmethod run :list [e] (do (let [f (run (first (:expressions e))) args (map run (rest (:expressions e)))] (apply (get funcs f) args)))) (defmethod run :default [e] (println "unknown: " e)) (defn eval [[car & cdr]] (let [r (run car)] (if (empty? cdr) r (recur cdr))))
Again a multimethod is used to recursively evaluate the syntax tree and as with the instaparse code only functions defined in the 'funcs' map may be executed.
Conclusion - hand written
The hand written lexer and parser are a lot longer than just using instaparse. However it is not that complicated to do manually. Personally I'll be using instaparse for 99% of my Clojure DSL needs but it is always good to know how to do it manually.
The full source code
(ns cljsexp-simple.core (def funcs {"prn" println "+" +}) ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; (defn currentLine [state] "Gets the current line" (get (:code state) (:line state))) (defn currentChar [state] "Gets the current charater" (get (currentLine state) (:col state))) (defn moveRight [state by] "Move current position 1 char to the right, roll over to next line if required" (let [updated (assoc state :col (+ (:col state) by) )] (let [line (currentLine state)] (if (< (:col updated) (count line)) ;Still space on current line, return it updated ;Move to next line (assoc state :col 0 :line (inc (:line state))))))) (defn parseRegex [state, typeName, token, re] (let [s (subs (currentLine state) (:col state)) val (re-find re s)] ;Does the remainder of the line match the regex - it should! (if val (assoc (moveRight state (count val)) :token token :val val) (throw (Exception. (str "Failed to parse " typeName)))))) (defn parseName [state] (parseRegex state "name" :name #"[a-zA-Z\+\-\*\\\/\?_\$\<\>=]+")) (defn parseComment [state] (parseRegex state "comment" :comment #";.*")) (defn parseString [state] (parseRegex state "string" :string #"'[^']+'")) (defn parseNumber [state] (parseRegex state "number" :number #"\d+")) (def tokenMap {:byChar { \( :lparen \) :rparen, \[ :lbracket, \] :rbracket} :byRegex { #"'" parseString #"\d+" parseNumber #"[a-zA-Z\+\-\*\\\/\?_\$\<\>=]" parseName #";" parseComment }}) (defn clearToken [state] (assoc state :token :none :val :none)) (defn- nextToken [state] "Gets the next token" (let [c (currentChar state)] (cond (nil? c) (clearToken state) ;Ignore white space (Character/isSpaceChar c) (recur (moveRight state 1)) ;Check if a token can be found in the token map by character :else (if-let [token ((:byChar tokenMap) c)] (assoc (moveRight state 1) :token token :val c) ;Nothing found so now search by regex ; Get the function associated with the first regex that matches and call that (if-let [r (first (filter #(re-matches (% 0) (str c)) (:byRegex tokenMap)))] ((r 1) state) (throw (Exception. (str "dont understand next token - " c state)))))))) (defn- tokenise [state] (loop [nextState (nextToken state), tokens []] (if (= :none (:token nextState)) tokens (recur (nextToken nextState) (conj tokens {:line (:line nextState), :col (:col nextState), :val (:val nextState),:type (:token nextState)}))))) (defn- parseExpression [token tokens] (case (:type token) (nil '()) [nil tokens] :name {:expr {:type :name, :val (:val token), :line (:line token), :col (:col token), :expressions []} :tokens tokens} :string {:expr {:type :string, :val (:val token), :line (:line token), :col (:col token), :expressions []} :tokens tokens} :number {:expr {:type :number, :val (read-string (:val token)), :line (:line token), :col (:col token), :expressions []} :tokens tokens} (:lparen :lbracket) (let [grp (if (= :lparen (:type token)) {:start :lparen, :end :rparen, :type :list} {:start :lbracket, :end :rbracket, :type :vector})] (loop [expressions [] [loopToken & loopTokens] tokens] (let [type (:type loopToken)] (cond (or (nil? token) (= '() token)) (throw (Exception. (str "EOF waiting for :rparen"))) (= (:end grp) type) {:expr {:type (:type grp) :val (:type grp) :line (:line token) :col (:col token) :expressions expressions} :tokens loopTokens} :else (let [r (parseExpression loopToken loopTokens)] (recur (conj expressions (:expr r)) (:tokens r))))))))) (defn- parseAll [allTokens] (loop [expressions [], tokens allTokens] (let [r (parseExpression (first tokens) (rest tokens))] (if (= 0 (count (:expr r))) expressions (recur (conj expressions (:expr r)) (:tokens r)))))) (defn parse [code] (let [tokens (tokenise {:code code, :line 0, :col 0, :val :none, :token :none}) result (parseAll tokens)] result)) ;;;;;;;;;;;;;;;;;;;;;;; (declare eval) (defmulti run (fn [x] (:type x))) (defmethod run :string [e] (:val e)) (defmethod run :number [e] (:val e)) (defmethod run :name [e] (:val e)) (defmethod run :vector [e] (vec (map run (:expressions e)))) (defmethod run :list [e] (do (let [f (run (first (:expressions e))) args (map run (rest (:expressions e)))] (apply (get funcs f) args)))) (defmethod run :default [e] (println "unknown: " e)) (defn eval [[car & cdr]] (let [r (run car)] (if (empty? cdr) r (recur cdr))))