This is the second in a series of posts explaining how to use the metaprogramming features of meta-cedille. If you haven't read the previous post, this one might not make much sense.

This time, I'll be talking about a better function definition syntax, strings, and parsing unrestricted grammars. The function definition syntax is ML-like. Here is an example of the syntax we want to end up with:

b-let identity [X : *] (x : X) : X := x.

The main advantage of this notation is that we can get a type annotated definition without having to repeat types. The above example will de-sugar to:

let identity := Λ X : * λ x : X x : ∀ X : * Π x : X X.

Let's quickly discuss how this works. After parsing the syntax, we will have access to a list of parameters (a parameter is a triple consisting of a flag deciding whether the parameter is erased, a name and a type) and a type, in addition to the name and the term representing the body. The list of parameters will then be converted to a stack of lambdas and a stack of pis/foralls, which will be prepended to the body and the type respectively. So let's implement it.
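Before diving into the meta-cedille code, here is the desugaring algorithm sketched in Python (a hypothetical model; the names and the string representation of terms are made up for illustration):

```python
# Hypothetical model of the b-let desugaring. A parameter is a triple
# (erased, name, type); terms are plain strings for illustration.
def desugar_blet(name, params, result_type, body):
    term, type_ = body, result_type
    # Fold from the right so that the first parameter ends up outermost.
    for erased, pname, ptype in reversed(params):
        lam, pi = ("Λ", "∀") if erased else ("λ", "Π")
        term = f"{lam} {pname} : {ptype} {term}"
        type_ = f"{pi} {pname} : {ptype} {type_}"
    return f"let {name} := {term} : {type_}."

# The identity example from above:
print(desugar_blet("identity", [(True, "X", "*"), (False, "x", "X")], "X", "x"))
# → let identity := Λ X : * λ x : X x : ∀ X : * Π x : X X.
```

The printed result is exactly the de-sugared definition shown above.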

First, for better readability, I'll use a nicer syntax for function application. It is implemented here, and the syntax is γ[f x1 ... xn], where ?xk denotes an erased application. The original source doesn't use this syntax, because the file that introduces it already uses the better let definition, but we can use it here.

let init$param$_name__space_=colon=_space__term_ :=
  λ n : init$name λ _ : init$space λ _ : init$space λ t : init$term
    γ[mkPreParam n t].

This is the rule that parses the interior of the parentheses, which should be quite straightforward. mkPreParam is the constructor for a parameter that does not yet carry the erasure information.
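As a hypothetical Python model of the two constructors (mkParam appears in the parsing rules below): a pre-parameter is just a name/type pair, and the full parameter additionally carries the erasure flag.

```python
# Hypothetical model: a PreParam is a name/type pair, a Param additionally
# carries the erasure flag.
def mk_pre_param(name, type_):
    return (name, type_)

def mk_param(erased, pre):
    name, type_ = pre
    return (erased, name, type_)

# "[X : *]" yields an erased parameter, "(x : X)" an unerased one:
params = [mk_param(True, mk_pre_param("X", "*")),
          mk_param(False, mk_pre_param("x", "X"))]
```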

let init$params := [List Param].
let init$params$=lparen=_param_=rparen=_space__params_ :=
  λ p : PreParam λ _ : init$space γ[cons ?Param γ[mkParam false p]].
let init$params$=lsquare=_param_=rsquare=_space__params_ :=
  λ p : PreParam λ _ : init$space γ[cons ?Param γ[mkParam true p]].
let init$params$ := <nil Param>.

These rules parse a list of parameters. This should be straightforward as well: mkParam is the constructor for a full parameter; we pass false to mkParam for an unerased parameter when it was parsed with surrounding parentheses, and true when it was parsed with surrounding square brackets. To turn this list of parameters into the corresponding lambda abstractions, we use the following functions:

let paramToLambda := λ t : init$term λ p : Param
  γ[γ[ifthenelse ?(Π _ : init$name Π _ : init$term Π _ : init$term init$term)
      [paramErased p]
      quote$Lambda
      quote$lambda]
    [paramName p] [paramType p] t]
  : Π _ : init$term Π _ : Param init$term.

let foldWithLambdas :=
  γ[foldl ?Param ?init$term paramToLambda]
  : Π _ : init$params Π _ : init$term init$term.

First, the type of quote$lambda (and all other such abstractions) is Π _ : init$name Π _ : init$term Π _ : init$term init$term. After filling in a parameter, the type would be Π _ : init$term init$term, so we want a function of type Π _ : Param Π _ : init$term init$term. This is what paramToLambda does, except that the order of arguments is swapped. paramErased, paramName and paramType are the three projections of the parameter, so paramToLambda applies quote$lambda or quote$Lambda (depending on the value of [paramErased p]) to [paramName p] and [paramType p]. Then, foldWithLambdas applies this function recursively to the list of parameters. The function foldWithPi is defined in exactly the same way, except that it uses quote$forall and quote$Pi. Finally, the last parsing rule is defined as follows (sorry for the long name):

let init$stmt$b=minus=let_space__name__space__params_=colon=_space__term__space_=colon==equal=_space'__term_=dot= :=
  λ _ : init$space λ n : init$name λ _ : init$space
  λ params : init$params λ _ : init$space λ type : init$term
  λ _ : init$space λ _ : init$space' λ term : init$term
    γ[quote$annotatedLet n
      γ[foldWithLambdas params term]
      γ[foldWithPi params type]].

Note that this is a parsing rule for stmt, so this is an example of how to define a new top-level statement. The function quote$annotatedLet takes three arguments, n, t and t', and returns the quoted statement let n := t : t'. After setting the evaluator, we can run the example from above and see it in action. It would of course be nice to improve this syntax a little, for example by allowing multiple names to be bound to a type in one go (as in (a b : T)). PRs for this are welcome.

Let’s continue with strings. Strings are of course encased in double quotes:

let init$term$=doublequote=_newstring_=doublequote= :=
  quote$String : Π _ : init$string init$term.

We will get to the function quote$String later. The interior of the double quotes is either empty or a character followed by the same nonterminal:

let init$newstring$ := init$string$nil : init$string.
let init$newstring$_stringchar__newstring_ :=
  init$string$cons : Π _ : init$char Π _ : init$string init$string.

So what remains is to parse an (almost) arbitrary single character, which is done by the following syntax:

let init$stringchar$!=doublequote=! := λ c : init$char c.

The exclamation points encase a list of rules that match a single character, and negate these rules. After parsing the character, it is passed to the function. In this case, it matches every character except a double quote. So, after parsing a string, we get an init$string, which is essentially a list of init$char values, and we have to transform it into an init$term, which is where the above-mentioned quote$String function comes into play. It is defined as follows:

b-let quote$String (s : init$string) : init$term :=
  γ[s ?init$term
      λ c : init$char λ rec : init$term
        γ[quote$square γ[quote$square quote$stringCons [quote$char c]] rec]
      quote$stringNil].

The type init$string is defined with a Church encoding, so its elements are functions that pattern match on themselves. This can be seen in the type of γ[s ?init$term], which, after applying some syntactic sugar, is:

(init$char -> init$term -> init$term) -> init$term -> init$term

This means, if s is the empty string, γ[s ?init$term f n] evaluates to n, and if s is c followed by s', it evaluates to γ[f c γ[s' ?init$term f n]]. Thus, in the case of quote$String, if we give the empty string to it, it evaluates to quote$stringNil and if it is of the form c followed by s, it evaluates to γ[quote$square γ[quote$square quote$stringCons [quote$char c]] [quote$String s]] which represents [[stringCons c] s]. The definitions of quote$stringNil and quote$stringCons are quite tedious and not repeated here.
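The same idea is easy to mimic in Python (a hypothetical sketch, not the actual meta-cedille encoding): a Church-encoded string is represented by its own fold function, so the analogue of quote$String is just an application of the string to a "cons" case and a "nil" case.

```python
# A Church-encoded string is its own fold: it takes a "cons" case and a
# "nil" case and eliminates itself.
def church_nil(cons, nil):
    return nil

def church_cons(c, s):
    return lambda cons, nil: cons(c, s(cons, nil))

def church_of(pystr):
    # Build a Church string from a Python string.
    s = church_nil
    for c in reversed(pystr):
        s = church_cons(c, s)
    return s

# Analogue of quote$String: fold the string into a quoted term, here
# rendered as a plain string of the form [[stringCons c] rec].
def quote_string(s):
    return s(lambda c, rec: f"[[stringCons {c}] {rec}]", "stringNil")

print(quote_string(church_of("ab")))
# → [[stringCons a] [[stringCons b] stringNil]]
```

The empty string folds directly to stringNil, mirroring the two cases described above.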

Note that this definition of the string syntax does not feature an escaping mechanism that would allow a double quote to be typed within a string. This was done mostly for simplicity of the implementation, and adding one is left as an exercise to the reader (PRs welcome). To parse everything except a double quote and a backslash character, use !=doublequote=&=backslash=!. A syntax for single characters would also be nice (parsing single quotes is done with just a ' character instead of the escape mechanism).

So now it should be somewhat clear how to define new notations that extend the existing syntax. Defining a syntax for lists is not too difficult (see here), while defining a notation for inductive datatypes is probably challenging; but given a good understanding of Cedille, it should still be mostly straightforward.

For the remainder of this post, I want to discuss how to modify the evaluator of the language which is a very powerful feature that can be used to run arbitrary code after the parsing step.

First, I want to give a brief introduction to the idea of the evaluator. In the previous examples, parsing always had to result in something of type init$stmt, which is essentially a built-in type. One of the goals of this system was that everything should be modifiable by the user, so this type should not be fixed. To achieve this, we should be able to replace it with another type, say A, such that elements of A can be turned into executions of the system. This is done with the meta monad, denoted by ω, which is a monad that can run executions in the environment. It has the monadic operations ε (return/pure) and μ (bind), as well as an operation ξ that executes statements (of type init$stmt) and returns something of type init$metaResult (a type that captures errors and other things that the execution of a statement might return). So if t is of type init$stmt, then ξ t is of type ω init$metaResult.

A function f : A -> ω B is thus a function that can execute statements based on its input and return some value of type B, and such a function can be registered as the evaluator with the seteval statement. Then, if the parsing step returns some term t of type A, the environment will execute the value [f t]. Note that the seteval statement does not in any way change any previous definition; it just changes the evaluator that is used to evaluate statements. It is not possible to use the pointer to the current evaluator in a definition.
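As a rough Python model (the representation and all names are hypothetical), one can think of an ω-computation as a function from the environment to a result value: ε wraps a pure value, μ sequences two computations, and ξ executes a statement and yields a meta result.

```python
# Toy model of the meta monad ω: a computation is a function from an
# environment (here: a dict of definitions) to a result value.
def epsilon(value):                  # ε: return/pure
    return lambda env: value

def mu(m, f):                        # μ: bind, sequencing two computations
    return lambda env: f(m(env))(env)

def xi(stmt):                        # ξ: execute a statement, yield a meta result
    def run(env):
        name, term = stmt            # model a statement as a (name, term) pair
        if name in env:
            return f"error: {name} is already defined"
        env[name] = term
        return f"defined {name}"
    return run

# Sequencing a statement execution with a pure return value:
prog = mu(xi(("dummy", "dummy")), lambda _result: epsilon(0))
print(prog({}))  # → 0
```

An evaluator of type A -> ω B is then simply a function producing such computations from parse results.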

The function that has been used as the evaluator up to this point is defined as follows:

let eval := λ s : init$stmt ξ s.

To show off what is possible by changing the evaluator, I will demonstrate how to work around the limitations of the parser in order to parse a language that is not context-free. This is just a toy example, but it generalizes to parsing arbitrary languages. The language we want to parse is the language of all strings over the letters a, b and c in which all three letters appear equally often. This language is known to be context-sensitive, so in particular it is not an LL(1) grammar of the kind the built-in parser can handle. To do this, we use a variation of the parsing rules for strings:

let csgdemo$char$a := quote$a : init$char.
let csgdemo$char$b := quote$b : init$char.
let csgdemo$char$c := quote$c : init$char.

let csgdemo$str := [List init$char].
let csgdemo$str$_char__str_ := <cons init$char>.
let csgdemo$str$ := <nil init$char>.

So, str is a nonterminal that parses strings consisting only of the letters a, b and c, and returns the parsed characters as a list. Next, we define the evaluator:

b-let csgdemoeval (l : [List init$char]) : ω Nat :=
  γ[ifthenelse ?(ω Nat) [abcEqual l]
    ε γ[countOccurencesChar quote$a l]
    μ [eval triggerError] λ _ : init$metaResult ε zero].

Just to make it clear how ε and μ work, in a Haskell-like syntax this would read:

csgdemoeval l =
  if abcEqual l
    then return (countOccurencesChar quote$a l)
    else (eval triggerError) >> return zero

The function abcEqual takes a string and returns true iff the characters a, b and c appear equally often. countOccurencesChar takes a character and a string, and returns how often that character appears in the string. triggerError is of type init$stmt and always triggers an error when executed (it represents let dummy := dummy., which will either complain that the name dummy is undefined, or that the name dummy is already defined). So, csgdemoeval takes the string, and if the letter counts are equal, it does nothing and returns how often each letter appeared. Otherwise, it triggers an error and returns the number 0 afterwards. Finally, we set the evaluator:

seteval csgdemoeval csgdemo str.

This uses csgdemoeval as the evaluator, str as the starting nonterminal and csgdemo as the namespace.
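A hypothetical Python model of the helper functions makes the acceptance condition explicit (the triggered error is modelled by raising an exception):

```python
def count_occurrences_char(c, s):
    # countOccurencesChar: how often the character c appears in s
    return sum(1 for x in s if x == c)

def abc_equal(s):
    # abcEqual: True iff 'a', 'b' and 'c' occur equally often in s
    return (count_occurrences_char('a', s)
            == count_occurrences_char('b', s)
            == count_occurrences_char('c', s))

def csgdemo_eval(s):
    # Accept by returning the letter count; model the triggered error
    # with an exception.
    if abc_equal(s):
        return count_occurrences_char('a', s)
    raise ValueError("parse rejected: letter counts differ")

print(csgdemo_eval("abcbca"))  # → 2
```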

This is of course a very simple example, but it shows that it is possible to run arbitrary code to accept or reject parse results. This means one could write a parser library that can handle stronger grammars than the built-in parser and use that library to write parsing rules. At the same time, this demonstrates how previous definitions can be hidden: after this seteval statement, there is no way to execute another seteval command, so it is impossible to go back.