Viewed   78 times

I'm currently in the process of building a PHP Parser written in PHP, as no existing parser came up in my previous question. The parser itself works fairly well.

Now obviously a parser by itself does little good (apart from static analysis). I would like to apply transformations to the AST and then compile it back to source code. Applying the transformations isn't much of a problem, a normal Visitor pattern should do.

What my problem currently is, is how to compile the AST back to source. There are basically two possibilities I see:

  1. Compile the code using some predefined scheme
  2. Keep the formatting of the original code and apply 1. only on Nodes that were changed.

For now I would like to concentrate on 1. as 2. seems pretty hard to accomplish (but if you got tips concerning that, I would like to hear them).

But I'm not really sure which design pattern can be used to compile the code. The easiest way I see to implement this, is to add a ->compile method to all Nodes. The drawback I see here, is that it would be pretty hard to change the formatting of the generated output. One would need to change the Nodes itself in order to do that. Thus I'm looking for a different solution.

I have heard that the Visitor pattern can be used for this, too, but I can't really imagine how this is supposed to work. As I understand the visitor pattern you have some NodeTraverser that iterates recursively over all Nodes and calls a ->visit method of a Visitor. This sounds pretty promising for node manipulation, where the Visitor->visit method could simply change the Node it was passed, but I don't know how it can be used for compilation. An obvious idea would be to iterate the node tree from leaves to root and replace the visited nodes with source code. But this somehow doesn't seem a very clean solution?



The problem of converting an AST back into source code is generally called "prettyprinting". There are two subtle variations: regenerating the text matching the original as much as possible (I call this "fidelity printing"), and (nice) prettyprinting, which generates nicely formatted text. And how you print matters depending on whether coders will be working on the regenerated code (they often want fidelity printing) or your only intention is to compile it (at which point any legal prettyprinting is fine).

To do prettyprinting well requires usually more information than a classic parser collects, aggravated by the fact that most parser generators don't support this extra-information collection. I call parsers that collect enough information to do this well "re-engineering parsers". More details below.

The fundamental way prettyprinting is accomplished is by walking the AST ("Visitor pattern" as you put it), and generating text based on the AST node content. The basic trick is: call children nodes left-to-right (assuming that's the order of the original text) to generate the text they represent, interspersing additional text as appropriate for this AST node type. To prettyprint a block of statements you might have the following psuedocode:

     Print("{"}; PrintNewline();
     Call PrettyPrint(Node.children[1]); // prints out statements in block
     Print("}"); PrintNewline();

     do i=1,number_of_children
         Call PrettyPrint(Node.children[i]); Print(";"); PrintNewline(); // print one statement

Note that this spits out text on the fly as you visit the tree.

There's a number of details you need to manage:

  • For AST nodes representing literals, you have to regenerate the literal value. This is harder than it looks if you want the answer to be accurate. Printing floating point numbers without losing any precision is a lot harder than it looks (scientists hate it when you damage the value of Pi). For string literals, you have to regenerate the quotes and the string literal content; you have to be careful to regenerate escape sequences for characters that have to be escaped. PHP doubly-quoted string literals may be a bit more difficult, as they are not represented by single tokens in the AST. (Our PHP Front End (a reengineering parser/prettyprinter represents them essentially as an expression that concatenates the string fragments, enabling transformations to be applied inside the string "literal").

  • Spacing: some languages require whitespace in critical places. The tokens ABC17 42 better not be printed as ABC1742, but it is ok for the tokens ( ABC17 ) to be printed as (ABC17). One way to solve this problem is to put a space wherever it is legal, but people won't like the result: too much whitespace. Not an issue if you are only compiling the result.

  • Newlines: languages that allow arbitrary whitespace can technically be regenerated as a single line of text. People hate this, even if you are going to compile the result; sometimes you have to look at the generated code and this makes it impossible. So you need a way to introduce newlines for AST nodes representing major language elements (statements, blocks, methods, classes, etc.). This isn't usually hard; when visiting a node representing such a construct, print out the construct and append a newline.

  • You will discover, if you want users to accept your result, that you will have to preserve some properties of the source text that you wouldn't normally think to store For literals, you may have to regenerate the radix of the literal; coders having entered a number as a hex literal are not happy when you regenerate the decimal equivalent even though it means exactly the same thing. Likewise strings have to have the "original" quotes; most languages allow either " or ' as string quote characters and people want what they used originally. For PHP, which quote you use matters, and determines which characters in the string literal has to be escaped. Some languages allow upper or lower case keywords (or even abbreviations), and upper and lower case variable names meaning the same variable; again the original authors typically want their original casing back. PHP has funny characters in different type of identifiers (e.g., "$") but you'll discover that it isn't always there (see $ variables in literal strings). Often people want their original layout formatting; to do this you have to store at column-number information for concrete tokens, and have prettyprinting rules about when to use that column-number data to position prettyprinted text where in the same column when possible, and what to do if the so-far-prettyprinted line is filled past that column.

  • Comments: Most standard parsers (including the one you implemented using the Zend parser, I'm pretty sure) throw comments away completely. Again, people hate this, and will reject a prettyprinted answer in which comments are lost. This is the principal reason that some prettyprinters attempt to regenerate code by using the original text (the other is to copy the original code layout for fidelity printing, if you didn't capture column-number information). IMHO, the right trick is to capture the comments in the AST, so that AST transformations can inspect/generate comments too, but everybody makes his own design choice.

All of this "extra" information is collected by a good reenginering parser. Conventional parsers usually don't collect any of it, which makes printing acceptable ASTs difficult.

A more principled approach distinguishes prettyprinting whose purpose is nice formatting, from fidelity printing whose purpose is to regenerate the text to match the original source to a maximal extent. It should be clear that at the level of the terminals, you pretty much want fidelity printing. Depending on your purpose, you can pretty print with nice formatting, or fidelity printing. A strategy we use is to default to fidelity printing when the AST hasn't been changed, and prettyprinting where it has (because often the change machinery doesn't have any information about column numbers or number radixes, etc.). The transformations stamp the AST nodes that are newly generated as "no fidelity data present".

An organized approach to prettyprinting nicely is to understand that virtually all text-based programming language are rendered nicely in terms of rectangular blocks of text. (Knuth's TeX document generator has this idea, too). If you have some set of text boxes representing pieces of the regenerated code (e.g., primitive boxes generated directly for the terminal tokens), you can then imagine operators for composing those boxes: Horizontal composition (stack one box to the right of another), Vertical (stack boxes on top of each other; this in effect replaces printing newlines), Indent (Horizontal composition with a box of blanks), etc. Then you can construct your prettyprinter by building and composing text boxes:

     Box1=PrimitiveBox("{"); Box2=PrimitiveBox("}");
     ChildBox=PrettyPrint(Node.children[1]); // gets box for statements in block
     return ResultBox;

     do i=1,number_of_children
         ResultBox=VerticalBox(ResultBox,HorizontalBox(PrettyPrint(Node.children[i]); PrimitiveBox(";")

The real value in this is any node can compose the text boxes produced by its children in arbitrary order with arbitrary intervening text. You can rearrange huge blocks of text this way (imagine VBox'ing the methods of class in method-name order). No text is spit out as encountered; only when the root is reached, or some AST node where it is known that all the children boxes have been generated correctly.

Our DMS Software Reengineering Toolkit uses this approach to prettyprint all the languages it can parse (including PHP, Java, C#, etc.). Instead of attaching the box computations to AST nodes via visitors, we attach the box computations in a domain-specific text-box notation

  • H(...) for Horizontal boxes
  • V(....) for vertical boxes
  • I(...) for indented boxes)

directly to the grammar rules, allowing us to succinctly express the grammar (parser) and the prettyprinter ("anti-parser") in one place. The prettyprinter box rules are compiled automatically by DMS into a visitor. The prettyprinter machinery has to be smart enough to understand how comments play into this, and that's frankly a bit arcane but you only have to do it once. An DMS example:

block = '{' statements '}' ; -- grammar rule to recognize block of statements
<<PrettyPrinter>>: { V('{',I(statements),'}'); };

You can see a bigger example of how this is done for Wirth's Oberon programming language PrettyPrinter showing how grammar rules and prettyprinting rules are combined. The PHP Front End looks like this but its a lot bigger, obviously.

A more complex way to do prettyprinting is to build a syntax-directed translator (means, walk the tree and build text or other data structures in tree-visted order) to produce text-boxes in a special text-box AST. The text-box AST is then prettyprinted by another tree walk, but the actions for it are basically trivial: print the text boxes. See this technical paper: Pretty-printing for software reengineering

An additional point: you can of course go build all this machinery yourself. But the same reason that you choose to use a parser generator (its a lot of work to make one, and that work doesn't contribute to your goal in an interesting way) is the same reason you want to choose an off-the-shelf prettyprinter generator. There are lots of parser generators around. Not many prettyprinter generators. [DMS is one of the few that has both built in.]

Saturday, September 24, 2022

You can disassemble a binary and get back assembly source, but there is no way to get back your original Objective-C structured source code.

EDIT: You may want to give Hopper a try. I didn't try it personally yet but Mike Ash says it's good.

Saturday, November 19, 2022

Wikipedia has a great overview of how the Visitor pattern works, although the sample implementation that they use is in Java. You can easily port that to Python, though, no?

Basically, you want to implement a mechanism for double dispatch. Each node in your AST would need to implement an accept() method (NOT a visit() method). The method takes, as an argument, a visitor object. In the implementation of this accept() method, you call a visit() method of the visitor object (there will be one for each AST node type; in Java, you'll use parameter overloading, in Python I suppose you can use different visit_*() methods). The correct visitor will then be dispatched with the correct Node type as argument.

Wednesday, October 12, 2022

Here it is.

First, you're gonna buy the ANTLR4 book ;-)

Second, you'll download antlr4 jar and the java grammar (

Then, you can change the grammar a little bit, adding these to the header

grammar Java;

    language = Java;

// starting point for parsing a java file

I'll change a little thing in the grammar just to illustrate something.

    :   (type|'void') Identifier formalParameters ('[' ']')*
        ('throws' qualifiedNameList)?
        (   methodBody
        |   ';'
    :   (type|'void') myMethodName formalParameters ('[' ']')*
        ('throws' qualifiedNameList)?
        (   methodBody
        |   ';'

    :   Identifier

You see, the original grammar does not let you identify the method identifier from any other identifier, so I've commented the original block and added a new one just to show you how to get what you want.

You'll have to do the same for other elements you want to retrieve, like the comments, that are currently being just skipped. That's for you :-)

Now, create a class like this to generate all the stubs

package mypackage;

public class Gen {

    public static void main(String[] args) {
        String[] arg0 = { "-visitor", "/home/leoks/EclipseIndigo/workspace2/SO/src/mypackage/Java.g4", "-package", "mypackage" };


Run Gen, and you'll get some java code created for you in mypackage.

Now create a Visitor. Actually, the visitor will parse itself in this example

package mypackage;


import mypackage.JavaParser.MyMethodNameContext;

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

 * @author Leonardo Kenji Feb 4, 2014
public class MyVisitor extends JavaBaseVisitor<Void> {

     * Main Method
     * @param args
     * @throws IOException
    public static void main(String[] args) throws IOException {
        ANTLRInputStream input = new ANTLRInputStream(new FileInputStream("/home/leoks/EclipseIndigo/workspace2/SO/src/mypackage/")); // we'll
                                                                                                                                                    // parse
                                                                                                                                                    // this
                                                                                                                                                    // file
        JavaLexer lexer = new JavaLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        JavaParser parser = new JavaParser(tokens);
        ParseTree tree = parser.compilationUnit(); // see the grammar ->
                                                    // starting point for
                                                    // parsing a java file

        MyVisitor visitor = new MyVisitor(); // extends JavaBaseVisitor<Void>
                                                // and overrides the methods
                                                // you're interested

     * some attribute comment
    private String  someAttribute;

    public Void visitMyMethodName(MyMethodNameContext ctx) {
        System.out.println("Method name:" + ctx.getText());
        return super.visitMyMethodName(ctx);


and that's it.

You'll get something like

Method name:main
Method name:visitMyMethodName

ps. one more thing. While I was writing this code in eclipse, I've got a strange exception. This is caused by Java 7 and can be fixed just adding these parameters to your compiler (thanks to this link

Monday, October 31, 2022

Attributes are additional values associated with something of central interest. In the case of ASTs, you can think of them as pairs (attribute_name, attribute_value) associated with each AST node, where the attribute name corresponds to some interesting fact-type, and the attribute value corresponds to the actual state of that fact (e.g., "(constants_in_subtree_count,12)").

Inherited and synthesized are terms for describing how the attribute values are computed for each AST node, usually associated with the grammar rule that produces the AST node using its children.

Synthesized attributes are those whose value is computed from attribute values from children nodes, and are being passed up the tree. Often, values of synthesized attributes are combined to produce an attribute for the parent node. If an AST node has two children, each of which have their own attributes (constants_in_subtree_count,5) and (constants_in_subtree_count,7), then by passing those attributes up, the parent can compute his corresponding attribute (constants_in_subtree_count,12).

Inherited attributes are those passed from the parent down to the child. If the root of a function AST "knows" the function return type is (return_type,integer) as an attribute, it can pass the return type to the children of the function root, e.g. to the function body. Someplace deep down in that tree is an actual return statement; if it receives the inherited attribute (return_type,X), it can check that the result it is computing is the correct type.

In practice, you want to be able to define arbitrary sets of attributes for nodes, and pass them up and down the tree for the multiple purposes required to process ASTs (building symbol tables, constructing control flow graphs, doing type checking, computing metrics, ...). An attribute grammar generator is a kind of parser generator that will take grammar rules, sets of attribute definitions, and rules about how to compute synthesized and inherited attributes for the nodes involved in each rule, and generates both a parser and an AST walker that computes all the attributes.

The value of this idea is it provides an organizing principle supported by automation, that can be used to compute many interesting things about ASTs in a regular way. Otherwise you get to code all that stuff using ad hoc code.

Our DMS Software Reengineering Toolkit is an AST manipulation system (actually source-to-source program transformations) that heavily uses parallel attribute evaluation to compute all kinds of useful analyses over ASTs: conventional metrics, symbol tables, type checks (like the return type check I described above), control and data flow extraction from code, as well as other not-so-easily described but useful results computed over subtrees ("list of side effecting assignments in this expression"). Why parallel? Well, attribute computations in subtrees are essentially independent, so the parallelism is already there, and when you deal with really big trees performance matters. DMS often deals with thousands of compilation units, each producing a (possibly big) AST.

Friday, September 23, 2022
Only authorized users can answer the search term. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :