CompSci 654 Spring 2008
The Cool compiler project provides a number of basic data types to make the task of writing a Cool compiler tractable in the timespan of the course. This document provides an overview of the Cool compiler support code, which includes:
This document should be read in conjunction with the source code. With the exception of the abstract syntax tree package (which is automatically generated), there are also extensive comments in the source.
The purpose of the Cool support code is to be easy to understand and use. There are much more efficient implementations of most of the data structures used in the system. It is recommended that students implementing a Cool compiler also stick to simple, obviously correct, and perhaps inefficient implementations--writing a compiler that works correctly is usually difficult enough.
The Cool system is written in C++ and it is assumed that the reader has some familiarity with that language. The base system deliberately uses a simple subset of the language. The heavily used features are: classes, single inheritance, virtual functions, and templates. Overloading is used very sparingly; with one exception, operator overloading is avoided completely (the exception is the operator «). Destructors are not used and in fact memory management is ignored completely. Memory management is a very important part of software development as it is currently practiced in industry, but memory management is also tricky, error-prone, and very time consuming to get completely right. It is suggested that students also not worry about memory management in writing their own Cool compilers.
The file list.h implements a simple linked list datatype. The operations are similar to those provided in Lisp-family languages: a constructor List adds a new element to the front of a list, hd returns the first element of a list, and tl returns the tail of a list. Functions are also provided for computing the length of a list, applying a function to every element of a list, and printing a list. Example uses of lists can be found in the implementation of the symbol tables. In fact, this is the only place where we use this ADT; elsewhere we use the STL.
All compilers manage large numbers of strings such as program identifiers, numerical constants, and string constants. Often, many of these strings are the same. For example, each identifier typically occurs many times in a program. To ensure that string constants are stored compactly and manipulated efficiently, a specialized data structure, the string table, is employed.
A string table is a lookup table that maintains a single copy of each string. The Cool string table class provides methods for inserting and querying string tables in a variety of ways (see the file stringtab.h). The components of Cool string tables are of type Entry. Each Entry stores a string, the length of the string, and an integer index unique to the string.
An important point about the structure of the Cool compiler is that there are actually three distinct string tables: one for string constants (stringtable), one for integer constants (inttable), and one for identifiers (idtable). The code generator must distinguish integer constants and string constants from each other and from identifiers, because special code is produced for each string constant and each integer constant in the program. Having three distinct string tables makes this distinction easy. Note that each of the three tables has a different element type (StrEntry, IntEntry, and IdEntry), each of which is a derived class of Entry. Throughout the rest of the compiler (except parts of the code generator), a pointer to an Entry is called a Symbol, irrespective of whether the symbol represents an integer, string, or identifier.
Because string tables store only one copy of each string, comparing whether two IntEntrys, StrEntrys, or IdEntrys x and y represent the same string can be done simply by comparing the two pointers x == y. Note that it does not make sense to compare entries from different string tables (e.g., IntEntrys with StrEntrys) as these are guaranteed to be different even if the strings are the same.
Three methods are provided to add elements to a table: add_string(char *s,int m), which adds a C string s of exactly m characters; add_string(std::string s), which adds a C++ string s to the table; and add_int(int i), which converts integer i to a string and adds the string to the table. Each of these methods returns a type derived from Entry to describe the symbol table entry, on which the method get_string is defined for extracting the entry's (C++) string.
In addition to strings, compilers must also determine and manage the scope of program names. A symbol table is a data structure for managing scope. Conceptually, a symbol table is just another lookup table. The key is the symbol (the name) and the result is whatever information has been associated with that symbol (e.g., the symbol's type).
In addition to adding and removing symbols, symbol tables also support operations for entering and exiting scopes and for checking whether an identifier is already defined in the current scope. The lookup operation must also observe the scoping rules of the language; if there are multiple definitions of identifier x, the scoping rules determine which definition a lookup of x returns. In most languages, including Cool, inner definitions hide outer definitions. Thus, a lookup on x returns the definition of x from the innermost scope with a definition of x.
Cool symbol tables are implemented as lists of scopes, where each scope is a list of (identifier,data) pairs. The ``data'' is whatever data the programmer wishes to associate with each identifier. The symbol table operations are very straightforward to define on this structure and are documented in symtab.h. An example illustrating the use of symbol tables is in the file symtab_example.cc.
The files utilities.h and utilities.cc define a few functions useful in writing and debugging a Cool parser and lexical analyzer. See the source code for documentation.
After lexical analysis and parsing, a Cool program is represented internally by the Cool compiler as an abstract syntax tree. The project comes with a definition of Cool abstract syntax trees (ASTs) built in. The AST package is by far the largest piece of code in the base system and requires the most time to learn. The learning process is made more complex because the AST code is generated automatically from a specification in the file cool-tree.aps, with handcoded parts added. While the generated code is quite simple and regular in structure, it has few comments. This section serves as the documentation for the AST package.
The AST data type provides, for each kind of Cool construct, a class
for representing expressions of that kind. There is a class for let expressions, another class of + expressions, and so on.
Objects of these classes are nodes in Cool abstract syntax trees.
For example, an expression
+
is represented by a +
expression object, which has two subtrees: one for the tree representing
the expression
and one for the tree representing the
expression
.
The Cool abstract syntax is specified in a language called APS. In APS terminology, the various kinds of abstract syntax tree nodes (let, plus, etc.) are called constructors. (Don't confuse this use of the term ``constructor'' with C++ constructors; while similar, this is a slightly different meaning taken from functional languages that predates C++.) The form of the AST is described by a set of phyla. Each phylum has one or more constructors.
Phyla are really just types. That is, instead of having one large group of undifferentiated constructors, the constructors are grouped together according to function, so that, for example, the constructors for expression ASTs are distinguished from the constructors for class ASTs. The phyla are defined at the beginning of cool-tree.aps:
module COOL begin ... phylum Program; phylum Class_; phylum Classes := SEQUENCE[Class_]; phylum Feature; phylum Features := SEQUENCE[Feature]; phylum Formal; phylum Formals := SEQUENCE[Formal]; phylum Expression; phylum Expressions := SEQUENCE[Expression]; phylum Case; phylum Cases := SEQUENCE[Case]; ...
From the definition it can be seen that there are two distinct kinds of phyla: ``normal'' phyla and sequence phyla. ``Normal'' phyla each have associated constructors; sequence phyla have a fixed set of sequence operations.
Each constructor takes typed arguments and returns a typed result. The types may either be phyla or any ordinary C++ type. In fact, the phyla declarations are themselves compiled into C++ class declarations by an APS compiler. A sample constructor definition is
constructor class_(name : Symbol; parent: Symbol; features : Features;
filename : Symbol) : Class_;
This declaration specifies that the class_ constructor1 takes four
arguments: a Symbol (a type identifier) for the class name, a
Symbol (another type identifier) for the parent class, a
Features, and a Symbol for the filename in which the class definition
occurs. The phylum Features is defined to be a sequence of Feature's
by the declaration
phylum Features = SEQUENCE[Feature];See Section 6.2 for a description of the operations defined on AST sequences.
The class_ constructor returns an AST of type (or phylum) Class_. In cool.y there is the following example of a use of the class_ constructor:
class : CLASS classname superclass IS dummy_feature_list END ';'
{ $$ = class_($2,$3,$5,stringtable.add_string(curr_filename));}
The class_ constructor builds a Class_ tree node with the
four arguments as children. Because the phyla (types) of the arguments
are declared, the C++ type checker enforces that the class_ constructor is
applied only to arguments of the appropriate type.
See Section 6.5 and cool-tree.aps to learn the definitions of the other constructors.2
There is a real danger of getting confused because the same names are used repeatedly for different entities in different contexts. In the example just above, small variations of the name class are used for a terminal (CLASS), a non-terminal (class), a constructor (class_), and a phylum (Class_). These uses are all distinct and mean different things. There is also a class_ member of the union declaration in cool.y, which means yet something else. Most uses are distinguished consistently by capitalization, but a few are not. When reading the code it is important to keep in mind the role of each symbol.
Sequence phyla have a distinct set of operations for constructing and accessing sequences. For each phylum named X there is a phylum called Xs (except for Classes, which is a sequence of Class_ nodes) of type SEQUENCE[X]. Sequence creating functions are defined automatically for each sequence phylum. For the Classes phylum, the following non-member functions are defined:
Classes nil_Classes(); Classes single_Classes(Class_); Classes append_Classes(Classes,Classes);The function nil_phylum() returns an empty sequence of type phylum. The function single_phylum makes a sequence of length 1 out of its argument. The function append_phylum destructively appends two sequences of type phylum.
Each sequence phylum is a C++ class with the following members.
tree_list_node) is defined in tree.h.
All AST classes are derived from the class tree_node, defined in tree.h. The tree_node class definition contains everything needed in an abstract syntax tree node except information specific to particular constructors. There is a protected data member line_number, the line number where the expression corresponding to the AST node appeared in the source file. The line number is used by a Cool compiler to give good error messages.
Several functions are defined on all tree_nodes. The important functions are: dump, which prints an AST and get_line_number, which is a selector for the corresponding data member.
Each of the phyla is a class derived directly from tree_node. As stated previously, the phyla exist primarily to group related constructors together and as such do not add much new functionality.
Each of the constructors is a class derived from the appropriate phyla. Each of the constructor classes defines a function of the same name that can be used to build AST nodes. The dump function is also defined automatically for each constructor.
Each class definition of the tree package comes with a number of members. Some of the member functions are discussed above. This section describes the data members and some more (but not all) of the rest of the functions, as well as how to add new members to the classes.
Each APS constructor
is converted into a C++ class
![]()
_class with a C++ constructor taking the parameters
of the APS constructor. A non-member function named
is also
generated, taking the parameters of the APS constructor.
This non-member function simply returns new
_class(...).
Each APS constructor parameter
becomes a data member named
_
. The data members are protected and thus
are only visible to member functions
of the constructor's class or derived classes.
For example, the class__class class
has four data members:
Symbol _name; Symbol _parent; Features _features; Symbol _filename;You may add new data members or member functions to any class.
This section briefly describes each constructor and its role in the compiler. Each constructor corresponds to a portion of the Cool grammar. The order of arguments to a constructor follows the order in which symbols appear in productions in the Cool syntax specification in the manual. This correspondence between constructors and program syntax should make clear how to use the arguments of constructors. It may be helpful to read this section in conjunction with cool-tree.aps.
no_expr() as its entire body.
native.
when name : typeid => exprwhich corresponds to the field names in the obvious way. Use this constructor to build an AST for each branch of a case expression.
static_dispatch node is used for
super calls. The node requires an object (use self) and a type name
(use the name of the superclass of the current class).
elsif's. Instead
each elsif must expanded into another nested cond node.
For example:
no_expr node.
The first identifier should have the outermost let node and later
bindings occur inside. For example:
let i : Integer; j : Integer := i; in ... endshould be expressed as
let(#i,#Integer,no_expr(),
let(#j,#Integer,object(#i),
block(...)))
(where # means that the following identifier should be seen as
a symbol.)
- expressions.
= expressions are converted into
dispatches on a method named equals.
dispatch.
Since the name of the constructor is the same as its class, we have
situations such as the following example new Foo(void) is parsed
as:
dispatch(new_(#Foo),#Foo,single_Expressions(null()))
There are a few common errors people make using a tree package.
if (x == nil_Expression()) { ... }
is always false, because nil_Expression() creates a new empty sequence
each time it is used. To check whether a sequence is empty, use the
size() method (see tree.h).
if (x == no_expr()) { ... }
is always false, because no_expr creates a new AST each time
it is called. Define a virtual method to determine
the constructor of x (see Section 6.4).
The runtime system consists of a set of hand-coded assembly language functions that are used as subroutines by Cool programs. Under spim the runtime system is in the file trap.handler. The use of the runtime system is a concern only for code generation, which must adhere to the interface provided by the runtime system.
The runtime system contains four classes of routines:
The Cool runtime system is in the file $CLASSHOME/lib/trap.handler; it is loaded automatically whenever spim/xspim is invoked. Comments in the file explain how the predefined functions are called.
The following sections describe what the Cool runtime system assumes about the generated code, and what the runtime system provides to the generated code. Read Section 13 of the CoolAid manual for a formal description of the execution semantics of Cool programs.
| offset -4 | Garbage Collector Tag |
| offset 0 | Class tag |
| offset 4 | Object size (in 32-bit words) |
| offset 8 | Dispatch pointer |
| offset 12... | Attributes |
Figure 1 shows the layout of a Cool object; the offsets are given in numbers of bytes. The garbage collection tag is -1. The class tag is a 32-bit integer identifying the class of the object. The runtime system uses the class tag in equality comparisons between objects of the basic classes and in the abort functions to index a table containing the name of each class.
The object size field and garbage collector tag are maintained by the runtime system; only the runtime system should create new objects. However, prototype objects (see below) must be coded directly by the code generator in the static data area, so the code generator should initialize the object size field and garbage collector tag of prototypes properly. Any statically generated objects must also initialize these fields.
The dispatch pointer is never actually used by the runtime system. Thus, the structure of dispatch information is not fixed. You should design the structure and use of the dispatch information for your code generator. In particular, the dispatch information should be used to invoke the correct method implementation on dynamic dispatches.
For Integer objects, the only attribute is the 32-bit value of the integer. For Boolean objects, the only attribute is the 32-bit value 1 or 0, representing either true or false. The first attribute of String objects is an object pointer to an Integer object representing the size of the string. The actual sequence of ASCII characters of the string starts at the second attribute (offset 16), terminates with a 0, and is then padded with 0's to a word boundary. For Array objects, the first attribute is a pointer to an Integer object giving the number of entries in the array, then the actual entries follow afterwards.
The value void is a null pointer and is represented by the 32-bit value 0. All uninitialized variables (except variables of type Integer, Boolean, and String; see the CoolAid) are set to void by default.
The only way to allocate a new object in the heap is to use the Object.copy method. Thus, there must be an object of every class
that can be copied. For each class
in the Cool program, the code
generator should produce a skeleton
object in the data area; this
object is the prototype of class
.
For each prototype object the garbage collection tag, class tag, object size, and dispatch information must be set correctly. The attributes should be set to the default values for the specific type: for the basic classes Integer, Boolean, and String, the attributes should be set to the defaults specified in the CoolAid. For the other classes, the default value is void (zero).
The primitive methods in the runtime system expect arguments in register $a0 and on the stack. Usually $a0 contains the self object of the dispatch. Additional arguments should be on top of the stack, first argument pushed first . Some of the primitive runtime procedures expect arguments in particular registers.
Figure 2 shows which registers are used by the runtime system. The runtime system may modify any of the scratch registers without restoring them (unless otherwise specified for a particular routine). The heap pointer is used to keep track of the next free word on the heap, and the limit pointer is used to keep track of where the heap ends. These two registers should not be modified or used by the generated code--they are maintained entirely in the runtime system. All other registers, apart from $at,$sp,$ra, remain unmodified.
| Scratch registers | $v0,$v1,$a0-$a2,$t0-$t2 |
|---|---|
| Heap pointer | $gp |
| Limit pointer | $s7 |
The Cool runtime system refers to the fixed labels listed in Figure 3. Each entry describes what the runtime system expects to find at a particular label and where (code/data segment) the label should appear.
| Main_protObj | The prototype object of class Main | Data |
|---|---|---|
| Main.Main | The constructor for class Main | Code |
| $a0 contains the initial Main object | ||
| Integer_protObj | the prototype object of class Integer | Data |
| Integer.Integer | code that initializes an object of class Integer | Code |
| passed in $a0 | ||
| String_protObj | the prototype object of class String | Data |
| String.String | code initializing an object of class String | Code |
| passed in $a0 | ||
| _int_tag | a single word containing the class tag for the Integer class | Data |
| _bool_tag | a single word containing the class tag for the Boolean class | Data |
| _string_tag | a single word containing the class tag for the String class | Data |
| class_nameTab | a table, which at index (class tag)*4 contains a pointer | Data |
| to a String object containing the name of the class associated | ||
| with the class tag | ||
| bool_const0 | the Boolean object representing the boolean value false | Data |
| bool_const1 | the Boolean object representing the boolean value true | Data |
Use the following naming conventions for label generation:
<class>.<class> |
for constructor of class
<class> |
<class>.<method> |
for method <method> code of class
<class> |
<class>_protObj |
for the prototype object of class
<class> |
| Object.Object | Immediately returns. |
|---|---|
| Object.copy | A procedure returning a fresh copy of the object passed in $a0 |
| Result will be in $a0 | |
| Object.abort | A procedure that prints out the class name of the object in $a0 |
| Terminates program execution | |
| Object.type_name | Returns the name of the class of object passed in $a0 as a string object |
| Uses the class tag and the table class_nameTab | |
| Object.equals | Compares self passed in $a0 against the argument as a pointer. |
| Result in $a0: one of bool_const0 or bool_const1. | |
| IO.out_string | The value of the string object on top of the stack is printed to |
| the terminal. Does not modify $a0. | |
| IO.out_int | The integer value of the Integer object on top of the stack is printed |
| to the terminal. Does not modify $a0. | |
| IO.in_string | Reads a line from the terminal and returns a string object in $a0. |
| (The newline that terminates the input is not part of the string.) | |
| IO.in_int | Reads an integer from the terminal and returns the read Integer object |
| in $a0. | |
| String.equals | Compares $a0 against argument as strings. |
| String.concat | Returns a new string, made from concatenating the string object on top |
| of the stack to the string object in $a0. Return value in $a0 | |
| String.substr | Returns the substring of the string object passed in $a0, from index i |
| with length l. The length is defined by the integer object on top of | |
| the stack, and the index by the integer object on the stack below l. | |
| Result in $a0. | |
| Array.Array | Returns a new array in $a0 of size on top of stack. |
| Array.resize | Returns a new array in $a0 of size on top of stack. |
| Elements are copied over from array object in $a0. | |
| Array.get_item | Returns the element at index on top of stack |
| in array object in $a0. Result in $a0. | |
| Array.set_item | Stores the element on top of stack at index next in stack |
| in array object in $a0. Array object left in $a0. | |
| Integer.equals | Tests whether self in $a0 is same as argument as integers. |
| Boolean.equals | Tests whether self in $a0 is same as argument as booleans. |
| _dispatch_abort | Called when a dispatch is attempted on a void object. Prints the line |
| number, from $t1, and filename, from $a0, at which the dispatch | |
| occurred, and aborts. | |
| _case_abort | Should be called when a case statement has no match. |
| First prints the filename, from $a1 and line, from, $t1 | |
| Then it prints a message including the class name of the object | |
| (or void) in $a0, and execution halts. | |
| _case_abort2 | Called when a case is attempted on a void object. Prints the line |
| number, from $t1, and filename, from $a0, at which the case | |
| occurred, and aborts. |
The code generator must define Main.Main. Main.Main should execute the parent constructor and then execute its own body.
The Cool runtime environment includes two different garbage collectors, a generational garbage collector and a stop-and-copy collector. The generational collector is the one used for programming assignments; the stop-and-copy collector is not currently used. The generational collector automatically scans memory for objects that may still be in use by the program and copies them into a new (and hopefully much smaller) area of memory.
Generated code must contain definitions specifying which of several possible configurations of the garbage collector the runtime system should use. The location _MemMgr_INITIALIZER should contain a pointer to a initialization routine for the garbage collector and _MemMgr_COLLECTOR should contain a pointer to code for a collector. The options either no collection, generational collection, or stop-and-copy collection; see comments in the trap.handler for the names of the INITIALIZER and COLLECTOR routines for each case. If the location _MemMgr_TEST is non-zero and a garbage collector is enabled, the garbage collector is called on every memory allocation, which is useful for testing that garbage collection and code generation are working properly together.
The collectors assumes every even value on the stack that is a valid heap address is a pointer to an object. Thus, a code generator must ensure that even heap addresses on the stack are in fact pointers to objects. Similarly, the collector assumes that any value in an object that is a valid heap address is a pointer to an object (the exceptions are objects of the basic classes, which are handled specially).
The collector updates registers automatically as part of a garbage collection. Which registers are updated is determined by a register mask that can be reset. The mask should have bits set for whichever registers hold heap addresses at the time a garbage collection is invoked; see the file trap.handler for details.
Generated code must notify the collector of every assignment to an
attribute. The function _GenGC_Assign takes an updated
address
in register $a1 and records information about
that the garbage collector needs. If, for example, the attribute at
offset 12 from the $self register is updated, then a correct
code sequence is
sw $x 12($self) addiu $a1 $self 12 jal _GenGC_AssignCalling _GenGC_Assign may cause a garbage collection. Note that if this garbage collector is not being used it is equally important not to call _GenGC_Assign.