The .NET Runtime as a Compiler Target

John Gough, Queensland University of Technology

 

Abstract: the .NET runtime is intended to support a variety of languages, and to facilitate language interoperation through the Common Language Specification (CLS).

This white-paper describes the initial experience of creating a compiler hosted on the runtime. It was found that the runtime is an attractive target to compile to, with the facilities of the platform allowing rapid development.

 

Introduction

The Programming Languages and Systems Research Center at the Queensland University of Technology has a long association with the production of compilers for Pascal-family languages.  The Gardens Point range of compilers for Modula-2 and other languages have been available for more than ten years.  This paper describes the experience of creating a new compiler, for the language Component Pascal.  This new compiler currently supports two code emitters: one for the .NET runtime, and another for the Java Virtual Machine. Both targets are based around abstract machines, and rely on the use of just-in-time (JIT) compiler technology to achieve high performance at runtime. The compiler is written in Component Pascal, and compiles itself successfully.

 

The Source Language

Component Pascal is a Pascal-family language, which supports object oriented programming styles.  Its immediate ancestors are Oberon-2, Modula-2 and Pascal. The language has single inheritance, and allows independent compilation of modules. The language has a rich system of type annotations that are intended to make the programming of software components robust and type-safe.  In particular the language avoids many of the classic problems of object oriented programming, such as the so-called fragile base class problem.  Classes and methods may only be extended if they are annotated as extensible, and each new method declaration must declare whether it is intended to override an existing method, or is intended to be new.

 

Apart from any inherent interest in the language itself, the use of Component Pascal as a test case for the runtime platform is a useful one.  Pascal-family languages have a number of constructs that  do not occur in languages of the C-family, so that there are a number of new problems to be solved.  These issues include nested procedure definitions, non-local addressing, and a different notion of identifier visibility.

 

The Runtime Target

The .NET runtime consists of a rich set of APIs, and a framework which allows for language interoperation.  In the most common case, executable program code is loaded on demand, and is compiled by a JIT into native code instructions on the execution platform.  An important part of the platform is a PE Verifier that applies stringent tests to the program executable components to ensure type safety and correct operation. Programs which pass verification are guaranteed to be free from certain classes of type errors.  In effect, verified programs cannot break the encapsulation of types, leading to a system of component isolation that is extremely lightweight compared to, say, separate processes.

 

The runtime is fully garbage collected, giving a fair level of guarantee against the problems of memory leaks and premature deallocation that plague many programming environments.

 

The Common Language Specification (CLS) is a subset of the full capabilities of the runtime. It is intended that components which use only the facilities of the CLS are able to seamlessly interoperate with other software components which obey the same standards. The facilities of the CLS provide an environment that caters for the features of most imperative languages.

 

The software development kit comes with a number of tools to help program developers and compiler writers.  In our case, we chose to emit assembler text files as compiler output, and use the assembler ilasm to produce dynamically linked libraries.

 

The IL Abstract Machine

The abstract machine on which the runtime is based is a stack machine.  The operations of the machine manipulate the elements on an evaluation stack.  Such a design has a long history, with a number of compiler intermediate forms being of this kind.  In particular, the P-Code form, dating from the 1970s, was used for many if not most implementations of Pascal, and a similar intermediate form, U-Code is still used for many multi-language compiler frameworks.

 

Choosing an abstract stack machine as an intermediate form has a number of attractions. Firstly, it is very simple to emit such code during a traversal of tree-like representations of program structure.  Secondly, it is possible to interpret the output code by implementing an emulator for the abstract machine.  For small systems, or as a rapid prototyping path such interpreters have some attraction.  In general however such interpreters trade time for space, since the high program density of the “byte-code” form of the program is obtained at the expense of the overhead of an interpretative loop.

 

In the case of the .NET system, it appears that interpretation has not been intended as an option.  In particular, many instructions of the intermediate form are polymorphic.  For example, there is just one add instruction, which is used for long and short integers, and for the floating-point formats.  In order to interpret such codes directly it would be necessary to track the type of each stack element. Such a design would involve a crippling runtime overhead, unless some kind of preprocessing was used to expand the instruction stream to a non-polymorphic version.  Of course, such expansion might still lead to a viable platform for embedded systems use.

 

Gardens Point Component Pascal

The compiler, as noted, is written in Component Pascal.  The compiler core consists of a recursive descent parser, and builds an abstract syntax tree for the whole of the compilation unit.  Since imports of other modules are explicit in the language, data structures representing the interfaces to other components are created also.

 

The compiler annotates the abstract syntax during a second pass, enforcing conformance to the typing rules of the language.  A further computation performs data flow analysis on the code, ensuring proper initialization of local variables prior to use.  This data flow analysis also checks that method parameters that have been declared to be of “out” mode are assigned to along every path of the code leading to a method exit.  Finally, the compiler enforces the contracts implied by the “extensible” and “new” annotations of the source program.

 

Since this compiler has multiple targets, it was an important design goal to ensure that no target-specific code leaked forward into the target-neutral common core of the compiler.  The structure of the code defines an abstract class named “ClassMaker” which is subclassed for each of the output target languages.  The target language may then be chosen at runtime, and an appropriate class-emitter created and invoked.

 

Generating the Code

As noted earlier, producing code for an abstract stack machine is relatively simple.  However, compared to the traditional stack machines, such as P-Code, there are some other issues to be faced requiring a different understanding of the detail of the code generation process. For example, there is no facility for the address arithmetic that would traditionally access the interior of data structures. Instead, there are special instructions that access named fields, and index array elements.  Once this conceptual obstacle is overcome, the rest is easy.

 

The textual form of the intermediate language appears to be rather verbose, compared to traditional assembly languages.  This is inevitable, since data accesses have to be annotated with a declarative type to allow the verifier to ensure correctness. At a more obvious level, if the program wishes to access a particular field of a structure, then it must state the type of the structure, to enable the meaning of the field name to be clear.  Once the code is processed into the executable form these text-string annotations are replaced by indices into some kind of symbol table.  The executable form is thus quite dense, despite the appearance of the textual form.

 

In order to map the various entities of Component Pascal onto the primitives of the CLS it was decided that each module of source would be translated into a single program executable file.  Thus each module becomes a separate “assembly” with its own DLL file and its own namespace. Static features of the module, such as ordinary, non-virtual procedures and static data are treated as belonging to a synthetic class within the assembly. Likewise, each Record Type within the module, together with any methods defined on that type, becomes a separate class within the assembly.

 

Writing a code emitter for a verified abstract machine has some advantages that are non-obvious, but particularly powerful.  In a conventional setting, mistakes in the code of the code-emitter lead to programs that crash when executed.  In the case of a verified abstract machine, the verifier will reject such code, and diagnose the failure in some more or less helpful way.  The stand-alone verifier is thus the most powerful assistant in the development of the code emitter.  By the time this verifier is satisfied, the code if not correct is at least plausible.  The kind of errors that are detected by this verifier are such things as getting the types wrong on the stack, becoming confused between direct and indirect data accesses, and the insidious bugs which result when different paths through the control flow of the program lead to incompatible stack contents at a control merge point.

 

Interfacing to the CLS

There are some additional considerations that apply to the construction of a compiler that is intended to work to the common language specification.  For example, to be a consumer of CLS components, it is necessary to understand the semantics of various attributes defined in the CLS that do not exist in every source language.  In our case Component Pascal has no concept of interface implementation, nor of the semantics of the protected annotation for object fields.  In order to facilitate interworking, we modified the semantic attribution phase of the compiler so that it understands as much as necessary of CLS semantics.  It is possible for a Component Pascal record type to extend a CLS class, and to declare that the extension implements one or more CLS interface types.  Of course, the compiler then must perform checks to ensure that such programs fulfill the contracts that they have offered.

 

Preliminary Observations and Results

The compiler core consists of approximately 15000 lines of code, with approximately an additional 5000 lines for each target code emitter.  It can be bootstrapped on each of the target platforms. The source code is available for purposes of study and use.

 

At this preliminary stage, there are two significant omissions from the implemented language, as released.  These are procedure variables (function pointers) and non-local variable access in nested procedures.  Both of these features were successfully implemented in the first prototype, but have been removed from the current version while alternative designs are evaluated.

 

Preliminary benchmarks on the output code are extremely promising.  The execution speed of programs compares very well with native code on the same platforms.  For programs which do not allocate large numbers of objects the speed of typical small benchmarks is similar to the same algorithms implemented in native C or Modula-2.

 

To Find out More

More information, and the source code of the compiler, is available at the QUT web site.

http://www.plasrc.qut.edu.au  The author may be contacted at j.gough@qut.edu.au