C# Compiler Explanation

Compiler Principles

Programs are the most complicated engineering artifacts known. A compiler is a special type of program. It validates, optimizes and transforms programs into executable code. Compilers are worth studying. They teach us how to solve problems of tremendous complexity.

"Compiler design is full of beautiful examples where complicated real-world problems are solved by abstracting the essence of the problem mathematically." (Aho et al., p. 15)

Errors

Compile-time errors are special. Unlike runtime errors, which mean that your program is causing trouble in the world, compile-time errors prevent trouble from ever happening.

Compile-Time Error

Optimizations

These articles help us understand compilers and some of the optimizations they can (but sometimes don't) achieve. We gain insight into how loops are analyzed, as with induction variables and data dependencies; we also look into methods and inlining.

Code Motion, Induction Variable, JIT Compilation, Optimization Misnomer

Books

Such is the complexity of compiler theory that it is popularly represented as a dragon. With compiler design techniques, such as syntax-directed translation, we can slay this complexity, or at least make it manageable.

Dragon Book, Expert .NET 2.0 IL Assembler Review, Structure and Interpretation of Computer Programs

Note: Quotes from Aho et al. on this site are taken from the dragon book.
Quotes from Abelson and Sussman are taken from Structure and Interpretation of Computer Programs.

C# compiler

Next, we examine the C# compiler and its application of compiler theory. When you compile a C# program, the program is translated into an abstract binary format called intermediate language. This format is not executed directly: it must be translated again, into machine code, when the program runs. We describe some steps of this process.

"A compiler operates as a sequence of phases, each of which transforms the source program from one intermediate representation to another." (Aho et al., p. 36)

Compiler phases. Compiler theory divides the compilation of programs into several phases. First, the program text is read from the source file, and meaningful sequences of characters are recognized as lexemes. A lexeme is the textual representation of a token. A token is a structure that combines a lexeme with information about that lexeme, such as its kind. After the tokens in the program text are determined, the compiler builds internal data structures called intermediate representations, which it can transform to make the program more efficient.

Note: Lexical refers to the text representation of programs. A lexeme is the text of a keyword, identifier, literal, operator or similar unit. A token combines a lexeme with symbolic information about it. The symbol table stores information about the identifiers in the program, such as their names and types.
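
The relationship between lexemes and tokens can be sketched in a few lines of C#. This is a minimal illustration, not how the C# compiler itself is written: the TokenKind, Token and TinyLexer names are hypothetical, only three token kinds are recognized, and keywords are not distinguished from identifiers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical token kinds for this sketch; a real C# lexer has many more.
enum TokenKind { Identifier, Number, Operator }

// A token pairs a lexeme (the matched text) with information about it.
sealed class Token
{
    public TokenKind Kind { get; }
    public string Lexeme { get; }

    public Token(TokenKind kind, string lexeme)
    {
        Kind = kind;
        Lexeme = lexeme;
    }

    public override string ToString() => $"{Kind}:{Lexeme}";
}

static class TinyLexer
{
    // Scans the text left to right and groups characters into tokens.
    public static List<Token> Tokenize(string text)
    {
        var tokens = new List<Token>();
        int pos = 0;
        while (pos < text.Length)
        {
            char c = text[pos];
            if (char.IsWhiteSpace(c)) { pos++; continue; }

            int start = pos;
            TokenKind kind;
            if (char.IsLetter(c) || c == '_')
            {
                while (pos < text.Length &&
                       (char.IsLetterOrDigit(text[pos]) || text[pos] == '_')) pos++;
                kind = TokenKind.Identifier;
            }
            else if (char.IsDigit(c))
            {
                while (pos < text.Length && char.IsDigit(text[pos])) pos++;
                kind = TokenKind.Number;
            }
            else
            {
                pos++; // treat any other character as a one-character operator
                kind = TokenKind.Operator;
            }
            tokens.Add(new Token(kind, text.Substring(start, pos - start)));
        }
        return tokens;
    }

    static void Main()
    {
        foreach (Token token in Tokenize("total = count + 42;"))
            Console.WriteLine(token);
    }
}
```

For the input shown, the sketch produces six tokens: two identifiers, one number, and three operators (including the semicolon).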

C# compiler phases. Here we apply the compiler phases, at a high level, to the C# compiler typically used with the .NET Framework. When you compile a C# program in Visual Studio, the C# compiler (csc.exe) is invoked on the program text. According to the rules of the language specification, all the compilation units are combined in a preliminary step so that every part of the program is discovered. The C# compiler tries its hardest to prove that errors exist in your program: these are termed compile-time errors.

Note: Programs are analyzed at compile-time and at runtime. Compile-time analysis is static: it examines the program without running it. Runtime analysis is dynamic. Static analysis adds no cost to execution, whereas runtime analysis can slow programs down.

Compile-time errors. For example, the C# compiler uses a process called definite assignment analysis to prove that local variables are not read before they are assigned. This step alone removes a whole class of bugs, and some security problems, from C# programs: a read of an unassigned local is rejected at compile-time instead of being discovered at runtime.
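
A small sketch of definite assignment follows. The DefiniteAssignmentDemo class and its method names are made up for illustration. The rejected version is left in a comment so that the example still compiles.

```csharp
using System;

static class DefiniteAssignmentDemo
{
    // Accepted: 'label' is assigned on every path that reaches the return.
    public static string Describe(int n)
    {
        string label;
        if (n < 0)
            label = "negative";
        else
            label = "non-negative";
        return label;
    }

    // Rejected if uncommented: when n >= 0 no assignment happens, so the
    // compiler reports error CS0165 (use of unassigned local variable).
    //
    // public static string DescribeBroken(int n)
    // {
    //     string label;
    //     if (n < 0)
    //         label = "negative";
    //     return label;
    // }

    static void Main()
    {
        Console.WriteLine(Describe(-5)); // prints "negative"
        Console.WriteLine(Describe(3));  // prints "non-negative"
    }
}
```

The analysis is conservative: it reasons about paths through the method, not about the values that n can actually take.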

Type inference. The C# compiler can also apply certain inferential logic at compile-time, and because this work is not repeated at runtime, it has no penalty at execution. For example, the C# compiler uses overload resolution to find the best overloaded method based on the number of arguments and on their types.
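
As a hedged illustration, the sketch below shows the compiler choosing an overload from the static type of each argument, and inferring the type of a var local. The OverloadDemo class and its Show methods are hypothetical names for this example only.

```csharp
using System;

static class OverloadDemo
{
    // Four overloads; the compiler selects one at compile-time
    // from the static type of the argument.
    public static string Show(int value) => "int";
    public static string Show(long value) => "long";
    public static string Show(double value) => "double";
    public static string Show(object value) => "object";

    static void Main()
    {
        Console.WriteLine(Show(1));    // prints "int"
        Console.WriteLine(Show(1L));   // prints "long"
        Console.WriteLine(Show(1.5f)); // prints "double": float converts implicitly to double
        Console.WriteLine(Show("x"));  // prints "object": only the object overload applies

        // 'var' asks the compiler to infer the local's type from the initializer.
        var count = 10;
        Console.WriteLine(count.GetType().Name); // prints "Int32"
    }
}
```

None of these decisions is made again when the program runs: the compiled call sites already name the chosen method.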

Numeric promotion. At the C# compilation stage, certain number transformations are also applied. Operands are "promoted" to larger numeric types so that an operator can be applied to them; adding two byte values, for instance, yields an int. Also, some conversions that are not present in the program text are added by the C# compiler. This is done to enable shorter and clearer high-level source code, and to ensure an accurate lower-level implementation.
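
The following sketch shows two promotions that the compiler inserts without any cast in the source. PromotionDemo and its method names are assumptions made for this example.

```csharp
using System;

static class PromotionDemo
{
    // Both byte operands are promoted to int before the addition,
    // so the inferred type of 'sum' is int, not byte.
    public static string TypeOfByteSum(byte a, byte b)
    {
        var sum = a + b;
        return sum.GetType().Name;
    }

    // The int operand is promoted to double, so this is not integer division.
    public static double Divide(int i, double d) => i / d;

    static void Main()
    {
        byte a = 200, b = 100;
        Console.WriteLine(a + b);               // prints 300: no overflow, the sum is an int
        Console.WriteLine(TypeOfByteSum(a, b)); // prints "Int32"
        double quotient = Divide(7, 2.0);       // 7 / 2.0 evaluates to 3.5
        Console.WriteLine(quotient == 3.5);     // prints "True"

        // byte c = a + b;  // does not compile: the sum is an int, so a cast is required
    }
}
```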

Numeric Promotion

If-statements and loops. The C# compiler also rearranges conditional statements and loops, both of which are implemented with jump instructions. For this reason, your code will often be compiled to branch instructions that do not reflect your source text exactly. For example, the C# compiler compiles a while-loop into the same form as the equivalent for-loop. It has sophisticated logic, presumably based on graph theory, to transform your loops and nested expressions into efficient representations.
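
To make the idea of loops as jumps concrete, the sketch below writes the same loop three ways. The third version uses goto to imitate, in source form, a typical branch layout with the condition test at the bottom. It is an approximation for illustration: the exact instructions the compiler emits depend on its version and settings, and LoopDemo is a hypothetical name.

```csharp
using System;

static class LoopDemo
{
    public static int SumWhile(int n)
    {
        int i = 0, sum = 0;
        while (i < n)
        {
            sum += i;
            i++;
        }
        return sum;
    }

    public static int SumFor(int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += i;
        return sum;
    }

    // The same loop expressed with explicit jumps: jump to the test first,
    // then branch back to the body while the condition holds.
    public static int SumWithJumps(int n)
    {
        int i = 0, sum = 0;
        goto check;
    body:
        sum += i;
        i++;
    check:
        if (i < n) goto body;
        return sum;
    }

    static void Main()
    {
        Console.WriteLine(SumWhile(5));     // prints 10
        Console.WriteLine(SumFor(5));       // prints 10
        Console.WriteLine(SumWithJumps(5)); // prints 10
    }
}
```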

If, Loop Constructs

Constant folding. In compiler theory, some levels of indirection can be eliminated by injecting constant values directly into the representation of the program, and expressions built only from constants can be evaluated at compile-time. The latter is termed constant folding. My benchmarks have shown that constant values do provide performance benefits over variables. If you look at your compiled program, the values of all constants will be directly inside the parts of methods where they were referenced.
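
A short sketch of both effects follows. ConstantDemo and its members are illustrative names, not part of any library.

```csharp
using System;

static class ConstantDemo
{
    // A constant expression: the compiler evaluates it at compile-time,
    // so the compiled program contains 86400, not three multiplications.
    public const int SecondsPerDay = 60 * 60 * 24;

    // Constant string concatenation is folded into one literal as well.
    public const string Greeting = "Hello, " + "compiler";

    // The value of SecondsPerDay is placed directly into this method body;
    // no field is read at runtime.
    public static int SecondsInDays(int days) => days * SecondsPerDay;

    static void Main()
    {
        Console.WriteLine(SecondsPerDay);    // prints 86400
        Console.WriteLine(Greeting);         // prints "Hello, compiler"
        Console.WriteLine(SecondsInDays(2)); // prints 172800
    }
}
```

Because the value is copied into each use site, changing a public const in one assembly requires recompiling the assemblies that reference it.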

String literals. In the C# compiler, string literals are pooled together, and references to the pooled string data in the compiled program are placed where you used the literals. Therefore, the literal text is not stored inside the methods that use it; each use is transformed into a reference to the pooled data.
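
One observable consequence of pooling is that equal literals in the same assembly refer to a single string object, while a string built at runtime does not. The sketch below checks this with reference comparisons; StringLiteralDemo and its method names are hypothetical.

```csharp
using System;

static class StringLiteralDemo
{
    // Two equal literals in the same assembly share one string instance.
    public static bool LiteralsShareInstance()
    {
        string a = "compiler";
        string b = "compiler";
        return object.ReferenceEquals(a, b);
    }

    // A string constructed at runtime is a separate object,
    // even though its characters are the same.
    public static bool RuntimeCopySharesInstance()
    {
        string a = "compiler";
        string copy = new string(a.ToCharArray());
        return object.ReferenceEquals(a, copy);
    }

    static void Main()
    {
        Console.WriteLine(LiteralsShareInstance());     // prints "True"
        Console.WriteLine(RuntimeCopySharesInstance()); // prints "False"
    }
}
```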

String Literal

C# metadata

In the .NET Framework, your C# program is compiled into an assembly whose contents are described by a relational database called the program metadata. This is also considered an abstract binary representation. The metadata is an efficient encoding, generated by the C# compiler, of the declarations in the program text. It is stored on disk alongside the intermediate language code, and it does not contain the comments from your source code.

Relational database. The metadata is divided into many tables, and these tables contain records that point to records in other tables. It is typically not important to study the metadata format unless you are writing a compiler.

Book: See "Expert .NET 2.0 IL Assembler" by Serge Lidin. This book explains the metadata and assembly format for .NET.

Method representation. Structured programming, which represents logic as procedure calls, uses methods extensively. In the metadata, method bodies do not store the names of their local variables; this information is dropped at compile-time and survives only in separate debugging symbols. Parameter names are retained. The goal was to improve the level of optimization on method bodies and eliminate unneeded information, reducing disk usage.
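
Reflection reads this metadata at runtime, so it offers a quick way to see what is kept. In the sketch below, parameters expose a Name, while local variables expose only an index and a type. MetadataDemo is a made-up class; the number of locals reported for Add may vary with compiler settings, so no output is claimed for that part.

```csharp
using System;
using System.Reflection;

static class MetadataDemo
{
    public static int Add(int left, int right)
    {
        int total = left + right; // the name 'total' is not stored in the metadata
        return total;
    }

    // Parameter names are part of the metadata and can be read back.
    public static string[] ParameterNames(MethodInfo method) =>
        Array.ConvertAll(method.GetParameters(), p => p.Name);

    static void Main()
    {
        MethodInfo add = typeof(MetadataDemo).GetMethod(nameof(Add));
        Console.WriteLine(string.Join(", ", ParameterNames(add))); // prints "left, right"

        // LocalVariableInfo has LocalIndex and LocalType, but no Name property.
        MethodBody body = add.GetMethodBody();
        if (body != null)
        {
            foreach (LocalVariableInfo local in body.LocalVariables)
                Console.WriteLine($"local #{local.LocalIndex}: {local.LocalType}");
        }
    }
}
```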

Method Tips

.NET runtime

At this point, we have taken a high-level C# source text and translated it into intermediate language and a relational database called metadata. When you execute the compiled program, the Common Language Runtime for the .NET Framework is started, which incurs startup overhead. Then, as the program runs, each method is typically read from the metadata the first time it is called, and its intermediate language code is translated into machine-level code.

Just-in-time compilation. The Common Language Runtime (CLR) applies several optimizations to methods as it compiles them. It will sometimes insert the body of a method at its call site, an optimization called function inlining. The system will also rewrite instruction layouts in memory to improve efficiency and eliminate unnecessary indirections. Each pointer dereference costs time; by removing a dereference, fewer instructions are needed, and fewer clock cycles are then required at runtime.
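
The JIT compiler decides on its own whether to inline, but a program can state a preference with the MethodImpl attribute. The sketch below is an illustration only: InliningDemo and its methods are hypothetical, AggressiveInlining is a hint that the runtime may ignore, and both versions compute the same result whether or not inlining happens.

```csharp
using System;
using System.Runtime.CompilerServices;

static class InliningDemo
{
    // Hint: ask the JIT compiler to place this body at each call site.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int Square(int x) => x * x;

    // Instruction: keep this as a real call, never inlined.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int SquareAsCall(int x) => x * x;

    public static int SumOfSquares(int n)
    {
        int sum = 0;
        for (int i = 1; i <= n; i++)
            sum += Square(i); // candidate for inlining: no call overhead if inlined
        return sum;
    }

    public static int SumOfSquaresWithCalls(int n)
    {
        int sum = 0;
        for (int i = 1; i <= n; i++)
            sum += SquareAsCall(i); // always a call instruction
        return sum;
    }

    static void Main()
    {
        Console.WriteLine(SumOfSquares(10));          // prints 385
        Console.WriteLine(SumOfSquaresWithCalls(10)); // prints 385
    }
}
```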

.NET

Note: The JIT system does cause a slowdown the first time each method is called, because the method must be compiled before it runs. Therefore, it is most beneficial on long-running programs.

Summary

We explored compiler theory as it applies to the C# language and .NET Framework. We looked at the elaborate series of phases that each C# program is taken through at compile-time and runtime. Compiler theory sits at the very core of modern software.
