Programs are the most complicated engineering artifacts known. A compiler is a special type of program. It validates, optimizes and transforms programs into executable code. Compilers teach us how to solve complex problems.
Compiler design is full of beautiful examples where complicated real-world problems are solved by abstracting the essence of the problem mathematically. Aho et al., p. 15
A compiler operates as a sequence of phases, each of which transforms the source program from one intermediate representation to another. Aho et al., p. 36
Compile-time errors are special. Instead of runtime errors, which mean that your program is causing trouble in the world, compile-time errors prevent trouble from ever happening. They lead to high-quality programs.Compile-Time Error
Such is the complexity of compiler theory that it is popularly represented as a dragon. With compiler design techniques, such as syntax directed translation, we can slay this complexity—or at least make it manageable.
Note: Quotes from Aho on this site are taken from the dragon book, Compilers: Principles, Techniques, and Tools.Dragon Book
Note 2: Quotes from Abelson and Sussman are taken from Structure and Interpretation of Computer Programs.Structure and Interpretation of Computer Programs
Note 3: Quotes from Lidin are taken from Expert .NET 2.0 IL Assembler. This book describes low-level details.Expert .NET 2.0 IL Assembler
We next examine the C# compiler and its application of compiler theory. When you compile a C# program, the program is translated into an abstract binary format. This format, called intermediate language, must then also be translated.
Compiler theory divides the compilation of programs into several different phases. At first, the program must be read from the text file, and then important characters are recognized as lexemes.
Lexeme: The term lexeme is used to refer to the textual representation of a token.
A token is a structure that combines a lexeme and also information about that lexeme. After the tokens are determined, the compiler uses internal data structures (intermediate representations) to improve the form of programs.
Note: Lexical refers to the text representation of programs. Lexeme refers to the text representation of keywords and more.
And: Tokens combine lexemes and symbolic information about lexemes. The symbol table stores information about tokens.Token
Next we apply at a high level the compiler phases to the C# compiler system typically used, such as in the .NET Framework. When you compile a C# program in Visual Studio, the csc.exe program is invoked on the program text.
Next: All the compilation units are combined in a preliminary step. The C# compiler proves errors in your program.
Programs are interpreted at compile-time and runtime. Compile-time analysis is static, meaning not dynamic. Runtime analysis is dynamic. Static analysis does not impact performance of execution. Runtime analysis can slow programs down.
The C# compiler uses a process called definite assignment analysis. Here, it proves that variables are not used before they are initialized. This step significantly reduces the number of security problems and bugs in C# programs.
Tip: Definite assignment analysis ensures higher program quality because the programs are tested more at compile-time.Definite Assignment
The C# compiler applies certain inferential logic at compile-time. This has no penalty at execution. It finds the best overloaded method based on its parameters, or the best overloaded method based on the type of its parameters.Overload
At the C# compilation stage, number transformations are applied. Numbers are "promoted" to larger representations to enable compilation with certain operators. And some casts not present in the program text are added.
Note: This is done to enable shorter and clearer high-level source code, and to ensure an accurate lower-level implementation.Numeric Promotion
The compiler uses node-based logic to rearrange conditional statements and loops, which both use jump instructions. Code often will be compiled to use branch instructions that do not reflect exactly the source text.If Loop Constructs: For, While and Foreach
For example, the C# compiler will change while-loops into the same thing as certain for-loops. It has sophisticated logic, presumably based on graph theory, to transform your loops and nested expressions into efficient representations.
In compiler theory, some levels of indirection can be eliminated by actually injecting the constant values into the representation of the program directly. This is termed constant folding.
Note: My benchmarks have shown that constant values do provide performance benefits over variables.
And: If you look at your compiled program, all constants will be directly inside the parts of methods where they were referenced.
String literals are pooled together and constant references to the stream of string data in the compiled program are placed where you used the literals. Therefore, the literals themselves are not located where you use them in methods.
Instead: The string literal in your program is transformed into a pointer to pooled data.String Literal
A C# program is compiled into a relational database called the program metadata. This is also considered an abstract binary representation. The metadata is an efficient encoding of the program from the C# compiler.
And: The metadata is stored on the disk, and it does not contain comments in your source code.
The metadata is divided into many different tables. These tables contain records that point to different tables and different records. It is not typically important to study the metadata format unless you are writing a compiler.
Structural programming, which represents logic as procedure calls, uses methods. In the metadata, method bodies do not store the names of their local variables. This information is lost at compile-time. Parameter names are retained.
Note: The goal was to improve the level of optimization on method bodies and eliminate unneeded information, reducing disk usage.Methods
We have taken a high-level C# source text and translated it into a relational database called metadata. When you execute this metadata, the Common Language Runtime for the .NET Framework is started, which incurs a lot of overhead.
Then: As you run the program each method is read from the metadata. Intermediate language code is translated into machine-level code.
Just-in-time compilation. The Common Language Runtime (CLR) applies several optimizations to the methods. It will sometimes insert the methods at their call site in an optimization called function inlining.
Also, the system rewrites instruction layouts in memory to eliminate unnecessary indirections. Each pointer dereference costs time. By removing this dereference, fewer instructions are needed. Fewer clocks are then required at runtime.
Note: The JIT system does cause a slowdown when first used. Therefore, it is most beneficial on long-running programs..NET
These articles help us understand compilers and some of the optimizations they can achieve. We gain insight into how loops are analyzed, as with induction variables and data dependencies. We also look into methods and inlining.Code Motion Induction Variable JIT Compilation Optimization Misnomer: Compiler Theory
We explored compiler theory as it applies to the C# language and .NET Framework. We looked at the elaborate series of phases, at compile-time and runtime, through which each C# program passes.
And: Modern computers, and all computer software, revolve around compiler theory. It is situated at the very core of all software.