How Does a C++ Compiler Work?


Every program we write has to go through a compiler, which converts the source files into an executable file. Basically, the compiler takes each C++ source file it is given and compiles it into an object file. The object files are then linked together with any libraries they use to produce an executable file, which is our program.

Note that each source file is compiled into one object file. The picture below shows the compiler converting Main.cpp into Main.obj.

If our program has n source files, the compiler produces n object files. For example, the compiler generates two object files if two source files are provided (Main.obj and Math.obj in the example below).

We can divide the compiler's work into three main stages. The actual build process involves more steps, but they are not covered here, as this post is meant to be beginner friendly.

  • Stage 1 : Preprocessing
  • Stage 2 : Compiling
  • Stage 3 : Linking

Stage 1 : Preprocessing

In the first stage, the compiler runs the preprocessor on every source file (only source files, not header files on their own). Each C++ source file is turned into a translation unit, which becomes an object file in the next stage. A translation unit is just a preprocessed source file: an implementation file (.c / .cpp) together with all the headers (.h / .hpp) that it includes. It is usually represented as a file with a .i suffix. (Note that this file is normally not written to disk; the compiler only produces it if we specifically request it.)

Here, the preprocessor goes through all our preprocessor directives and resolves them before the compilation stage.
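Besides #include, the directives a beginner meets most often are #define and #if. The sketch below is only an illustration: MAX_ITEMS, SQUARE, USE_FAST_PATH, PathName, SquareOfMax and CheckPreprocessorSketch are hypothetical names that do not appear elsewhere in this post. It shows that the preprocessor only rewrites text, so the compiler never sees the macro names or the discarded branch.

```cpp
#include <cassert>
#include <string>

// Object-like macro: each later MAX_ITEMS token is replaced by 8.
#define MAX_ITEMS 8

// Function-like macro: SQUARE(x) is replaced by ((x) * (x)) as plain text.
#define SQUARE(x) ((x) * (x))

// Hypothetical feature switch, defined here only for illustration.
#define USE_FAST_PATH 1

// Conditional compilation: only one of the two branches reaches the compiler.
#if USE_FAST_PATH
std::string PathName() { return "fast"; }
#else
std::string PathName() { return "slow"; }
#endif

// After preprocessing, the body reads: return ((8) * (8));
int SquareOfMax() { return SQUARE(MAX_ITEMS); }

// Self-check that can be called from any main().
void CheckPreprocessorSketch()
{
    assert(SquareOfMax() == 64);
    assert(PathName() == "fast");
}
```

Changing USE_FAST_PATH to 0 and preprocessing again would keep the other PathName definition instead.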


How the preprocessor resolves #include

The most commonly used preprocessor directive is #include, and it is crucial for every C++ developer to know how it works.

Let’s take a look at a simple Math.cpp that adds two numbers:

Math.h

int num2 = 2;

Math.cpp

#include "Math.h"

int Add(int num1)
{
	return num1 + num2;
}

Note that Math.cpp has an #include directive that includes Math.h. Here, the preprocessor opens Math.h, reads all of its contents, and pastes them into Math.cpp in place of the directive.

To get a better understanding of how it works, we can ask the compiler to give us the preprocessed source file. Let’s have a look at Math.i:

Math.i

#line 1 "D:\\wendi_blog_code_exp\\HelloWorld\\HelloWorld\\Math.cpp"
#line 1 "D:\\wendi_blog_code_exp\\HelloWorld\\HelloWorld\\Math.h"
int num2 = 2;
#line 2 "D:\\wendi_blog_code_exp\\HelloWorld\\HelloWorld\\Math.cpp"

int Add(int num1)
{
	return num1 + num2;
}

You may have noticed that “int num2 = 2;” has been copied from Math.h into Math.cpp. That’s all the preprocessor does for #include; it’s pretty simple.
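Because #include is a plain copy and paste, including the same header twice in one translation unit would paste “int num2 = 2;” twice, and the compiler would report a redefinition. A common remedy is an include guard. The sketch below imitates two pastes of Math.h inside a single file so that it can be compiled on its own; the guard name MATH_H and the helper CheckGuardSketch are assumptions, not part of the original project.

```cpp
#include <cassert>

// ---- first paste of Math.h ----
#ifndef MATH_H          // MATH_H is not defined yet, so this block is kept
#define MATH_H
int num2 = 2;
#endif

// ---- second paste of Math.h ----
#ifndef MATH_H          // MATH_H is already defined, so this block is skipped
#define MATH_H
int num2 = 2;           // never reaches the compiler
#endif

// Same Add as in Math.cpp above.
int Add(int num1)
{
    return num1 + num2;
}

// Self-check that can be called from any main().
void CheckGuardSketch()
{
    assert(Add(1) == 3);
}
```

Without the #ifndef / #define / #endif lines, the second paste would define num2 again and the file would not compile.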


Now, let’s assume we have a Main.cpp that prints “Hello World” on screen:

Main.cpp

#include <iostream>

int main(int argc, char* argv[]) {
    std::cout << "Hello World" << std::endl;
    return 0;
}

And if we look at the sizes of the preprocessed C/C++ source files produced:

Note that Main.i (1.34 MB) is much larger than Math.i (269 bytes) even though the two source files have a similar number of lines. That’s because Main.cpp includes <iostream>, a very large header that pulls in many other headers.


Stage 2 : Compiling

After the preprocessor has done its job, the compiler takes our C++ translation units and compiles them into object files. These binary object files contain machine code, that is, instructions the computer can execute, together with metadata about the addresses of variables and functions (symbols). As we can see from Math.obj below, it contains binary data.

Math.obj

We can also ask the compiler to generate human-readable assembly listing files. The assembly code below is extracted from the generated listing file Math.asm.

Math.asm

...
PUBLIC	?Add@@YAHHH@Z                   ; <--- Symbol name for Add function
...
...
_TEXT	SEGMENT
?Add@@YAHHH@Z PROC                      ; <--- Start of Add function
...
; Line 5
	mov	eax, DWORD PTR _num1$[ebp]      ; <--- Assembly instruction
	add	eax, DWORD PTR _num2$[ebp]      ; <--- Assembly instruction
; Line 6
...
?Add@@YAHHH@Z ENDP                      ; <--- End of Add function

We can see that it contains a symbol for the Add function, and that the addition has been converted into assembly instructions. The first instruction moves num1 into the register eax, and the second adds num2 to the value in eax and leaves the result in eax. (The listing appears to come from the two-parameter version of Add used in Stage 3, where num2 is passed as an argument.)
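The odd-looking symbol ?Add@@YAHHH@Z is the decorated (mangled) name that the Microsoft compiler generates for Add. It encodes the function's signature as well as its name, which is how overloaded functions end up with distinct symbols. A minimal sketch, in which the double overload and the helper CheckSymbolSketch are hypothetical additions for illustration (the exact decorated names depend on the compiler):

```cpp
#include <cassert>

// Two overloads share the source-level name Add. The compiler gives each one
// a different decorated symbol, so the linker can tell them apart.
int Add(int num1, int num2)
{
    return num1 + num2;
}

double Add(double num1, double num2)
{
    return num1 + num2;
}

// Self-check that can be called from any main().
void CheckSymbolSketch()
{
    assert(Add(1, 2) == 3);         // resolves to the int overload
    assert(Add(1.5, 2.0) == 3.5);   // resolves to the double overload
}
```

Generating an assembly listing for this file should show two PUBLIC symbols for Add, one per overload.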

Now that all the object files have been generated, each one holds the machine code for its source file along with the symbols it defines and uses. The next stage is to link them together.


Stage 3 : Linking

The object files generated by the compiler are standalone and cannot interact with each other; it is the linker's job to connect them. In a nutshell, the linker combines all object files and libraries and creates an executable file.

To get a better understanding of how the linker works, let’s start with a simple example. Assume that we have an Add function defined in Math.h, which takes two integer parameters and returns their sum. (Of course, in real life we wouldn’t write code this way; this is just an example to show how the compiler and linker work.)

Math.h

int Add(int num1, int num2)
{
	return num1 + num2;
}

And we call the function in Main.cpp as below:

Main.cpp

#include <iostream>

int main(int argc, char* argv[]) {
    std::cout << Add(1, 2) << std::endl;
    return 0;
}

When we compile the code, we get compilation error C3861 saying that the identifier ‘Add’ was not found. This is expected, because Main.cpp has no idea what Add is.

One way to fix this is to simply copy the function signature (its declaration) into Main.cpp, to tell the compiler that Add is a function that takes two int parameters and returns an int value.

#include <iostream>

int Add(int num1, int num2);    // <---- Add function signature  

int main(int argc, char* argv[]) {
    std::cout << Add(1, 2) << std::endl;
    return 0;
}

Now the compilation succeeds.

Next, let’s try to build it.

Note that we get linking error LNK2019 telling us that there is an unresolved external symbol named ?Add@@YAHHH@Z, which is our Add function. This result is expected: we only provided the function signature, so the linker doesn’t know where to find the function itself. The linker needs to know where the function definition is located.

Now let’s include Math.h instead, which contains the function definition, and build again.

#include <iostream>
#include "Math.h"    // <---- Add this line

int main(int argc, char* argv[]) {
    std::cout << Add(1, 2) << std::endl;
    return 0;
}

This time the build succeeds, and an executable file (HelloWorld.exe) is generated.
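As noted earlier, real projects rarely put a function definition in a header. The usual layout keeps only the declaration in Math.h and moves the definition into Math.cpp, so the compiler is satisfied by the declaration and the linker finds the definition in Math.obj. The sketch below places the three pieces in one block so that it compiles on its own; the file split shown in the comments and the helpers ComputeSum and CheckSplitSketch are assumptions for illustration.

```cpp
#include <cassert>

// ---- Math.h : declaration only ----
int Add(int num1, int num2);

// ---- Main.cpp : sees only the declaration, which is enough to compile ----
int ComputeSum()            // hypothetical stand-in for the call made in main
{
    return Add(1, 2);
}

// ---- Math.cpp : the definition that the linker resolves the call to ----
int Add(int num1, int num2)
{
    return num1 + num2;
}

// Self-check that can be called from any main().
void CheckSplitSketch()
{
    assert(ComputeSum() == 3);
}
```

With this layout, Math.h can be included from several source files without each of them carrying its own copy of the Add definition.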