C++ and Python are two of the most commonly used programming languages within the developer community. While each of these languages has its own devout community trying to convince you that one is better than the other, you might be surprised to find out that there is a lot of commonality between the two headline languages.
In this article, I’ll break down and explain the details of what happens when you compile a C++ program using the GNU Compiler Collection (more commonly known as GCC), then switch over to Python and look at what happens when the Python interpreter is called in the CPython implementation. By the end, you’ll hopefully have a better understanding of what it takes to actually compile and run your “hello world” program in both of these languages!
Compiled vs Interpreted
Before looking at some examples, it’s important to understand the fundamental difference between a compiled and interpreted language. C++ is compiled, so there are a couple of steps that happen before you can actually run the code written in a file. Python, on the other hand, is interpreted…kind of. The first step of running a Python script involves compiling it to what’s known as bytecode, but unlike C++, Python can also be run interactively.
First, a little background about what it takes to compile something like C++. This entire process is completed in 4 main steps: Preprocessing, Compiling, Assembling, and Linking.
In the preprocessing step, your code is simplified and minimized before being sent to the compiler. This includes removing comments, expanding macros, prepending the include files, and checking conditional compilation flags like #ifdef. There are a lot of preprocessor and compiler directives (things like #define), and you can find these in the ISO C Standard, but most preprocessors throw in a few extras for added flavor. If you want GCC’s full list, you can find it here. Here’s an overly complicated bit of C code before and after the preprocessor makes this substitution.
What are these leading #’s for? I found the answer in the preprocessing section of the man page for GCC, specifically for the -P option (which I haven’t used here):
-P Inhibit generation of linemarkers in the output from the
preprocessor. This might be useful when running the
preprocessor on something that is not C code, and will be
sent to a program which might be confused by the linemarkers.
These are linemarkers (in the form # linenum sourcefile flags). These linemarkers are sent to the compiler so that error messages can tell you what line of your original code the error was on. (full documentation on this here)
In the compilation step, the compiler takes the preprocessor output and compiles it to assembly. The assembly used will vary based on the processor platform of your machine, but usually it’s a variant of x86 or a RISC architecture like ARM or MIPS. GCC introduces a lot of ‘junk’ in assembly code, and if you want to see what I’m talking about, go look at the code it generated from the C++ snippet above; it’s here. I’ve excluded it from this article because it’s 40 lines long…and all it’s doing is allocating memory and returning. This has to do with the optimizations and methods used to generate more complicated assembly, so for a snippet this small, someone with a little assembly knowledge could often write tighter assembly by hand and skip C++ altogether. The trade-off is the time it takes to actually write good assembly!
This step is also where the C++ language is parsed. Since C++ isn’t a context-free language, the compiler enforces the language (simply put) by a set of rules. If you end up with compiler errors (like a missing semicolon), it’s the compiler saying “Hey! You’re not following my rules!” Oftentimes, when one rule is broken (as in the case of the missing semicolon), it leads to cascading errors with trickle-down effects. This is why you’ve probably been told to start at the top of the compiler’s list of errors when debugging.
The output of the compiler is just another programming language, in this case plaintext assembly. In the assembly step, the assembler converts these plaintext opcodes, memory addresses, control bits, expressions, etc. into their numeric equivalents, assembling them into what is known as object code. GAS, the GNU Assembler as it’s known, performs these operations in one pass, unlike many other modern assemblers. We can read the symbols inside this binary object file using the nm tool:
> nm assembled.o
0000000000000000 T main
This tells us that ‘main’, the function name, is in the text section and has the value 0 (its memory address). This memory address is really only an offset within the object file’s text section; the final address gets assigned at link time.
Linking is the final step in creating an executable. The linker finds the definitions of all functions and ‘links’ them to function calls within other functions, system calls, and external libraries where necessary. The final step of the linker is to include code to set up the entry and exit of the program; this sets up the stack with environment variables and command line arguments. Essentially, it puts all of the pieces of the puzzle together and spits out a binary executable. We can also read the symbols inside the compiled binary using nm, but this time we’ll see these entry and exit symbols.
> nm linked
000000000000037c r __abi_tag
0000000000004010 B __bss_start
0000000000004010 b completed.0
0000000000004000 D __data_start
0000000000004000 W data_start
0000000000001070 t deregister_tm_clones
00000000000010e0 t __do_global_dtors_aux
0000000000003df8 d __do_global_dtors_aux_fini_array_entry
0000000000004008 D __dso_handle
0000000000003e00 d _DYNAMIC
0000000000004010 D _edata
0000000000004018 B _end
00000000000011b8 T _fini
0000000000001120 t frame_dummy
0000000000003df0 d __frame_dummy_init_array_entry
000000000000212c r __FRAME_END__
0000000000003fc0 d _GLOBAL_OFFSET_TABLE_
0000000000002004 r __GNU_EH_FRAME_HDR
0000000000001000 t _init
0000000000003df8 d __init_array_end
0000000000003df0 d __init_array_start
0000000000002000 R _IO_stdin_used
00000000000011b0 T __libc_csu_fini
0000000000001140 T __libc_csu_init
0000000000001129 T main ## <<<<<< main is here!
00000000000010a0 t register_tm_clones
0000000000001040 T _start
0000000000004010 D __TMC_END__
Let’s talk about Python. In the eyes of the user, the Python process is simple (one of the reasons the language took off in the computer science community); under the hood, however, there’s a lot going on. The Python execution process varies based on the implementation, and yes, there are many! In this article, I’ll walk through the CPython implementation, which is the official implementation that you can download from python.org. CPython also contains the official docs and PEP standards for the language, all of which are open source.
Python isn’t entirely an interpreted language; some compiling happens to prevent extra computation the next time you run the same code. That said, Python can be run as an interpreted language, so call it what you will, but Python is typically referred to as interpreted. The Python execution process can be broken down into 5 major steps: Lexing and Parsing, Forming Abstract Syntax Trees, Compiling, Assembly, and Execution.
1. Lexing and Parsing
The first step in running any Python script is to run a lexer that converts the script into what are known as tokens. These tokens are based on rules set in the Python language; an empty list initializer like [], for example, likely lexes to the same tokens everywhere it appears, since it serves the same function. The token stream then goes to the parser, which forms what is known as a “concrete syntax tree”. We can actually view this syntax tree, although it’s kind of nonsensical to us. Let’s look at an example.
I’ve written a script that lets you convert these expressions to a readable format. We run this and it starts to make a little more sense!
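Backing up to the lexing step for a moment: you don’t need a custom script to see the token stream itself, since CPython ships its lexer as the standard tokenize module. A small sketch:

```python
import io
import tokenize

# Run CPython's own lexer over a one-line script and name each token it emits.
source = "x = [1, 2]"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
```

You’ll see a NAME token for x, OP tokens for =, [, and ], and NUMBER tokens for the literals, followed by NEWLINE and ENDMARKER.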
2. Forming Abstract Syntax Trees
Now we have a concrete syntax tree, but that doesn’t give you enough information to execute the code written. The concrete syntax tree can tell you that you have an open parenthesis, a variable, and a close parenthesis, but you don’t know if that is part of an if statement, a function call, or a function definition. That’s where the abstract syntax tree comes in!
Converting the above expression into an asinine function, we get the following piece of code, which can be visualized as an abstract syntax tree using a package called “instaviz”.
From the abstract syntax tree, Python has enough information to compile this into something that the CPU can execute (and yes, compiling is the right word here. It’s compiling the abstract syntax tree. For Python. Let that sink in). An important question to ask here is “what are we compiling to?” The answer is rather complicated. In short, the abstract syntax tree is being compiled to a control flow graph. A control flow graph is a simple representation of a program where each vertex is a basic block of code and each edge represents a branch, whether it be a function call, if statement, or jump. In this case, the basic blocks of code are made up of instructions called “bytecode”, so named because they were originally a single byte long. Since Python 3.6, however, bytecode instructions have been 2 bytes long.
We can view the disassembled bytecode in instaviz like before:
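If you don’t have instaviz handy, the standard dis module produces a similar disassembly on its own (a sketch; opcode names differ between Python versions):

```python
import dis

def square(x):
    return x * x

# Disassemble square's code object into human-readable bytecode;
# each line shows a byte offset, an opcode name, and its argument.
dis.dis(square)
```

On recent CPython versions you’ll see something like a load of x (twice), a BINARY opcode for the multiply, and a return instruction.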
The structure of the control flow graph output from the compiler is what is known as frame_block’s. These blocks contain things like the list of instructions in the current block, the addresses of the previous and next blocks, whether the block has a return code, the instruction offset, the depth of the stack when the block was entered, etc. The assembler performs a depth-first search of this control flow graph and merges the blocks into a single bytecode sequence. This bytecode is then sent to an optimizer that, well, optimizes the bytecode. For example, if you have a = 1 + 2, it may optimize the bytecode to a = 3, reducing the overall computation for the program.
Once these code objects have been created, they’re ready for execution: the final step of the Python process. Python uses the idea of stack frames, a data structure that is passed between functions during execution. Frames are generated based on the code objects created in the previous step. These frames include information about global and local variables, function arguments, function names, etc. They are passed through the program in order of control flow, with the first frame including argv and argc, the command line arguments, and subsequent frames including function arguments and local variables.
Python, like C++, is a language with features that one may never fully understand. Bjarne Stroustrup, the creator of C++, in a talk about the language itself, says that ‘If someone claims to have the perfect programming language, he is either a fool or a salesman or both.’ I firmly believe this to be true, and by extension that there are use cases for different languages. My goal here isn’t to convince you that one is better than the other, but to show you some of the differences and give you a better idea of what’s going on when you brainlessly call python on the terminal to compile and run the program you’ve just spent the last 3 hours working on, only to get a compiler error.
Each of the steps in both of these language processes is incredibly complex and involved and has years of development driving the end result. Even here, I’ve left out steps that are critical to the compilation and execution process, but in truth there is just not enough space to discuss both of these languages in enough detail to do them justice.
The amazing thing about the open source movement, however, is that if you find any part of these processes interesting, you’re not only encouraged to read more about it, but to contribute to it! Both the GNU Compiler Collection and CPython are open source and have some great resources to help you understand the source code and start contributing!