Beka Modebadze | LLVM Infrastructure and Rust

LLVM is the engine behind many programming languages. It's used by C, C++, Rust, Go, Swift, and others. This log is all about LLVM, and I'll explore the following topics:

What is LLVM Infrastructure

LLVM Infrastructure is a collection of modular and reusable compiler and toolchain technologies. The LLVM umbrella contains several sub-projects, like Clang (a C/C++/Objective-C compiler), the LLDB debugger, libc++, etc. Decades of research and improvement have created an amazing tool for building languages and compilers, one that many modern technologies take advantage of.

The LLVM project started in 2000 at the University of Illinois at Urbana-Champaign, by Vikram Adve and Chris Lattner. LLVM's name was initially an acronym for Low-Level Virtual Machine. Even though modern LLVM Infrastructure has nothing to do with virtual machines, the name remained unchanged; the expansion itself was later dropped, and LLVM became the full name of the project.

The LLVM project was released under an open-source license. Later, Lattner was hired by Apple, and the whole team was assigned to work on the LLVM system for several uses within Apple products. LLVM kept expanding its features and grew into an umbrella project that combines LLVM IR, the LLVM debugger, the LLVM API C++ library, and more.

The LLVM compiler framework is a modular and reusable framework that uses LLVM infrastructure to provide end-to-end compilation of code. It is used to build, optimize, sanitize, and produce intermediate representation (IR) or binary machine code (BC). LLVM allows the translation of programming-language code into the intermediate representation, which can then be converted into binary code for any given hardware architecture. The IR language is independent of both the source and target languages.

LLVM is used in multiple ways. The main way is as a compiler framework: you provide what's called the front end, which translates source code into IR; the back end then converts that IR into machine-dependent assembly language for the target platform, which is assembled and linked into binary code.

LLVM's intermediate representation (IR) is a low-level programming language similar to assembly. But unlike assembly, it provides type annotations and a more human-friendly syntax. Also, IR uses an infinite set of registers instead of a predefined number of registers. We will explore IR's internal workings in a later segment.

How LLVM Works

As mentioned already, LLVM is a modular and reusable compiler framework that supports multiple front ends and back ends. The lifecycle of a program consists of writing source code and then compiling it into binary code for execution. Our point of interest is the compiler stage. When source code is compiled into binary, it goes through several steps in sequence.

The compilation process can be divided into three parts: front end, optimization, and back end (Fig. 1-a). The compilation process starts at the front end. First, the pre-processor organizes the source code. This is the phase when external libraries are expanded into the target source code. In Rust, this directive is expressed by the use statement; the similar directive in C and C++ is the #include statement.

Figure 1-a. Program's Lifecycle

Second is parsing. At this stage the code is checked for syntactic errors and the Abstract Syntax Tree (AST) is built. The third step of front-end compilation is IR generation. Here the compiler converts the AST into the intermediate representation (IR) and outputs the result.
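To make the AST step concrete, here is a hypothetical, drastically simplified sketch of what an AST for arithmetic expressions could look like in Rust. This is an illustration only; rustc's real AST (in the rustc_ast crate) is far richer.

```rust
// A hypothetical, minimal AST for arithmetic expressions.
// A parser would build such a tree from source text.
#[derive(Debug)]
enum Expr {
    Num(u32),
    Mul(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
}

// Walk the tree and compute its value (a later stage would
// instead lower each node to IR instructions).
fn eval(e: &Expr) -> u32 {
    match e {
        Expr::Num(n) => *n,
        Expr::Mul(a, b) => eval(a) * eval(b),
        Expr::Sub(a, b) => eval(a) - eval(b),
    }
}

fn main() {
    // The tree a parser might build from the text "4 * (6 - 2)".
    let tree = Expr::Mul(
        Box::new(Expr::Num(4)),
        Box::new(Expr::Sub(Box::new(Expr::Num(6)), Box::new(Expr::Num(2)))),
    );
    println!("{}", eval(&tree)); // prints 16
}
```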

At the optimization stage, the compiler performs various transformations and sanitization on the program. This improves performance, secures the program by preventing various bugs, and performs some utility runs. Later we'll explore IR and look at some optimization passes that come with compilers.

After the program is optimized, it goes into the back-end stage. Here, the compiler back end converts IR into target-specific assembly code. The assembler then converts the target-specific assembly code into target-specific machine code. And finally, the linker combines multiple machine-code files into a single image, what we refer to as an executable.

There are several language-specific front ends. We have clang for C and C++, Go has gollvm, and Rust has its own compiler called rustc. Similarly, after LLVM IR's optimizer passes, code is handed to architecture-specific back ends, like x86, ARM, and MIPS.

The part between the front end and back end, called the optimizer, is where the magic of code optimization and sanitization happens. This is done through a series of what are called passes. Passes run one after another, and there can be any number of them. Passes fall into two groups: analysis and transformation. An analysis pass inspects IR to check program properties, while a transformation pass rewrites IR to optimize the program.
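To make the idea of a transformation pass concrete, here is a toy "constant folding" pass sketched in Rust over a made-up instruction list. This is purely illustrative and hypothetical; real LLVM passes are C++ classes that operate on in-memory IR.

```rust
// A hypothetical, drastically simplified "transformation pass":
// it folds a multiplication of two known constants into one constant.
#[derive(Debug, Clone, PartialEq)]
enum Inst {
    Const(u32),        // %n = <constant>
    Mul(usize, usize), // %n = mul %a, %b (operands are register indices)
}

fn const_fold(prog: &[Inst]) -> Vec<Inst> {
    let mut out: Vec<Inst> = Vec::new();
    for inst in prog {
        match inst {
            Inst::Mul(a, b) => {
                // If both operands are already constants, fold them.
                if let (Inst::Const(x), Inst::Const(y)) = (&out[*a], &out[*b]) {
                    out.push(Inst::Const(x * y));
                } else {
                    out.push(inst.clone());
                }
            }
            other => out.push(other.clone()),
        }
    }
    out
}

fn main() {
    // %0 = 6; %1 = 7; %2 = mul %0, %1  --folds to-->  %2 = 42
    let prog = vec![Inst::Const(6), Inst::Const(7), Inst::Mul(0, 1)];
    println!("{:?}", const_fold(&prog));
}
```

An analysis pass, by contrast, would walk the same list and only report facts (e.g. "register %2 is a compile-time constant") without rewriting anything.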

When source code is converted into LLVM IR, it can take one of three formats (Figure 1-b). The in-memory format is a binary representation suitable for front ends during the compilation process. The other two formats are bitcode and assembly. Bitcode is a binary on-disk format, suitable for fast loading as it's more memory-efficient. Assembly is a textual format for human readability.

Figure 1-b. LLVM IR Formats

You may wonder why to use IR/BC instead of native assembly and native binary. The reason is that native assembly has a one-to-one binding to the platform's machine code. It depends on the target's architecture; for example, a program's assembly for x86 and its assembly for ARM will be different. Native assembly is turned into native binary via an assembler, a feature that LLVM also includes. Here is the extensive list of LLVM passes available out of the box. But that's not all of it: you can implement your own passes to sanitize or optimize source code.

LLVM's Program Structure

LLVM IR, just like any other system, has its own program structure (Fig. 2-a). The top-level container is a module, which corresponds to each translation unit of the front-end compiler. A module can consist of one or many functions. Each function has one or many basic blocks, which contain instructions. An instruction is a single line, and there are multiple instructions available in the IR language.

You can get the .ll (IR assembly) file from your Rust source code by running the following command in the terminal:

$ rustc someprogram.rs --emit=llvm-ir -o somename.ll

Here we tell the rustc compiler to give us the llvm-ir form (this is done by the flag --emit=llvm-ir) of someprogram.rs and output it (using the option -o) as somename.ll. Similarly, you can emit bitcode by using the --emit=llvm-bc flag.

The actual program structure in LLVM IR consists of hierarchical containers. At the top level you have the module. It corresponds to each translation unit of the front-end compiler. Modules may be combined with the LLVM linker, which merges function definitions. The function is the next-level container; it includes the function signature and its basic blocks. A basic block is a set of instructions that are executed sequentially. Basic blocks have one entry and one exit, and the first block is called the entry block. Finally, IR has instructions. An instruction is a single executable line (Figure 2-a).
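The hierarchy is easy to see in a hand-written skeleton of an .ll file. The snippet below is a simplified illustration (not compiler output), with each container level annotated:

```llvm
; Module: one .ll file corresponds to one translation unit
@greeting = global [6 x i8] c"hello\00"   ; a global variable (@ prefix)

; Function: a signature plus one or more basic blocks
define i32 @double(i32 %x) {
entry:                                    ; Basic Block: the entry block
  %result = add i32 %x, %x                ; Instruction: a single line
  ret i32 %result                         ; every block ends with a terminator
}
```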

Figure 2-a. IR Program Structure

In the IR file you'll encounter two kinds of variables: local variables, indicated with the % symbol, and global variables, indicated with the @ symbol. LLVM IR can use an infinite number of temporary registers, instead of a predefined number of registers as native assemblies do. IR's registers are identified by integer numbers, like 1, 2, 3, ... N. For example, %2 = load i32, i32* %x means that the value stored at the address of the local variable x is loaded into the temporary register 2.

Another advantage of LLVM IR is that it utilizes what's called Static Single Assignment (SSA) form. This enables various optimizations on the code. Instead of regular variables that can be reassigned multiple times, SSA variables can only be assigned once. This lets compilers do more efficient register allocation, because it's easy to approximate the number of live variables at any given point. It can also detect unused variables and save programs from unnecessary pointer allocation.

For example, the code x = x * x in SSA form becomes x_2 := x_1 * x_1. If we get into a loop where the variable constantly gets a new value, SSA uses what are called phi nodes. A phi merges the variables into the final value that will be returned/outputted from the branch or loop. We'll look at concrete examples and go over IR syntax in the next segment of this log.
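For instance, a hand-written countdown loop in SSA form could use a phi node like this (a simplified illustration, not rustc output):

```llvm
loop:                                  ; preds = %entry, %loop
  ; %n is 10 when we arrive from %entry, and the decremented
  ; value when we come back around from %loop itself
  %n = phi i32 [ 10, %entry ], [ %n.next, %loop ]
  %n.next = sub i32 %n, 1
  %done = icmp eq i32 %n.next, 0
  br i1 %done, label %exit, label %loop
```

The phi picks its value based on which predecessor block control flow came from, so each SSA name is still assigned exactly once.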

LLVM and Rustc

The official guide to rustc development lists all the reasons why they are using LLVM, which I'm quoting directly:

  • We don't have to write a whole compiler backend. This reduces the implementation and maintenance burden.
  • We benefit from the large suite of advanced optimizations that the LLVM project has been collecting.
  • We can automatically compile Rust to any of the platforms for which LLVM has support. For example, as soon as LLVM added support for wasm, voila! rustc, clang, and a bunch of other languages were able to compile to wasm! (Well, there was some extra stuff to be done, but we were 90% there anyway).
  • We and other compiler projects benefit from each other. For example, when the Spectre and Meltdown security vulnerabilities were discovered, only LLVM needed to be patched.

Now that we've become familiar with what LLVM IR is, what it does, and why so many compilers are built on it, it's time to get our hands dirty and explore IR's internals using a simple Rust program called simple1.rs:

//simple1.rs

fn square(x: u32) -> u32 {
    x * x
}

fn factorial(mut n: u32) -> u32 {
    let mut acc = 1u32;
    while n > 0 {
        acc *= n;
        n = n - 1;
    }
    acc
}

fn main() {
    let s = square(4);
    let f = factorial(s);
}

Our program has three functions: the main function, which is the entry point of our program; the square function, which returns the square of a given number; and factorial, which calculates the factorial of the supplied integer. We output the LLVM IR file by invoking the following command in the terminal:

$ rustc simple1.rs --emit=llvm-ir

Now we should have simple1.ll in the same folder as our source code. Let's open the .ll file in a text editor and start exploring. At the beginning of the file you'll find a bunch of metadata and other information and directives related to the program, but we are interested in delving into how our Rust functions are translated to IR, and we'll start by looking at the square function.

Search for the line which has the comment simple1::square, and next to it you'll find the LLVM IR container of the given function, which looks like this:

; simple1::square
; Function Attrs: nonlazybind uwtable
define internal i32 @_ZN7simple16square17h74fdb775aec225a0E(i32 %x) unnamed_addr #1 {
start:
  %0 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %x, i32 %x)
  %_4.0 = extractvalue { i32, i1 } %0, 0
  %_4.1 = extractvalue { i32, i1 } %0, 1
  %1 = call i1 @llvm.expect.i1(i1 %_4.1, i1 false)
  br i1 %1, label %panic, label %bb1

bb1:                                              ; preds = %start
  ret i32 %_4.0

panic:                                            ; preds = %start
; call core::panicking::panic
  call void @_ZN4core9panicking5panic17h50b51d19800453c0E([0 x i8]* ...
}

The IR starts by defining a function with a unique mangled name (which also includes the original function name) that returns an i32 value and takes an i32 value as a parameter. start is the label for the entry point of the function. In the first instruction, LLVM IR calls its intrinsic function @llvm.umul.with.overflow.i32. This function takes two i32 arguments (in this case both are %x), performs unsigned multiplication, and outputs a tuple-like result, where the first value is an i32 and the second value is an i1. The result is assigned to an unnamed temporary, %0.

You may ask a couple of questions here. First, why does the square function take and return an i32 value instead of the u32 value that was declared in the source code? LLVM IR doesn't provide separate data types for signed and unsigned values. Instead, it performs operations specific to the value type: signed operations for signed values, unsigned operations for unsigned values. If our square function had taken and returned a signed i32 in the Rust source, the IR would have called @llvm.smul.with.overflow.i32 instead, which is the intrinsic function for signed multiplication.

The next question you may ask is: why does multiplying an integer by an integer return a tuple? If you look at the types of the output tuple, the second item is an i1 (a single-bit value), which is used for binary/boolean values. The name of the called function suggests that it does unsigned multiplication with overflow. If overflow is detected, that is, the multiplied value exceeds the maximum unsigned value 32 bits can hold, it sets the i1 to 1, i.e. true.
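Rust exposes the same value-plus-flag pair at the language level: u32::overflowing_mul returns a (value, overflowed) tuple, which is essentially what @llvm.umul.with.overflow.i32 produces under the hood:

```rust
fn main() {
    // Mirrors the { i32, i1 } pair returned by @llvm.umul.with.overflow.i32.
    let (v, overflowed) = 4u32.overflowing_mul(4);
    println!("{} {}", v, overflowed); // 16 false

    // u32::MAX * 2 wraps around, and the overflow flag is set.
    let (v, overflowed) = u32::MAX.overflowing_mul(2);
    println!("{} {}", v, overflowed); // 4294967294 true
}
```

In the IR for square, it is this overflow flag that decides whether the function panics or returns normally.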

In the two following instructions, the IR unpacks the result of the multiplication. The unnamed temporary %_4.0 takes the multiplied value, and the boolean result is assigned to %_4.1. Next, the IR calls another intrinsic function, @llvm.expect.i1(i1 %_4.1, i1 false), where it expects the value assigned to the unnamed temporary %_4.1 to be false. The program assigns the evaluated result to another unnamed temporary, %1. Unnamed temporaries are unsigned numeric values with the prefix % or @ (for example, %1), created by the IR. Named values are represented as character strings with the same prefixes and are introduced from the source code (for example, %acc).

After that, the IR performs branching. The br instruction checks whether the value placed in temporary %1 is true: if so, it jumps to %panic, otherwise to %bb1. The two names mentioned above are labels for basic blocks. The panic block calls Rust's panic function (if you have programmed in Rust, you should know what panic does), while bb1 returns the value stored in temporary %_4.0 from the function square.

Now, let's look at our factorial function, as there are more interesting things happening:

; simple1::factorial
; Function Attrs: nonlazybind uwtable
define internal i32 @_ZN7simple19factorial17hceebae2a8a73b808E(i32 %0) unnamed_addr #1 {
start:
  %acc = alloca i32, align 4
  %n = alloca i32, align 4
  store i32 %0, i32* %n, align 4
  store i32 1, i32* %acc, align 4
  br label %bb1

bb1:                                              ; preds = %bb4, %start
  %_3 = load i32, i32* %n, align 4
  %_2 = icmp ugt i32 %_3, 0
  br i1 %_2, label %bb2, label %bb5

bb5:                                              ; preds = %bb1
  %1 = load i32, i32* %acc, align 4
  ret i32 %1

bb2:                                              ; preds = %bb1
  %_4 = load i32, i32* %n, align 4
  %2 = load i32, i32* %acc, align 4
  %3 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %2, i32 %_4)
  %_5.0 = extractvalue { i32, i1 } %3, 0
  %_5.1 = extractvalue { i32, i1 } %3, 1
  %4 = call i1 @llvm.expect.i1(i1 %_5.1, i1 false)
  br i1 %4, label %panic, label %bb3

bb3:                                              ; preds = %bb2
  store i32 %_5.0, i32* %acc, align 4
  %_6 = load i32, i32* %n, align 4
  %5 = call { i32, i1 } @llvm.usub.with.overflow.i32(i32 %_6, i32 1)
  %_7.0 = extractvalue { i32, i1 } %5, 0
  %_7.1 = extractvalue { i32, i1 } %5, 1
  %6 = call i1 @llvm.expect.i1(i1 %_7.1, i1 false)
  br i1 %6, label %panic1, label %bb4

panic:                                            ; preds = %bb2
; call core::panicking::panic
  call void @_ZN4core9panicking5panic17h50b51d19800453c0E ...

bb4:                                              ; preds = %bb3
  store i32 %_7.0, i32* %n, align 4
  br label %bb1

panic1:                                           ; preds = %bb3
; call core::panicking::panic
  call void @_ZN4core9panicking5panic17h50b51d19800453c0E  ...

Recall that, as an argument, factorial takes a mutable u32 value, which in the IR is defined as the unnamed temporary %0, while the definition of the function square takes the named variable %x as an argument, just like in the source code. This is because square never modifies its value, while factorial requires it to be mutable and later makes a copy of the argument inside the function's scope. At the start of the function, two named variables are defined: %n and %acc. Each of them gets 32 bits of memory allocated (with the alloca instruction) with a data alignment of 4 bytes. You can read more about memory alignment here.

After the allocation, the store instruction stores the content lying in temporary %0 at the address of %n; at the address of %acc it stores the constant value 1. Finally, it jumps to the entry point of the while loop. The bb1 label marks entry into the while condition. The IR loads the value stored at the address of %n into the temporary %_3 and compares it with the constant value 0. icmp checks whether the value stored at %_3 is unsigned greater than (ugt) 0 and assigns the result to another temporary, %_2. After that, it evaluates the contents of %_2: if true, the branch jumps to label %bb2; if false, to label %bb5.

The bb5 label marks the exit block of the while loop. Here the IR loads the value from the address of %acc into the temporary %1 and returns it from the function. The label bb2 is the entry into the body of the while loop. The body of the while loop is split into two halves: the first is responsible for the multiplication (label bb2), and the second for decrementing n, i.e. the subtraction (label bb3). The operation n = n - 1 is done by calling another intrinsic function, @llvm.usub.with.overflow.i32, for unsigned subtraction. The rest are the same operations we saw in the square function (Fig 4-a).

Figure 4-a. Control Flow Graph & Basic Blocks

You may have noticed the SSA form in action here. For every assignment instruction, instead of reassigning the value back to n and acc as in the source code, the IR introduces a new unnamed temporary for each instruction. The numbering of unnamed temporaries is incremented within a function each time one is spawned, starting from 0.

LLVM IR uses two's-complement binary numbers. It has data types for integers like i8, i16, i32, and i64, and floating-point types like half, float, and double, which serve as operand types. If you see an asterisk after an integer type, that means we are dealing with a pointer (example: i32*). A value can also be a constant. Instructions also come in several kinds, for example arithmetic operators, binary comparisons, and data stores and loads. There is much more that IR performs, and you can find the full list at the link at the end of this section.

The class hierarchy of LLVM is a little bit complicated. The base object is what's called a Value, which is used to represent an SSA register. All the other classes inherit from it: the User class, followed by the Instruction, Operator, and Constant classes, and so on (Fig. 4-b). Here is the full tree of class inheritance inside LLVM.

Figure 4-b. LLVM Class Hierarchy

The LLVM language has many high-level structures, types, constants, functions, codegen instructions, etc. If you are interested in learning more about it, you can scroll through the reference manual.

Summary

As we reviewed in this article, LLVM IR has many use cases and allows us to analyze and optimize source code through its passes. Knowing the IR language itself will help us write our own passes and build projects around it for debugging, testing, and optimizing. Currently, LLVM doesn't have an official Rust API; it's mainly used through the C++ library. However, some user-created crates are available on crates.io. There is a Rust binding to LLVM's C API, llvm-sys, and two other, more Rusty APIs that use LLVM: inkwell and llvm-ir. And finally, if you want to learn how to write an LLVM pass, you should start here.

