Whenever you compile Rust code, the compiler goes through different passes and, at the end, generates binary code for the target processor. By default, it uses LLVM as the backend to generate the binary code, but other backends exist, like Cranelift and GCC. This post is about how it's possible for one compiler to use different backends to generate binaries, in particular GCC.
Before going into details, we need to describe how compilers actually work. They read source code and convert it internally into a format they can manipulate, commonly called an Abstract Syntax Tree (shortened to "AST").
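To give an idea of what "a tree the compiler can manipulate" means, here is a toy illustration (this is not rustc's real AST, just a made-up mini expression language):

// Toy AST for a tiny expression language, only for illustration purposes.
enum Expr {
    Number(i64),
    Variable(String),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

fn main() {
    // The expression `1 + 2 * x` becomes a tree the compiler can walk and transform.
    let _tree = Expr::Add(
        Box::new(Expr::Number(1)),
        Box::new(Expr::Mul(
            Box::new(Expr::Number(2)),
            Box::new(Expr::Variable("x".to_string())),
        )),
    );
}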
However, compilers go through multiple passes, and each pass often has its own AST. Let's take a short and very incomplete example with the Rust compiler passes. We have 4 steps (again, this is simplified!):

1. Parsing: the source code is turned into an AST.
2. Type-checking (and linting): the compiler checks that types are used consistently.
3. Borrow-checking: the compiler checks that references and ownership are used correctly.
4. Code generation: the checked code is handed to a backend which produces machine code.

Each step generates a new AST with new information if no error was encountered and provides it to the next pass.
Little side-note: If enough people are interested in this topic, I can write a (much) longer explanation of these passes.
So now that we have a high-level idea of Rust compiler passes, what is the difference between "front-end" and "back-end" exactly?
We consider the front-end to be the part handling (high-level non-exhaustive list) code parsing, linting, type-checking and borrow-checking (steps 1 to 3). When all this is done, it means the code is valid and needs to be translated into the target processor's instruction set. To do so, we call LLVM/GCC, which will translate the Rust compiler AST into assembly code (step 4).
The Rust compiler backends are the bridge between the Rust compiler AST and the actual code generator. They receive the AST and call the LLVM/GCC/... API which will in turn run their passes, optimize and finally generate the assembly code.
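Purely as an illustration (none of these types exist in rustc), the role of such a backend boils down to something like this:

// Conceptual only: the front-end hands over an already checked representation,
// and the backend's whole job is to drive a code generator (LLVM, GCC,
// Cranelift, ...) to turn it into machine code for the target.
struct CheckedCode;         // stand-in for the Rust compiler's internal representation
struct ObjectCode(Vec<u8>); // stand-in for the generated machine code

trait BackendBridge {
    fn codegen(&self, input: &CheckedCode) -> ObjectCode;
}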
LLVM being much more recent than GCC (2003 vs 1987), a lot of older processors are not supported and never will be. So if you want to write a Rust program on an old platform like the Dreamcast, you have no choice but to either write your own backend or use the GCC backend (or the gccrs front-end once it's ready).
For the readers interested in doing so, there is a guide explaining how to build Rust programs for Dreamcast here.
The GCC backend is different from gccrs, which is a front-end for GCC written in C++. gccrs doesn't reuse rustc's front-end, meaning its developers need to reimplement parsing, type-checking, linting, borrow-checking, compilation errors, etc.
On the other hand, the GCC backend (the crate name is rustc_codegen_gcc) is just "yet another codegen backend" of the Rust compiler, like LLVM or Cranelift, only meant to generate the binary from the Rust compiler input. It's a bridge between the Rust compiler's AST and the codegen API.
On that note: GCC doesn't provide a nice library giving access to its internals (unlike LLVM), so we have to use libgccjit. Despite the "jit" part of its name ("just in time", meaning compiling sub-parts of the code on the fly, only when needed, for performance reasons; often used in scripting languages like JavaScript), it can also be used as "aot" ("ahead of time", meaning you compile everything at once, allowing you to spend more time on optimization). To do so we use bindings, which are split in two parts:
- gccjit-sys, which redeclares the C items we need.
- gccjit, which provides a nice API over gccjit-sys.

If you want to write your own compiler and use GCC as the codegen, you can do it thanks to libgccjit. And if you write it in Rust, you can even use the Rust bindings.
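To give a rough idea of what using these bindings looks like, here is a small sketch that builds and compiles a trivial function ahead of time. It's written from memory, so take the exact method names with a grain of salt; they may differ slightly from the gccjit crate's real API.

use gccjit::{BinaryOp, Context, FunctionType, OutputKind, ToRValue};

fn main() {
    // Create a compilation context: the entry point of the libgccjit API.
    let context = Context::default();

    // Declare `int square(int x)`.
    let int_type = context.new_type::<i32>();
    let param = context.new_parameter(None, int_type, "x");
    let function = context.new_function(
        None,
        FunctionType::Exported,
        int_type,
        &[param],
        "square",
        false,
    );

    // Fill its body: `return x * x;`.
    let block = function.new_block("entry");
    let x = function.get_param(0).to_rvalue();
    let result = context.new_binary_op(None, BinaryOp::Mult, int_type, x, x);
    block.end_with_return(None, result);

    // "aot" usage: write an object file instead of JIT-compiling in memory.
    context.compile_to_file(OutputKind::ObjectFile, "square.o");
}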
Rustc has a crate named rustc_codegen_ssa which provides an abstract interface that a backend needs to implement through traits like CodegenBackend, ExtraBackendMethods or ConstCodegenMethods (all of which we'll see below).
The full list is available here.
One last thing you need to write in your backend:
#[no_mangle]
pub fn __rustc_codegen_backend() -> Box<dyn CodegenBackend> {
    // This is the entrypoint.
}
This is the function that will be called by rustc to run your backend.
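Side note: once built as a dynamic library, such a backend can be loaded by a nightly rustc with the unstable -Z codegen-backend flag, roughly like this (the path is of course just an example):

rustc -Zcodegen-backend=/path/to/librustc_codegen_gcc.so main.rs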
Let's take an example: how the GCC backend creates a constant string. I picked this one because it's small enough to showcase how things work while not being too much information to digest at once.
In the ConstCodegenMethods trait, there is a const_str method. This is the method we will implement to declare a constant string.
So the method implementation so far looks like this:
impl<'gcc, 'tcx> ConstCodegenMethods for CodegenCx<'gcc, 'tcx> {
    /// Returns the pointer to the string and its length.
    fn const_str(&self, s: &str) -> (RValue<'gcc>, RValue<'gcc>) {
        // Call GCC API to declare this string.
    }
}
We need to pause here to give some extra explanations: CodegenCx is the type on which most rustc_codegen_ssa traits are implemented. It is created in each ExtraBackendMethods::compile_codegen_unit call and passed down from there to generate the code for this module. You can consider it a cache: it keeps the list of items declared, like functions, types, globals, etc., but also information such as the "boolean type", the "i8 type" and equivalents, so we don't need to recompute them every time we need them.
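To make this "cache" role more concrete, here is a hypothetical, heavily simplified sketch; the field names are illustrative, not the real ones from rustc_codegen_gcc:

use std::cell::RefCell;
use std::collections::HashMap;

use gccjit::{LValue, Type};

// Hypothetical, stripped-down version of such a context: a bunch of
// precomputed types plus caches for already-declared items.
struct CodegenCx<'gcc> {
    // Commonly used types, computed once when the context is created.
    bool_type: Type<'gcc>,
    i8_type: Type<'gcc>,
    // Constant strings already emitted, so the same literal is only declared once.
    const_str_cache: RefCell<HashMap<String, LValue<'gcc>>>,
}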
Ok so now let's actually implement it. We have a few things to do:

- Check whether this string was already declared by looking into the const string cache.
- If not, call the GCC API to create the constant string, give it a name, declare it as a private global and add it to the cache.
- Cast the pointer from the Rust string type (*const u8) into the C type (*const char).
- Return the pointer and the string length.

Let's translate it into code with a lot of comments to help understand what's going on:
fn const_str(&self, s: &str) -> (RValue<'gcc>, RValue<'gcc>) {
    // We get the const string cache.
    let mut const_str_cache = self.const_str_cache.borrow_mut();
    // We get the address of the stored string from the cache or, if it's not
    // there yet, we create it, add it to the cache and return its address.
    let str_global = const_str_cache.get(s).copied().unwrap_or_else(|| {
        // We call the `GCC` API to create a new const string.
        let string = self.context.new_string_literal(s);
        // We name the const.
        let sym = self.generate_local_symbol_name("str");
        // We declare it.
        let global = self.declare_private_global(&sym, self.val_ty(string));
        // All done, we can add it to the cache and return it.
        const_str_cache.insert(s.to_owned(), global);
        global
    });
    let len = s.len();
    // We cast the pointer to the target architecture string pointer type.
    let cs = self.const_ptrcast(
        str_global.get_address(None),
        self.type_ptr_to(self.layout_of(self.tcx.types.str_).gcc_type(self)),
    );
    // And we return the pointer and its length.
    (cs, self.const_usize(len as _))
}
But the codegen backends can also add more information to the underlying binary code generator. For example, in Rust, we use references a lot. A reference is basically a pointer that cannot be NULL. We need to give this information as well!
In both GCC and LLVM, you can add attributes to a lot of items, like arguments of functions. So every time we see an argument behind a reference, we add the nonnull() attribute.
Let's show an example with this Rust function:
fn t(a: &i32) -> i32 {
    *a
}
The C equivalent looks like this:
int t(int *a) {
    if (!a) {
        return -1;
    }
    return *a;
}
Compiled with the -O3 option, it generates this assembly:
t:
        test    rdi, rdi                ; Check if `a` is 0
        je      .L5                     ; If `a` is 0, we jump to `.L5`
        mov     eax, DWORD PTR [rdi]    ; We store the value of `*a` into the `eax` register
        ret                             ; We exit the function
.L5:
        mov     eax, -1                 ; We store `-1` into the `eax` register
        ret                             ; We exit
However, the Rust compiler knows that a can never be NULL, so the codegen adds __attribute__((nonnull(1))) on the function:

__attribute__((nonnull(1)))
int t(int *a) {
    if (!a) {
        return -1;
    }
    return *a;
}
Which generates this assembly:
t:
        mov     eax, DWORD PTR [rdi]
        ret
Since the codegen knows that the if (!a) condition can never be true, why keep it around?
And this is just one example of the extra information/optimizations we add in the Rust backends. It doesn't even cover in the slightest the monstrous amount of optimizations the code generators themselves do. If you want more examples of such optimizations, I strongly recommend reading the "Advent of Compiler Optimizations" blog posts written by Matt Godbolt (the developer of godbolt.org, another priceless tool).
So now you know what a Rust codegen backend is and why the GCC backend is an interesting thing to have, while also having learned about some of the optimizations done behind developers' backs. :)
This blog post was made thanks to my cat hanging on to it.
