Adder
In this assignment you'll implement a compiler for a small language called
Adder, that supports 32-bit integers and three operations – add1
, sub1
,
and negate
. There is no starter code for this assignment; you'll do it from
scratch based on the instructions here.
Setup
The necessary tools are installed on ieng6.ucsd.edu
, and you should have a
course-specific account set up that you can access via
ETS. You can use standard remote access
tools like ssh
or Visual Studio Code plugins like
sshfs
to match your preferred working style.
You may also want to work on your own computer. You will need to install rust
and cargo
:
https://www.rust-lang.org/tools/install
You may also (depending on your system) need to install
nasm
. On OSX I used brew install nasm
; on other
systems your package manager of choice likely has a version. On Windows you
should use Windows Subsystem for Linux (WSL)
The assignments assume that your computer can build and run x86-64-bit
binaries. This is true of most (but not all) mass-market Windows and Linux
laptops. Newer Macs use a different ARM architecture, but can also run legacy
x86-64-bit binaries, so those are fine as well. You should ensure that
whatever you do to build your compiler also runs on ieng6
, which is our
standard environment for testing these.
The first few sections of the Rust Book walk you through installing Rust, as well. We'll assume you've gone through the “Programming a Guessing Game” chapter of that book before you go on, so writing and running a Rust program isn't too weird to you.
The Adder Language
In each of the next several assignments, we'll introduce a language that we'll implement. We'll start small, and build up features incrementally. We're starting with Adder, which has just a few features –numbers and three operations.
There are a few pieces that go into defining a language for us to compile:
- A description of the concrete syntax – the text the programmer writes.
- A description of the abstract syntax – how to express what the programmer wrote in a data structure our compiler uses.
- A description of the behavior of the abstract syntax, so our compiler knows what the code it generates should do.
Concrete Syntax
The concrete syntax of Adder is:
<expr> :=
| <number>
| (add1 <expr>)
| (sub1 <expr>)
| (negate <expr>)
Abstract Syntax
The abstract syntax of Adder is a Rust datatype, and corresponds nearly
one-to-one with the concrete syntax. We'll show just the parts for add1
and
sub1
in this tutorial, and leave it up to you to include negate
to get
practice.
enum Expr {
Num(i32),
Add1(Box<Expr>),
Sub1(Box<Expr>)
}
The Box
type is necessary in Rust to create recursive types like these (see
Enabling Recursive Types with
Boxes).
If you're familiar with C, it serves roughly the same role as introducing a
pointer type in a struct field to allow recursive fields in structs.
The reason this is necessary is that the Rust compiler calculates a size and
tracks the contents of each field in each variant of the enum
. Since an
Expr
could be an Add1
that contains another Add1
that contains another
Add1
... and so on, there's no way to calculate the size of an enum variant like
Add1(Expr)
(What error do you get if you try?)
Values of the Box
type always have the size of a single reference (probably
represented as a 64-bit address on the systems we're using). The address will
refer to an Expr
that has already been allocated somewhere. Box
is one of
several smart pointer types whose memory are carefully, and automatically,
memory-managed by Rust.
Semantics
A ``semantics'' describes the languages' behavior without giving all of the assembly code for each instruction.
An Adder program always evaluates to a single i32.
- Numbers evaluate to
themselves (so a program just consisting of
Num(5)
should evaluate to the integer5
). add1
andsub1
expressions perform addition or subtraction by one on their argument.negate
produces the result of the argument multiplied by-1
There are several examples further down to make this concrete.
Here are some examples of Adder programs:
Example 1
Concrete Syntax
(add1 (sub1 5))
Abstract Syntax
Add1(Sub1(Number(5)))
Result
5
Example 2
Concrete Syntax
4
Abstract Syntax
Number(4)
Result
4
Example 3
Concrete Syntax
(negate (add1 3))
Abstract Syntax
Negate(Add1(Number(3)))
Result
-4
Implementing a Compiler for Adder
We're going to start by just compiling numbers, so we can see how all the infrastructure works. We won't give starter code for this so that you see how to build this up from scratch.
By just compiling numbers, we mean that we'll write a compiler for a language where the “program” file just contains a single number, rather than full-fledged programs with function definitions, loops, and so on. (We'll get there!)
Creating a Project
First, make a new project with
$ cargo new adder
This creates a new directory, called adder
, set up to be a Rust project.
The main entry point is in src/main.rs
, which is where we'll develop the
compiler. There's also a file called Cargo.toml
that we'll use in a little
bit, and a few other directories related to building that we won't be too
concerned with in this assignment.
The Runner
We'll start by just focusing on numbers.
It's useful to set up the goal of our compiler, which we'll come back to repeatedly in this course:
“Compiling” an expression means generating assembly instructions that evaluate it and leave the answer in the
rax
register.
Given this, before writing the compiler, it's useful to spend some time thinking about how we'll run these assembly programs we're going to generate. That is, what commands do we run at the command line in order to get from our soon-to-be-generated assembly to a running program?
We're going to use a little Rust program to kick things off. It will look like
this; you can put this into a file called runtime/start.rs
:
#[link(name = "our_code")]
extern "C" {
// The \x01 here is an undocumented feature of LLVM (which Rust uses) that ensures
// it does not add an underscore in front of the name, which happens on OSX
// Courtesy of Max New
// (https://maxsnew.com/teaching/eecs-483-fa22/hw_adder_assignment.html)
#[link_name = "\x01our_code_starts_here"]
fn our_code_starts_here() -> i64;
}
fn main() {
let i : i64 = unsafe {
our_code_starts_here()
};
println!("{i}");
}
This file says:
- We're expecting there to be a precompiled file called
libour_code
that we can load and link against (we'll make it in a few steps) - That file should define a global function called
our_code_starts_here
. It takes no arguments and returns a 64-bit integer. - For the
main
of this Rust program, we will call theour_code_starts_here
function in anunsafe
block. It has to be in anunsafe
block because Rust doesn't and cannot check thatour_code_starts_here
actually takes no arguments and returns an integer; it's trusting us, the programmer, to ensure that, which isunsafe
from its point of view. Theunsafe
block lets us do some kinds of operations that would otherwise by compile errors in Rust. - Then, print the result.
Let's next see how to build a libour_code
file out of some x86-64 assembly
that will work with this file. Here's a simple assembly program that has a
global label for our_code_starts_here
that has a “function body” that
returns the value 31
:
section .text
global our_code_starts_here
our_code_starts_here:
mov rax, 31
ret
Put this into a file called test/31.s
if you like, to test things out (you
should now have a runtime/
and a test/
directory that you created).
We can create a standalone binary program that combines these with these
commands (substitute macho64
for elf64
on OSX):
$ nasm -f elf64 test/31.s -o runtime/our_code.o
$ ar rcs runtime/libour_code.a runtime/our_code.o
$ ls runtime
libour_code.a our_code.o start.rs
$ rustc -L runtime/ runtime/start.rs -o test/31.run
$ ./test/31.run
31
The first command assembles the assembly code to an object file. The basic
work there is generating the machine instructions for each assembly
instruction, and enough information about labels like our_code_starts_here
to
do later linking. The ar
command takes this object file and puts it in a
standard format for library linking used by #[link
in Rust. Then
rustc
combines that .a
file and start.rs
into a single executable binary
that we named 31.run
.
We haven't written a compiler yet, but we do know how to go from files
containing assembly code to runnable binaries with the help of nasm
and
rustc
. Our next task is going to be writing a program that generates assembly
files like these.
Generating Assembly
Let's revisit our definition of compiling:
“Compiling” an expression means generating assembly instructions that evaluate it and leave the answer in the
rax
register.
Since, for now, our programs are going to be single expressions (in fact just
single numbers), this means that for a program like “5”, we want to generate
assembly instructions that put the constant 5
into rax
.
Let's write a Rust function that does that, with a simple main
function that
shows it working on a single hardcoded input; this goes in src/main.rs
and is
the start of our compiler:
/// Compile a source program into a string of x86-64 assembly
fn compile(program: String) -> String {
let num = program.trim().parse::<i32>().unwrap();
return format!("mov rax, {}", num);
}
fn main() {
let program = "37";
let compiled = compile(String::from(program));
println!("{}", compiled);
}
You can compile and run this with cargo run
:
$ cargo run
Compiling adder v0.1.0 (/Users/joe/src/adder)
mov rax, 37
Really all I did here was look up the documentation in Rust about converting a
string to an integer and template the number into a mov
command. The input
37
is hardcoded, and to use the output like we did above, we'd need to
copy-paste the mov
command into a larger assembly file with
our_code_starts_here
, and so on.
Here's a more sophisticated main
that takes two command-line arguments: a
source file to read and a target file to write the resulting assembly to. It
also puts the generated command into the template we designed for our generated
assembly:
use std::env;
use std::fs::File;
use std::io::prelude::*;
fn main() -> std::io::Result<()> {
let args: Vec<String> = env::args().collect();
let in_name = &args[1];
let out_name = &args[2];
let mut in_file = File::open(in_name)?;
let mut in_contents = String::new();
in_file.read_to_string(&mut in_contents)?;
let result = compile(in_contents);
let asm_program = format!("
section .text
global our_code_starts_here
our_code_starts_here:
{}
ret
", result);
let mut out_file = File::create(out_name)?;
out_file.write_all(asm_program.as_bytes())?;
Ok(())
}
Since this now expects files rather than hardcoded input, let's make a test
file in test/37.snek
that just contains 37
as contents. Then we'll read the
“program” (still just a number) from 37.snek
and store the resulting
assembly in 37.s
. (snek
is a meme-y spelling of snake, which is a theme of
the languages in this course.)
Then we can run our compiler with these command line arguments:
$ cat test/37.snek
37
$ cargo run -- test/37.snek test/37.s
$ cat test/37.s
section .text
global our_code_starts_here
our_code_starts_here:
mov rax, 37
ret
Then we can use the same sequence of commands from before to run the program:
$ nasm -f elf64 test/37.s -o runtime/our_code.o
$ ar rcs runtime/libour_code.a runtime/our_code.o
$ rustc -L runtime/ runtime/start.rs -o test/37.run
$ ./test/37.run
37
We're close to saying we've credibly built a “compiler”, in that we've taken some source program and gone all the way to a generated binary.
The next steps will be to clean up the clumsiness of running 3 post-processing
commands (nasm
, ar
, and rustc
), and then adding some nontrivial
functionality.
Cleaning up with a Makefile
There are a lot of thing we could do to try and assemble and run the program,
and we'll discuss some later in the course. For now, we'll simply tidy up our
workflow by creating a Makefile that runs through the compile-assemble-link
steps for us. Put these rules into a file called Makefile
in the root of the
repository (use elf64
on Linux):
test/%.s: test/%.snek src/main.rs
cargo run -- $< test/$*.s
test/%.run: test/%.s runtime/start.rs
nasm -f macho64 test/$*.s -o runtime/our_code.o
ar rcs runtime/libour_code.a runtime/our_code.o
rustc -L runtime/ runtime/start.rs -o test/$*.run
(Note that make
requires tabs not spaces, but we can only use spaces on the
website, so please replace the four spaces indentation with tab characters when
you copy it -- or copy it from Github.)
And then you can run just make test/<file>.run
to do the build steps:
$ make test/37.run
cargo run -- test/37.snek test/37.s
Finished dev [unoptimized + debuginfo] target(s) in 0.07s
Running `target/x86_64-apple-darwin/debug/adder test/37.snek test/37.s`
nasm -f macho64 test/37.s -o runtime/our_code.o
ar rcs runtime/libour_code.a runtime/our_code.o
rustc -L runtime/ runtime/start.rs -o test/37.run
The cargo run
command will re-run if the .snek
file or the compiler
(src/main.rs
) change, and the assemble-and-link commands will re-run if the
assembly (.s
file) or the runtime (runtime/start.rs
) change.
Adding Nontrivial Language Features
The overall syntax for the Adder language admits many more features than just
numbers. With the definition of Adder above, we can have programs like (add1 (sub1 6))
, for example. There can be any numbers of layers of nesting of the
parentheses, which means we need to think about parsing a little bit more.
We're going to design our syntax carefully to avoid thinking too much about parsing, though. The parenthesized style of Adder is a subset of what's called s-expressions. The Scheme and Lisp family of languages are some of the more famous examples of languages built in s-expressions, but recent ones like WebAssembly also use this syntax, and it's a common choice for language development to simplify decision around syntax, which can become quite tricky and won't be our focus in this course.
A grammar for s-expressions looks something like:
s-exp := number
| symbol
| string
| ( <s-exp>* )
That is, an s-expression is either a number, symbol (think of symbol like an identifier name), string, or a parenthesized sequence of s-expressions. Here are some s-expressions:
(1 2 3)
(a (b c d) e "f" "g")
(hash-table ("a" 100) ("b" 1000) ("c" 37"))
(define (factorial n)
(if (== n 1)
1
(factorial (* n (- n 1)))))
(class Point
(int x)
(int y))
(add1 (sub1 37))
One attractive feature of s-expressions is that most programming languages have
libraries for parsing them. There are several crates available for parsing
s-expressions in Rust. You're free to pick another one if you like it, but I'm
going to use sexp because its type definitions
work pretty well with pattern-matching and I find that helpful.
(lexpr also looks interesting, but the
Value
type is really clumsy with pattern matching so it's not great for this
tutorial.)
We can add ths package to our project by adding it to Cargo.toml
, which was
created when you used cargo new
. Make it so your Cargo.toml
looks like this:
[package]
name = "adder"
version = "0.1.0"
edition = "2021"
[dependencies]
sexp = "1.1.4"
Then you can run cargo build
and you should see stuff related to the sexp
crate be downloaded.
We can then use it in our program like this:
use sexp::*;
use sexp::Atom::*;
Then, a function call like this can turn a string into a Sexp
:
let sexp = parse("(add1 (sub1 (add1 73)))").unwrap()
(As a reminder, the .unwrap()
is our way of telling Rust that we are trusting
this parsing to succeed, and we'll panic!
and stop the program if the parse
doesn't succeed. We will talk about giving better error messages in these cases
later.)
Our goal, though, is to use a datatype that we design for our expressions, which we introduced as:
enum Expr {
Num(i32),
Add1(Box<Expr>),
Sub1(Box<Expr>),
}
So we should next write a function that takes Sexp
s and turns them into
Expr
s (or gives an error if we give an s-expression that doesn't match the
grammar of Adder). Here's a function that will do the trick:
fn parse_expr(s: &Sexp) -> Expr {
match s {
Sexp::Atom(I(n)) => Expr::Num(i32::try_from(*n).unwrap()),
Sexp::List(vec) => {
match &vec[..] {
[Sexp::Atom(S(op)), e] if op == "add1" => Expr::Add1(Box::new(parse_expr(e))),
[Sexp::Atom(S(op)), e] if op == "sub1" => Expr::Sub1(Box::new(parse_expr(e))),
_ => panic!("parse error"),
}
},
_ => panic!("parse error"),
}
}
(A Rust note – the parse_expr
function takes a reference to Sexp
(the type
&Sexp
) which means parse_expr
will have read-only, borrowed access to some
Sexp
that was allocated and stored somewhere else.)
This uses Rust pattern matching to match the specific cases we care about
for Adder – plain numbers and lists of s-expressions. In the case of lists, we
match on two the two specific cases that look like add1
or sub1
followed by
some other s-expression. In those cases, we recursively parse, and use
Box::new
to match the signature we set up in enum Expr
.
So we've got a way to go from more structure text—s-expressions—stored in
files and produce our Expr
structure. Now we just need to go from the Expr
ASTs to generated assembly. Here's one way to do that:
fn compile_expr(e: &Expr) -> String {
match e {
Expr::Num(n) => format!("mov rax, {}", *n),
Expr::Add1(subexpr) => compile_expr(subexpr) + "\nadd rax, 1",
Expr::Sub1(subexpr) => compile_expr(subexpr) + "\nsub rax, 1",
}
}
And putting it all together in main
:
fn main() -> std::io::Result<()> {
let args: Vec<String> = env::args().collect();
let in_name = &args[1];
let out_name = &args[2];
let mut in_file = File::open(in_name)?;
let mut in_contents = String::new();
in_file.read_to_string(&mut in_contents)?;
let expr = parse_expr(&parse(&in_contents).unwrap());
let result = compile_expr(&expr);
let asm_program = format!("
section .text
global our_code_starts_here
our_code_starts_here:
{}
ret
", result);
let mut out_file = File::create(out_name)?;
out_file.write_all(asm_program.as_bytes())?;
Ok(())
}
Then we can write tests like this add.snek
:
$ cat test/add.snek
(sub1 (sub1 (add1 73)))
And run our whole compiler end-to-end:
$ make test/add.run
cargo run -- test/add.snek test/add.s
Finished dev [unoptimized + debuginfo] target(s) in 0.02s
Running `target/x86_64-apple-darwin/debug/adder test/add.snek test/add.s`
nasm -f macho64 test/add.s -o runtime/our_code.o
ar rcs runtime/libour_code.a runtime/our_code.o
rustc -L runtime/ runtime/start.rs -o test/add.run
$ cat test/add.s
section .text
global our_code_starts_here
our_code_starts_here:
mov rax, 73
add rax, 1
sub rax, 1
sub rax, 1
ret
$ ./test/add.run
72
This is, of course, a very simple language. This tutorial serves mainly to make us use all the pieces of infrastructure that we'll build on throughout the quarter:
- An assembler (
nasm
) and a Rust main program (runtime/start.rs
) to build binaries - A definition of abstract syntax (
enum Expr
) - A parser for text (
parse
from thesexp
crate) and a parser for our abstract syntax (parse_expr
) - A code generator (
compile_expr
) that generates assembly fromExpr
s
Most of our future assignments will be built from just these pieces, plus extra infrastructure added as we need it.
Your TODOs
- Do the whole tutorial above, creating the project repository as you go. Write several tests to convince yourself that things are working as expected.
- Then, add support for
negate
as described in the beginning, and write several tests fornegate
as well. - In your terminal, demonstrate your compiler working on at least 5 different
examples by using
cat
on a sourcesnek
file, then showingmake
running, usingcat
on the resulting.s
file, and then running the resulting binary. Copy this interactino into a file calledtranscript.txt
Hand in your entire repository to the assignment-1-tutorial
assignment on
Gradescope. There is no automated grading for this assignment; we want you to
practice gaining your own confidence that your solution works (and
demonstrating that to us).