Week 3: Cobra, Due Tuesday, April 25 (Open Collaboration)
(Yes, that's 3 open collaboration assignments in a row 🙂)
In this assignment you'll implement a compiler for a small language called Cobra, which extends Boa with booleans, conditionals, variable assignment, and loops.
Setup
Get the assignment at https://classroom.github.com/a/mHUq5XJd This will make a private-to-you copy of the repository hosted within the course's organization. You can also access the public starter code https://github.com/ucsd-compilers-s23/cobra-starter if you don't have or prefer not to use a Github account.
The Cobra Language
Concrete Syntax
The concrete syntax of Cobra is:
<expr> :=
| <number>
| true
| false
| input
| <identifier>
| (let (<binding>+) <expr>)
| (<op1> <expr>)
| (<op2> <expr> <expr>)
| (set! <name> <expr>)
| (if <expr> <expr> <expr>)
| (block <expr>+)
| (loop <expr>)
| (break <expr>)
<op1> := add1 | sub1 | isnum | isbool
<op2> := + | - | * | < | > | >= | <= | =
<binding> := (<identifier> <expr>)
true
and false
are literals. Names used in let
cannot have the name of
other keywords or operators (like true
or false
or let
or block
).
Numbers should be representable as a signed 63-bit number (e.g. from
-4611686018427387904 to 4611686018427387903).
Abstract Syntax
You can choose the abstract syntax you use for Cobra. We recommend something like this:
enum Op1 { Add1, Sub1, IsNum, IsBool, }
enum Op2 { Plus, Minus, Times, Equal, Greater, GreaterEqual, Less, LessEqual, }
enum Expr {
Number(i32),
Boolean(bool),
Id(String),
Let(Vec<(String, Expr)>, Box<Expr>),
UnOp(Op1, Box<Expr>),
BinOp(Op2, Box<Expr>, Box<Expr>),
If(Box<Expr>, Box<Expr>, Box<Expr>),
Loop(Box<Expr>),
Break(Box<Expr>),
Set(String, Box<Expr>),
Block(Vec<Expr>),
}
Semantics
A "semantics" describes the languages' behavior without giving all of the assembly code for each instruction.
A Cobra program always evaluates to a single integer, a single boolean, or ends
with an error. When ending with an error, it should print a message to
standard error (eprintln!
in Rust works well for this) and a non-zero exit
code (std::process::exit(N)
for nonzero N
in Rust works well for this).
input
expressions evaluate to the first command-line argument given to the program. The command-line argument can be any Cobra value: a valid number ortrue
orfalse
. If no command-line argument is provided, the value ofinput
isfalse
. When running the program the argument should be provided astrue
,false
, or a base-10 number value.- All Boa programs evaluate in the same way as before, with one
exception: if numeric operations would overflow a 63-bit integer, the program
should end in error, reporting
"overflow"
as a part of the error. - If the operators other than
=
are used on booleans, an error should be raised from the running program, and the error should contain "invalid argument". Note that this is not a compilation error, nor can it be in all cases due toinput
's type being unknown until the program starts. - The relative comparison operators like
<
and>
evaluate their arguments and then evaluate totrue
orfalse
based on the comparison result. - The equality operator
=
evaluates its arguments and compares them for equality. It should raise an error if they are not both numbers or not both booleans, and the error should contain "invalid argument" if the types differ. - Boolean expressions (
true
andfalse
) evaluate to themselves if
expressions evaluate their first expression (the condition) first. If it'sfalse
, they evaluate to the third expression (the “else” block), and to the second expression if any other value (including numbers).block
expressions evaluate the subexpressions in order, and evaluate to the result of the last expression. Blocks are mainly useful for writing sequences that includeset!
, especially in the body of a loop.set!
expressions evaluate the expression to a value, and change the value stored in the given variable to that value (e.g. variable assignment). Theset!
expression itself evaluates to the new value. If there is no surrounding let binding for the variable the identifier is considered unbound and an error should be reported.loop
andbreak
expressions work together. Loops evaluate their subexpression in an infinite loop untilbreak
is used. Break expressions evaluate their subexpression and the resulting value becomes the result of the entire loop. Typically the body of a loop is written withblock
to get a sequence of expressions in the loop body.
There are several examples further down to make this concrete.
The compiler should stop and report an error if:
- There is a binding list containing two or more bindings with the same name.
The error should contain the string
"Duplicate binding"
- An identifier is unbound (there is no surrounding let binding for it) The
error should contain the string
"Unbound variable identifier {identifier}"
(where the actual name of the variable is substituted for{identifier}
) - A
break
appears outside of any surroundingloop
. The error should contain "break" - An invalid identifier is used (it matches one of the keywords). The error should contain "keyword"
If there are multiple errors, the compiler can report any non-empty subset of them.
Here are some examples of Cobra programs.
Example 1
Concrete Syntax
(let ((x 5))
(block (set! x (+ x 1))))
Abstract Syntax Based on Our Design
Let(vec![("x".to_string(), Number(5))],
Box::new(Block(
vec![Set("x".to_string(),
Box::new(BinOp(Plus, Id("x".to_string()),
Number(1)))])))
Result
6
Example 2
(let ((a 2) (b 3) (c 0) (i 0) (j 0))
(loop
(if (< i a)
(block
(set! j 0)
(loop
(if (< j b)
(block (set! c (sub1 c)) (set! j (add1 j)))
(break c)
)
)
(set! i (add1 i))
)
(break c)
)
)
)
Result
-6
Example 3
This program calculates the factorial of the input.
(let
((i 1) (acc 1))
(loop
(if (> i input)
(break acc)
(block
(set! acc (* acc i))
(set! i (+ i 1))
)
)
)
)
Implementing a Compiler for Cobra
The starter code makes a few infrastructural suggestions. You can change these as you feel is appropriate in order to meet the specification.
Reporting Dynamic Errors
We've provided some infrastructure for reporting errors via the
snek_error
function in start.rs
. This is a function that can be called from the
generated program to report an error. for now we have it take an error code as
an argument; you might find the error code useful for deciding which error
message to print. This is also listed as an extern
in the generated
assembly startup
code.
Calculating Input
We've provided a
parse_input
stub for you to fill in to turn the command-line argument to start.rs
into a
value suitable for passing to our_code_starts_here
. As a reminder/reference,
the first argument in the x86_64 calling convention is stored in rdi
. This
means that, for example, moving rdi
into rax
is a good way to get “the
answer” for the expression input
.
Representations
In class we chose representations with 0
as a tag bit for numbers and 1
for
booleans with the values 3
for true
and 1
for false
. You do not
have to use those, though it's a great starting point and we recommend it. Your
only obligation is to match the behavior described in the specification, and if
you prefer a different way to distinguish types, you can use it. (Keep in mind,
though, that you still must generate assembly programs that have the specified
behavior!)
Running and Testing
The test format changed slightly to require a test name along with a test
file name. This is to support using the same test file with different
command line arguments. You can see several of these in the sample
tests.
Note that providing input
is optional. These also illustrate how to check for
errors.
If you want to try out a single file from the command line (and perhaps from a
debugger like gdb
or lldb
), you can still run them directly from the
command line with:
$ make tests/some-file.run
$ ./tests/some-file.run 1234
where the 1234
could be any valid command-line argument.
As a note on running all the tests, the best option is to use make test
,
which ensures that cargo build
is run first and independently before cargo test
.
Grading
As with the previous assignment, a lot of the credit you get will be based on us running autograded tests on your submission. You'll be able to see the result of some of these on while the assignment is out, but we may have more that we don't show results for until after assignments are all submitted.
We'll combine that with some amount of manual grading involving looking at your
testing and implementation strategy. You should have your own thorough test
suite (it's not unreasonable to write many dozens of tests; you probably don't
need hundreds), and you need to have recognizably implemented a compiler. For
example, you could try to calculate the answer for these programs and
generate a single mov
instruction: don't do that, it doesn't demonstrate the
learning outcomes we care about.
Any credit you lose will come with instructions for fixing similar mistakes on future assignments.
FAQ
Some of my tests fail with a No such file or directory
error
The initial version of the starter code contained an error in the testing infrastructure. If you cloned before we fixed it, you'll have to update the code. You can update the code by running:
git remote add upstream https://github.com/ucsd-compilers-s23/cobra-starter
git pull upstream main --allow-unrelated-histories
This will merge all commits from the template into your repository. Alternatively, you can also
clone https://github.com/ucsd-compilers-s23/cobra-starter and manually replace your tests/
directory.
Extension: Using Dynamic Information to Optimize
A compiler for Cobra needs to generate extra instructions to check for booleans
being used in binary operators. We could use a static type-checker to avoid
these, but at the same time, the language is fundamentally dynamic because the
compiler cannot know the type of input
until the program starts running
(which happens after it is compiled). This is the general problem that systems
for languages like JavaScript and Python face; it will get worse when we
introduce functions in the next assignment.
However, if our compiler can make use of some dynamic information, we can do better.
There are two instructive optimizations we can make with dynamic information, one for standalone programs and one at the REPL.
Eval
Add a new command-line option, -e
, for “eval”, that evaluates a program
directly after compiling it with knowledge of the command-line argument. The
usage should be:
cargo run -- -e file.snek <arg>
That is, you provide both the file and the command-line argument. When called
this way, the compiler should skip any instructions used for checking for
errors related to input
. For example, for this program, if a number is given
as the argument, we could omit all of the tag checking related to the input
argument (and since 1
is a literal, we could recover essentially the same
compilation as for Boa).
(+ 1 input)
For this program, if input
is a boolean, we should preserve that the program
throws an error as usual.
Known Variables at the REPL
Similarly, after a define
statement evaluates at the REPL, we can know that
variable's tag and use that information to compile future entries. For example,
in this REPL sequence, we define a numeric variable and use it in an operator
later. We could avoid tag checks for x
in the later use:
> (define x (+ 3 4))
> (+ x 10)
Note a pitfall here – if you allow set!
on define
d variables, their types
could change mid-expression, so there are some restrictions on when this should
be applied. Make sure to test this case.
Happy hacking!
Discussion
It's worth re-emphasizing that a static type-checker could recover a lot of
this performance, and for Cobra it's pretty straightforward to implement a
type-checker (especially for expressions that don't involve input
).
However, we'll soon introduce functions, which add a whole new layer of potential dynamism and unknown types (because of function arguments), so the same principles behind these simple cases become much more pronounced. And a language with functions and a static type system is quite different from a language with functions and no static type system.