Building a Shell - Part 2
2,134 words, published on October 15, 2018.
This is part 2 of a series I'll be writing on the creation of oursh, a modern POSIX compatible shell written in Rust. If you haven't read Part 1 and want a brief overview of shell terminology and functionality, I recommend reading it; otherwise, this will be the first post on the Rust project.
In this post I want to dive into the skeleton of the project so far. Last time we made a shell in 58 lines of Ruby. I then started the oursh project in (amazingly) the same number of lines. Now the project has grown. I've added LALRPOP for parsing the POSIX grammar, a separate parser known as docopt for handling arguments, and something starting to resemble a readline(3)-like library using termion. With everything, including the growing test suite, this is the output of cloc:
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Rust                            15            268            677           1747
Bourne Shell                     2              0              0              3
-------------------------------------------------------------------------------
SUM:                            17            268            677           1750
-------------------------------------------------------------------------------
Ok, not 50… but again, just like last time, I think you'll be surprised what 2,000 lines of Rust can get you. For example, these shell programs all currently work:
ls -la
cat README.md | wc
git add -u
true || false && true && echo reached
! true && echo not reached
{ sleep 1; date; }&
Build it to Believe
Before we begin, in order to really get what's going on, I strongly urge you to download Rust and compile oursh yourself. You can run examples from this post as we go through them. Otherwise, just use bash and skip this section.
First clone the repository from GitHub into your local directory.
git clone https://github.com/nixpulvis/oursh
Then install Rust, either by visiting their official website or by simply running brew install rustup, pacman -S rustup, etc. Once you have rustup installed, cargo gives you many commands at your disposal from inside the oursh project directory.
cargo build # Just compile
cargo run # Compile and run oursh
cargo test # Compile and run the test suite
If you're lazy (like me), here's a script to copy-paste. This, in a way, is the power of UNIX.
# macOS
brew install rustup
# Arch Linux
pacman -S rustup
git clone https://github.com/nixpulvis/oursh
cd oursh
cargo run
Depending on which version of oursh you have checked out, the shell's prompt may look different, and there are a few compilation features, but generally speaking you'll get a shell that looks like this:
oursh-0.3$
Play around with it; it's a real shell. If you want even more Rust between you and your OS, I recommend Alacritty, a GPU-accelerated terminal emulator also written in Rust.
Something New, Before a lot of Old
Before we dive into the dense topic of language, let me give you a taste of something I'm really excited about, which is only possible with some knowledge of the following material. I'm calling them shebang blocks, after the #! symbol you can use to specify an interpreter at the start of a script file. These do the same thing, but are embedded in the shell's recursive grammar; just see for yourself.
Print my cat’s name (π) from Ruby.
{#!/usr/bin/env ruby; puts Math::PI}
Same thing, just syntactic sugar.
{#!ruby; puts Math::PI}
{#!node; console.log(Math.PI)}
{#!racket; #lang racket (println pi)}
This feature is currently experimental, but working. You can try it for yourself by compiling with cargo run --features=shebang-block. These more advanced commands will also Just Work once I get to implementing background programs (#6) and environments/variables (#27).
# Save π to the ENV var $PI.
PI={#!node; console.log(Math.PI)}; echo $PI
# Start a Ruby server in the background.
{#!ruby; require 'server'; Server.start}&
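If you're curious how a block like this might be executed, here's a minimal sketch of one possible approach, not necessarily how oursh does it: pipe the block's body to the named interpreter on its standard input. The run_shebang_block helper is my own stand-in for illustration.

// Hypothetical sketch: execute a shebang block by piping its body to the
// named interpreter's stdin. One possible approach, not oursh's actual one.
use std::io::Write;
use std::process::{Command, Stdio};

fn run_shebang_block(interpreter: &str, body: &str) -> std::io::Result<()> {
    let mut child = Command::new(interpreter)
        .stdin(Stdio::piped())
        .spawn()?;
    {
        let mut stdin = child.stdin.take().expect("piped stdin");
        stdin.write_all(body.as_bytes())?;
        // stdin is dropped here, closing the pipe so the interpreter sees EOF.
    }
    child.wait()?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Roughly equivalent in spirit to `{#!ruby; puts Math::PI}`.
    run_shebang_block("ruby", "puts Math::PI")
}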
OK, now back to your regularly scheduled series on programming languages, grammar, POSIX shell scripts and Rust.
Shell Language
A script is, in general, some text which is interpreted by a program on a computer. A shell, like the one we're building (or bash, sh, …), is one such interpreter. You can pass a shell script to it and it will correctly execute the program. A POSIX shell program is typically just known as a "shell script", though other types of shell scripts exist, like fish scripts. For convenience, we'll simply refer to a POSIX compatible shell as sh.
The first implementation of sh was by Ken Thompson around 1971, which is why you may also hear this referred to as a Thompson shell. The next major development was in 1979 with the Bourne shell, also known as sh, which was a rewrite by Stephen Bourne at Bell Labs. The most common shell these days is bash, which stands for Bourne-Again shell.
Now finally, for some Rust code…
use oursh::program::basic::Program;
This is the most basic implementation of a Program within oursh. The only thing BasicProgram is able to do is run commands like ls or git status, but nothing more complex like cat $1 | wc -c, or anything new like {#!ruby; p 1}. This implementation doesn't use any fancy parsers; it simply reads a full line and splits it by spaces, passing the result directly through to exec on run.
# Run oursh using alternate (basic currently) program parsing.
cargo run -- -#
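To make that concrete, here's a minimal sketch of the same idea written against the standard library rather than oursh's actual modules: read one line, split it on whitespace, and hand it to the OS. The real BasicProgram passes the split words through to exec on run; std::process::Command, which forks and execs under the hood, is close enough for illustration.

// A minimal sketch of "basic" program parsing: read a line, split it on
// whitespace, and run it. Uses std::process::Command for illustration.
use std::io::{self, Write};
use std::process::Command;

fn main() -> io::Result<()> {
    print!("basic$ ");
    io::stdout().flush()?;

    let mut line = String::new();
    io::stdin().read_line(&mut line)?;

    // Whitespace splitting is the entire "grammar" here.
    let mut words = line.split_whitespace();
    if let Some(program) = words.next() {
        // Roughly what run/exec does for a simple command.
        Command::new(program).args(words).status()?;
    }
    Ok(())
}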
Formal Grammar Review
A formal grammar is a set of rules for creating non-terminals from terminals, where the terminals represent the set of characters (or tokens) in the language. The set of tokens in a language is known as its alphabet. A non-terminal represents some production built from underlying non-terminals and terminals, in this way allowing for recursive definitions. For example, in a toy grammar for arithmetic the digits and + are terminals, while expr is a non-terminal defined recursively as expr → expr + digit | digit.
extern crate lalrpop;
LALRPOP is a parser generator, which is essentially a framework for creating parsers for some kinds of languages. In this case, LALRPOP runs in a build phase before the compiler and generates a valid Rust parser module. The parser is defined to correctly parse the grammar described in a corresponding *.lalrpop file. To understand what this means, we need to understand what we mean by parser.
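For context, that build phase is usually wired up with a tiny build.rs script. Assuming the standard LALRPOP setup (this may not match oursh's exact build script), it looks roughly like this:

// build.rs: runs before compilation and turns every *.lalrpop file
// under src/ into a generated Rust parser module.
extern crate lalrpop;

fn main() {
    lalrpop::process_root().unwrap();
}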
A parser's job is to take in a stream of tokens (things like #, ", }, WORD, etc.) and produce an AST. Tokens are the terminals, while the grammar description in the *.lalrpop file defines the non-terminal rules.
Our AST is mainly one struct and one heavily recursive enum. The main goal of the AST is to preserve the information needed from the text to easily map to the execution semantics. For example, consider this expression from grade school: 1 + 2 * 3. We typically learn a rule called PEMDAS to remember the order of precedence, so we correctly read the previous expression as 1 + (2 * 3). Similarly, we have false && true || true, which could be parsed one of two ways:
(false && true) || true #=> true
false && (true || true) #=> false
Before we look at the grammar rules, let’s look at our AST definition.
use oursh::program::posix::ast::Program;
Internally, posix::ast::Program uses the parser module generated automatically in its parse function.
let program = Program::parse(b"false && true || true" as &[u8])?;
println!("{:#?}", program);
-----
Program(
    [
        Or(
            And(
                Simple(
                    [
                        Word(
                            "false"
                        )
                    ]
                ),
                Simple(
                    [
                        Word(
                            "true"
                        )
                    ]
                )
            ),
            Simple(
                [
                    Word(
                        "true"
                    )
                ]
            )
        )
    ]
)
As you can see, we correctly parse false && true || true as (false && true) || true, so this program will return true (or exit code 0).
The only major piece left to explain is the grammar description file. There are a lot of details to go over with the LALRPOP parser generator, but luckily it’s simple enough to understand the basics.
These are the needed rules inside posix.lalrpop to parse the above AST.
// Programs are constructed in other grammar rules...
Command: ast::Command = {
    // ...
    <cs: Command> "&&" <p: Pipeline> => {
        ast::Command::And(box cs, box p)
    },
    <cs: Command> "||" <p: Pipeline> => {
        ast::Command::Or(box cs, box p)
    },
    // ...
    Pipeline => <>,
}

// Other command rules are defined below.
Simple: ast::Command = {
    "WORD"+ => ast::Command::Simple(<>.iter().map(|w| {
        ast::Word(w.to_string())
    }).collect())
}
Command and Simple are non-terminal productions that return (when parsed) ast::Command types. As you can see, Command is recursively referenced in the && and || rules, allowing for ... && ... || ....
Words are created by the lexer, a tool designed to consume the input text and iterate over chunks called tokens. Tokens, as already mentioned, are the terminals of the language. WORD, &&, and || from the above grammar rules are our terminal tokens. A WORD is generated by reading a character that is allowed to start a word, then consuming input until the word is considered finished. This allows for words like foo!, despite ! also being a token.
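As a toy illustration of that rule, scanning a WORD might look like the following. This scan_word helper is made up for this post, not oursh's actual lexer, and its set of word-ending characters is an assumption.

// Hypothetical sketch of WORD scanning: start on any character that may
// begin a word, then consume until a terminating character.
fn scan_word(input: &str) -> Option<(&str, &str)> {
    // Characters that end a WORD in this toy version (an assumption).
    let ends_word = |c: char| c.is_whitespace() || "&|;{}".contains(c);

    let first = input.chars().next()?;
    if ends_word(first) {
        return None;
    }

    // `!` doesn't end a word here, so `foo!` scans as a single WORD.
    let end = input.find(ends_word).unwrap_or(input.len());
    Some((&input[..end], &input[end..]))
}

fn main() {
    assert_eq!(scan_word("foo! bar"), Some(("foo!", " bar")));
    assert_eq!(scan_word("&& echo hi"), None);
}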
Deep Dive: Parser Combinators
If parsers are your thing, you may also be interested in a parser combinator I wrote for students in NEU's introductory computer science class. It's written in Racket, and comes with a fully baked JSON parser.
Testing
Just a quick note on testing, because it's an important part of this project. Rust's macro system does a great job of letting developers write clean and fast tests. Here are a few integration tests from this project's suite:
assert_oursh!("head README.md -n 1", "# oursh\n");
assert_oursh!("false && echo 1", "");
assert_oursh!("false || echo 1", "1\n");
assert_oursh!("{ echo pi; echo e; }", "pi\ne\n");
assert_oursh!("{#!ruby; puts 1}", "1\n");
Next Steps
- Add a proper job runtime, with status, backgrounding. I’m still not 100% sure how to best implement this (#6).
- More complete builtin support (#24).
- The runtime can’t handle nested pipelines (#12).
- The parser needs to support IO redirection syntax (#26).
- Programs need variables, environments, and functions (#27, #43).
- The lexer needs to handle quotes correctly (#28).
- Plenty of repl issues.
In the next post in this series we'll look at some of the other features of this Rust project, including the CLI argument parser docopt, the terminal control library termion, nix, and more.