Building a Shell - Part 2
2,134 words, published on October 15, 2018.
This is part 2 of a series I'll be writing on the creation of oursh, a modern POSIX compatible shell written in Rust. If you haven't read Part 1 and want a brief overview of shell terminology and functionality, I recommend reading it; otherwise, this will be the first post on the Rust project.
In this post I want to dive into the skeleton of the project so far. Last time we made a shell in 58 lines of Ruby. I then started the oursh project in (amazingly) the same number of lines. Now the project has grown. I've added LALRPOP for parsing the POSIX grammar, a separate parser known as docopt for handling arguments, and something starting to resemble a readline(3)-like library using termion. With everything, including the growing test suite, this is the output of cloc:
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Rust                            15            268            677           1747
Bourne Shell                     2              0              0              3
-------------------------------------------------------------------------------
SUM:                            17            268            677           1750
-------------------------------------------------------------------------------
Ok, not 50… but again, just like last time, I think you'll be surprised what 2,000 lines of Rust can get you. For example, these shell programs all currently work:
ls -la
cat README.md | wc
git add -u
true || false && true && echo reached
! true && echo not reached
{ sleep 1; date; }&
Build it to Believe
Before we begin, in order to really get what's going on, I strongly urge you to download Rust and compile oursh yourself. You can run examples from this post as we go through them. Otherwise, just use bash and skip this section.
First clone the repository from GitHub into your local directory.
git clone https://github.com/nixpulvis/oursh
Then install Rust, either by visiting their official website or by simply running brew install rustup, pacman -S rustup, etc. Once you have rustup installed, cargo gives you many commands at your disposal from inside the oursh project directory.
cargo build # Just compile
cargo run # Compile and run oursh
cargo test # Compile and run the test suite
If you're lazy (like me), here's a script to copy-paste. This, in a way, is the power of UNIX.
# macOS
brew install rustup
# Arch Linux
pacman -S rustup
git clone https://github.com/nixpulvis/oursh
cd oursh
cargo run
Depending on which version of oursh you have checked out, the shell's prompt may look different, and there are a few compilation features, but generally speaking you'll get a shell that looks like this:
oursh-0.3$
Play around with it; it's a real shell. If you want even more Rust between you and your OS, I recommend Alacritty, a GPU-accelerated terminal emulator also written in Rust.
Something New, Before a lot of Old
Before we dive into the dense topic of language, let me give you a taste of something I'm really excited about, which is only possible with some knowledge of the following material. I'm calling them shebang blocks, after the #! symbol you can use to specify an interpreter at the start of a script file. These do the same thing, but are embedded in the shell's recursive grammar; just see for yourself.
Print my cat’s name (π) from Ruby.
{#!/usr/bin/env ruby; puts Math::PI}
Same thing, just syntactic sugar.
{#!ruby; puts Math::PI}
{#!node; console.log(Math.PI)}
{#!racket; #lang racket (println pi)}
This feature is currently experimental, but working. You can try it for yourself by compiling with cargo run --features=shebang-block. These more advanced commands will also Just Work once I get to implementing background programs (#6) and environments/variables (#27).
# Save π to the ENV var $PI.
PI={#!node; console.log(Math.PI)}; echo $PI
# Start a Ruby server in the background.
{#!ruby; require 'server'; Server.start}&
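If you're curious how a block like this might be executed, here's a minimal sketch of one possible approach, not necessarily how oursh does it: pipe the block's body to the named interpreter on its standard input. The run_shebang_block helper is my own stand-in for illustration.

// Hypothetical sketch: execute a shebang block by piping its body to the
// named interpreter's stdin. One possible approach, not oursh's actual one.
use std::io::Write;
use std::process::{Command, Stdio};

fn run_shebang_block(interpreter: &str, body: &str) -> std::io::Result<()> {
    let mut child = Command::new(interpreter)
        .stdin(Stdio::piped())
        .spawn()?;
    {
        let mut stdin = child.stdin.take().expect("piped stdin");
        stdin.write_all(body.as_bytes())?;
        // stdin is dropped here, closing the pipe so the interpreter sees EOF.
    }
    child.wait()?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Roughly equivalent in spirit to `{#!ruby; puts Math::PI}`.
    run_shebang_block("ruby", "puts Math::PI")
}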
OK, now back to your regularly scheduled series on programming languages, grammar, POSIX shell scripts and Rust.
Shell Language
A script is, in general, some text which is interpreted by a program on a computer. A shell, like the one we're building (or bash, sh, …), is one such interpreter. You can pass a shell script to it and it will correctly execute the program. A POSIX shell program is typically just known as a "shell script", though other types of shell scripts exist, like fish scripts. For convenience, we'll simply refer to a POSIX compatible shell as sh.
The first implementation of sh was by Ken Thompson around 1971, which is why you may also hear this referred to as a Thompson shell. The next major development was in 1979 with the Bourne shell, also known as sh, which was a rewrite by Stephen Bourne at Bell Labs. The most common shell these days is bash, which stands for Bourne-Again shell.
Now finally, for some Rust code…
use oursh::program::basic::Program;
This is the most basic implementation of a Program within oursh. The only thing BasicProgram is able to do is run commands like ls or git status, but nothing more complex like cat $1 | wc -c, or anything new like {#!ruby; p 1}. This implementation doesn't use any fancy parsers; it simply reads a full line and splits it by spaces, passing the result directly through to exec on run.
# Run oursh using alternate (basic currently) program parsing.
cargo run -- -#
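To make that concrete, here's a minimal sketch of the same idea written against the standard library rather than oursh's actual modules: read one line, split it on whitespace, and hand it to the OS. The real BasicProgram passes the split words through to exec on run; std::process::Command, which forks and execs under the hood, is close enough for illustration.

// A minimal sketch of "basic" program parsing: read a line, split it on
// whitespace, and run it. Uses std::process::Command for illustration.
use std::io::{self, Write};
use std::process::Command;

fn main() -> io::Result<()> {
    print!("basic$ ");
    io::stdout().flush()?;

    let mut line = String::new();
    io::stdin().read_line(&mut line)?;

    // Whitespace splitting is the entire "grammar" here.
    let mut words = line.split_whitespace();
    if let Some(program) = words.next() {
        // Roughly what run/exec does for a simple command.
        Command::new(program).args(words).status()?;
    }
    Ok(())
}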
Formal Grammar Review
A formal grammar is a set of rules for creating non-terminals from terminals, where the terminals represent the set of characters (or tokens) in the language. The set of tokens in a language is known as its alphabet. A non-terminal represents some production built from underlying non-terminals and terminals, in this way allowing for recursive definitions. For example, in a toy grammar for arithmetic the digits and + are terminals, while expr is a non-terminal defined recursively as expr → expr + digit | digit.
extern crate lalrpop;
LALRPOP is a parser generator, which is essentially a framework for creating parsers for some kinds of languages. In this case, LALRPOP runs in a build phase before the compiler and generates a valid Rust parser module. The parser is defined to correctly parse the grammar described in a corresponding *.lalrpop file. To understand what this means, we need to understand what we mean by parser.
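For context, that build phase is usually wired up with a tiny build.rs script. Assuming the standard LALRPOP setup (this may not match oursh's exact build script), it looks roughly like this:

// build.rs: runs before compilation and turns every *.lalrpop file
// under src/ into a generated Rust parser module.
extern crate lalrpop;

fn main() {
    lalrpop::process_root().unwrap();
}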
A parser's job is to take in a stream of tokens (things like #, ", }, WORD, etc.) and produce an AST. Tokens are the terminals, while the grammar description in the *.lalrpop file defines the non-terminal rules.
Our AST is mainly one struct and one heavily recursive enum. The main goal of the AST is to preserve the information needed from the text to easily map to the execution semantics. For example, consider this expression from grade school: 1 + 2 * 3. We typically learn a rule called PEMDAS to remember the order of precedence, so we correctly read the previous expression as 1 + (2 * 3). Similarly, we have false && true || true, which could be parsed one of two ways:
(false && true) || true #=> true
false && (true || true) #=> false
Before we look at the grammar rules, let’s look at our AST definition.
use oursh::program::posix::ast::Program;
Internally, posix::ast::Program uses the parser module generated automatically in its parse function.
let program = Program::parse(b"false && true || true" as &[u8])?;
println!("{:#?}", program);
-----
Program(
    [
        Or(
            And(
                Simple(
                    [
                        Word(
                            "false"
                        )
                    ]
                ),
                Simple(
                    [
                        Word(
                            "true"
                        )
                    ]
                )
            ),
            Simple(
                [
                    Word(
                        "true"
                    )
                ]
            )
        )
    ]
)
As you can see, we correctly parse false && true || true as (false && true) || true, so this program will return true (or exit code 0).
The only major piece left to explain is the grammar description file. There are a lot of details to go over with the LALRPOP parser generator, but luckily it’s simple enough to understand the basics.
These are the needed rules inside posix.lalrpop to parse the above AST.
// Programs are constructed in other grammar rules...
Command: ast::Command = {
    // ...
    <cs: Command> "&&" <p: Pipeline> => {
        ast::Command::And(box cs, box p)
    },
    <cs: Command> "||" <p: Pipeline> => {
        ast::Command::Or(box cs, box p)
    },
    // ...
    Pipeline => <>,
}

// Other command rules are defined below.
Simple: ast::Command = {
    "WORD"+ => ast::Command::Simple(<>.iter().map(|w| {
        ast::Word(w.to_string())
    }).collect())
}
Command and Simple are non-terminal productions that return (when parsed) ast::Command types. As you can see, Command is recursively referenced in the && and || rules, allowing for ... && ... || ....
Words are created by the lexer, a tool designed to consume the input text and iterate over chunks called tokens. Tokens, as already mentioned, are the terminals of the language. WORD, &&, and || from the above grammar rules are our terminal tokens. A WORD is generated by reading a character that is allowed to start a word, then consuming input until the word is considered finished. This allows for words like foo!, despite ! also being a token.
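As a toy illustration of that rule, scanning a WORD might look like the following. This scan_word helper is made up for this post, not oursh's actual lexer, and its set of word-ending characters is an assumption.

// Hypothetical sketch of WORD scanning: start on any character that may
// begin a word, then consume until a terminating character.
fn scan_word(input: &str) -> Option<(&str, &str)> {
    // Characters that end a WORD in this toy version (an assumption).
    let ends_word = |c: char| c.is_whitespace() || "&|;{}".contains(c);

    let first = input.chars().next()?;
    if ends_word(first) {
        return None;
    }

    // `!` doesn't end a word here, so `foo!` scans as a single WORD.
    let end = input.find(ends_word).unwrap_or(input.len());
    Some((&input[..end], &input[end..]))
}

fn main() {
    assert_eq!(scan_word("foo! bar"), Some(("foo!", " bar")));
    assert_eq!(scan_word("&& echo hi"), None);
}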
Deep Dive: Parser Combinators
If parsers are your thing, you may also be interested in a parser combinator I wrote for students in NEU's introductory computer science class. It's written in Racket, and comes with a fully baked JSON parser.
Testing
Just a quick note on testing, because it's an important part of this project. Rust's macro system does a great job of letting developers write clean and fast tests. Here are a few integration tests from this project's suite:
assert_oursh!("head README.md -n 1", "# oursh\n");
assert_oursh!("false && echo 1", "");
assert_oursh!("false || echo 1", "1\n");
assert_oursh!("{ echo pi; echo e; }", "pi\ne\n");
assert_oursh!("{#!ruby; puts 1}", "1\n");
Next Steps
- Add a proper job runtime, with status, backgrounding. I’m still not 100% sure how to best implement this (#6).
- More complete builtin support (#24).
- The runtime can’t handle nested pipelines (#12).
- The parser needs to support IO redirection syntax (#26).
- Programs need variables, environments, and functions (#27, #43).
- The lexer needs to handle quotes correctly (#28).
- Plenty of repl issues.
In the next post in this series we'll look at some of the other features of this Rust project, including the CLI argument parser docopt, the terminal control library termion, nix, and more.