Building a BASIC Interpreter, '80s style

It’s funny the rabbit holes one ends up. One of my personal projects for several years has been the creation (exploration, really) of a “fake emulator” — that is, an emulator for a computer that never existed all written in JavaScript. Instead, the machine would pay homage to the eight and sixteen bit machines of the 1980s and ‘90s.

I like to do things the hard way, though: this machine would also be based on a novel instruction set as well. The instruction set would be similar to that of the era, but also be a little easier to use. And so, Retroputer was born. Over several years, the implementation has been built out and improved, although it’ll probably never be “complete” (it’s a personal exploration, after all).

Then @bbcmicrobot became a thing, and I wanted to be able to do a similar thing for Retroputer. My JS development skills are mostly in the arena of the front-end, and so this would be a cool way to get some more back-end skills. One problem: Retroputer could only understand its own assembly language. It had no BASIC support yet.

And so here I am, building a BASIC interpreter, ‘80s style — that is, entirely in assembly language, just like it used to be done. And I thought I’d share that journey, since it’s not often we delve into areas so far from our typical abstractions. My daily driver (JavaScript) makes a lot of things trivial and sometimes those things feel magical. Understanding the lowest levels of the process can often help with understanding those abstractions.

And so… let’s begin.

Parsing in Low-Level Assembly Language

When I wrote the assembler for Retroputer, I was able to use a really nice tool called Pegjs. This made quick work of the assembler’s custom syntax, but was unfortunately there’s nothing like it for Retroputer ASM.

Which means we have to do it the hard way.

Parsing actually occurs in multiple phases. A language that uses a compiler parses the code into an abstract syntax tree (or similar concept), and can then use that tree to generate the resulting native code. A consequence of this is that the program must be syntactically correct in order for the compilation to be successful.

Some interpreters today also have this concept because it’s often useful to generate the intermediate AST and execute from there than it is to execute from the original source.

But for a BASIC interpreter in a machine with limited resources, the most resource-effective way to parse is to do it in multiple phases — some of which occurs at run time. This means, however, that syntax errors often can’t be detected until the program is run and the area of code with the error is encountered.

The three phases of Retroputer BASIC parsing are as follows:

  1. Line Transformation
  2. Tokenization
  3. Runtime Syntax Checking

The first two steps occur as the user enters a program (or loads one). The last one occurs while the program is running. Essentially, the first two build out the rough scaffolding of an airplane, but with no guarantee of flight. The last step is essentially acting as a test pilot — hoping you’ll get off the ground, but not knowing until you try.

Thankfully Retroputer BASIC doesn’t come with such dire consequences for raising an error during runtime.

Note: Source code (in progress) for Retroputer BASIC is available on GitHub.

Line Transformation

This is the easiest part of the entire process. Essentially, the line that the user enters is converted to uppercase so that later processes are easier (and faster). BASIC is not sensitive to case, and so we can use that to our advantage.

print 2+2
' becomes:
PRINT 2+2

Doing this in JavaScript is easy, right?

theLine = theLine.toUpperCase();

But in assembly language, we have to be more detailed about how things get done. We need to read in a character, convert it to uppercase, and then store it somewhere.

           ld  y,  0                # y is our index
_loop:     ld  al, [d, x, y]        # [d,x] is pointer to string
           cmp al, 97               # is al (char) in range?
           brs n   _continue        # Not a lowercase char; continue
           cmp al, 123              # high portion
           brs !n  _continue        # not a lowercase char; continue
           and al, 0b1101_1111      # uppercase!
           st  [d, x, y], al        # store back (we modify the original)
_continue: inc y                    # move our index along
           cmp al, 0                # check for NULL
           brs !z _loop             # No? Go back for more.

The above doesn’t quite match the same semantics as the JavaScript version. One important difference is that we now use Unicode to work with text, and so converting input from lowercase to uppercase can often be more difficult — and maybe impossible (depending on the language). Retroputer lives in the world of ASCII (rather, it’s own variation, named RetSCII), meaning that all supported characters are encoded into eight bits. This is woefully inadequate for many languages, but also true to the period.

It also means that we can use a nice feature of ASCII to convert from lowercase to uppercase. It turns out that uppercase “A” is represented with 65 in ASCII, and lowercase “a” is represented with 97. If you’re familiar with your powers-of-two, that difference should catch your eye.

So it turns out that lowercase letters are represented with a number that’s exactly 32 above the uppercase letter. Once we know something’s in range, all we need to do is subtract 32!

That works, but we could just do some bit twiddling. For Retroputer this wouldn’t actually be any faster than subtraction, but avoiding subtraction means we don’t have to worry about the carry/borrow flag during arithmetic. It turns out we can use a bitwise and to turn off the bit for the 32 place value instead.

and al, 0b1101_1111         # turn off bit in 32-place
# versus
clr c                       # clear carry
sub al, 32                  # subtract 32

But there’s a catch: not everything can be converted to uppercase. If the user has included a string literal, for example, we have to be more careful. After all, we don’t want Retroputer BASIC to scream at the user all the time, right? (Although many computers of the era didn’t have lowercase capability, Retroputer doesn’t share that same limitation.)

For example:

print "Hello, World!"
' should become:
PRINT "Hello, World!"
' and not
PRINT "HELLO, WORLD!"

This means we need to keep track of whether or not we’re in the middle of a string literal. In BASIC, there’s only one signifier for this: the double quote. If we check if a character is a double quote, we can set a flag, and depending on the flag’s value, we can perform an uppercase operation or leave things alone.

It turns out that in JavaScript there’s no built-in to accomplish this, but we can build one:

const len = theLine.length;
let insideString = false;
for (let i = 0; i < len; i++) {
    const ch = theLine[i];
    if (ch === `"`) insideString = !insideString;
    if (!insideString) {
        const newCh = ch.toUpperCase();
        if (ch !== newCh) theLine[i] = newCh;
    }
}

Now the logic of the JS more closely matches that of the assembly version, although we’re taking advantage of JS’s unicode support a bit more.

The assembly version looks like this:

           ld  y,  0                # y is our index
           ld  bl, 0                # === insideString (false)
_loop:     ld  al, [d, x, y]        # [d,x] is pointer to string
           cmp al, 34               # is al a double quote?
           brs !z  check_char       # no? should we uppercase it?
           xor bl, 0xFF             # yes? toggle insideString
_check_char:
           cmp bl, 0xFF             # inside a string?
           brs z   _continue        # yes? don't modify it
           cmp al, 97               # is al (char) in range? "a"
           brs n   _continue        # Not a lowercase char; continue
           cmp al, 123              # high portion           "z"
           brs !n  _continue        # not a lowercase char; continue
           and al, 0b1101_1111      # uppercase!
           st  [d, x, y], al        # store back (we modify the original)
_continue: inc y                    # move our index along
           cmp al, 0                # check for NULL
           brs !z _loop             # No? Go back for more.

So far all we’ve done is transform the input text to uppercase, but there’s one extra benefit here in the way we’ve had to track if we’re inside a string. We can do one round of syntax checking here!

If, at the end of the process we find that inString is still true (bl = 0xFF), we can trigger an error, because it means that there’s an unterminated string literal somewhere in the line.

Side note: It turns out many BASICs are quite lenient when it comes to terminating quotes for strings. One of many things I learned while building my own interpreter. Even so, it doesn’t feel right to me, and so Retroputer BASIC doesn’t permit it.

Tokenization

The next phase of parsing involves converting an entered line into something more efficient for Retroputer BASIC to execute. This is as close to the concept of an abstract syntax tree that we’ll get here — the result will definitely not be a tree. But it will be something that we can quickly evaluate during runtime.

One common feature of early microcomputers was a very limited memory capacity. Retroputer has more memory than most machines of the time had by default, but it still has much less than modern machines. As such, long BASIC programs could easily consume far too much memory if they were stored as the user typed them.

To save space, keywords are tokenized as the program is entered into memory. This process converts keywords into single-byte tokens. Keywords are always at least two bytes long, and so this savings can add up. It also means that we can use a lookup table during execution to call the appropriate assembly language routines.

Retroputer BASIC goes a little further than most BASICs of the time, though. It will also convert numbers to binary representations, mark strings, calculate variable references, and more. This wastes some space, to be honest, but the performance benefits (and ease of execution) help outweigh this.

So, there’s a few steps involved here:

  1. Tokenize numbers

    Numbers are converted to their binary form to avoid having to convert them each time they are encountered. For numbers encountered only once, this isn’t a huge performance benefit, but in a tight loop, this is beneficial since the number is already in a form the computer can understand.

  2. Mark strings

    Because memory is limited, if there’s a string in the code that can be used as-is, it makes sense to do so. For example, PRINT “Hello, World” can print “Hello, World” directly from the program line, rather than allocating new space, copying the string, and then printing it.

    To make it easy to skip strings during execution, we also store the length of the string itself.

  3. Search keyword table

    Anything that’s not a number or a string might be a keyword — so we need to take a look through the list of keywords. This is trivial in JavaScript, but it’s not so easy in assembly language!

    Once a keyword is found, the associated token is stored in program memory (instead of the entire keyword itself). This can result in significant storage savings, especially when PRINT can be reduced to a single byte!

  4. Calculate variable pointers

    Retroputer BASIC variable names are only significant to the first two characters (currently). This makes it trivial to look up a variable in an array with a fairly simple mathematical expression. Even so, this calculation takes time, and so it’d be nice if we didn’t have to do it every time we encountered the variable.

    Retroputer BASIC will calculate this index and store it alongside the variable name. In addition to the variable name, it also stores the length of the variable to speed up runtime execution. This consumes a good amount of space, and so wouldn’t have been a good solution on computers with limited memory, but it works for Retroputer BASIC.

I won’t go into the assembly language for this step in this post. I’ll save that for a future post. Rest assured, though, it takes a lot of code.

Runtime Syntax Checking

Last, but definitely not least, is checking syntax at runtime. This is reasonably trivial to do once you have a tokenized representation of the code.

First, as part of the execution phase, BASIC checks if it is currently looking at a token. All tokens have the high bit set (so they have a value of 128 or higher). If a token is found, we can determine what subroutine to call simply by looking it up in a vector table. This also makes it trivial to render syntax errors — some keywords make no sense as statements, and so the vector table just points to the routine that generates a syntax error.

Once a statement’s token handler is called, the handler takes over additional parsing responsibilities. It can use gettok, gettok-raw, peektok, etc., to get and advance past tokens. If the token is something the routine didn’t expect, the routine just returns an error code. This is where both syntax and type errors are caught.

Should a statement need to evaluate an expression, another phase of parsing is performed. During expression parsing another vector lookup table is used, which means we can catch keywords that don’t make sense inside a mathematical expression and raise the appropriate errors. For example, if you tried to enter PRINT 2+CLS, you’d get a syntax error at the CLS portion (CLS is a keyword that is short for “clear screen”).

Note: We can also determine operator precedence and number of required parameters for functions from this table. This is important for actually evaluating the expression, but we also use these to catch cases where the user may not have supplied enough arguments.

Because the token directly maps to an entry in a vector lookup table, execution can proceed pretty quickly with minimal effort. The work of parsing each kind of statement is left to the handler itself, and generally this isn’t too much of a problem. PRINT and INPUT are probably the most complex to parse, but every step is taken a token at a time.

Because a lot of checking isn’t done until runtime, it does mean that you can have partial results before an error occurs. For example:

PRINT "Hello";CLS
Hello
?Syntax Error

It also means that should your program leave the screen in a state where you can’t actually see text, you could be up a tree in terms of recovering. The syntax error is printed, but if you can’t see it… well, what are you going to do?

There’s definitely downsides to this kind of syntax checking, but it also makes for a reasonably simple interpreter as well.

Next Time

Next time we’ll talk go into a little more detail about how the second parsing phase works, and how much easier it would be in JavaScript with modern abstractions and standard libraries. But every step in this process gives me an even greater appreciation for our modern conveniences, and just how much work is going on below the surface.

Edit History:

  • 8 July 2020: Updated vector table links to new auto-generated token files.


Graham Toal picture

Here's a contemporary 80's book on writing interpreters such as Basic that's worth a read: http://lmgtfy.com/?q=%22Writing+Interactive+Compilers+and+Interpreters%22+%22djvu%22

Carl Gundel picture

Nice work. This is so great and I hope what you are doing also encourages other people to try their hand at this kind of thing. Some programmers think that what you're doing is a kind of black magic wizardry, but I think you just need a good teacher. That teacher can be books, or it can be your work. So, thanks for taking the time to show people what you're doing!

Are you familiar with the book The Elements of Computing Systems? It explains designing a computer from scratch and there is software that can be used to implement a computer in software.

My own Liberty BASIC language grew out of smaller interpreter kinds of work including a simulator (written in BASIC) for machining equipment, a tiny SQL-like report engine, and finally a BASIC-like scripting language for a robot battle game that I abandoned when I decided to just go ahead and make a full blown BASIC, and this became Liberty BASIC. After I had taken my first effort in language design as far as it could go, I had a lot of ideas and decided to reimplement it from scratch. Some of my own users have gone on to implement their own systems. :-)

Liberty BASIC home page

Kerri Shotts picture

Thanks! I've always believed that knowing how a computer does what it does is instrumental in figuring out how to solve problems effectively and then debug when things go wrong. Computers can seem like so much magic in modern development, but even a little glimpse of the internal mechanism can reduce the mystery and provide some new paths of investigation and discovery.

I'm not familiar with that book, but now I'm going to take a look at it! I think I'll start collecting this information in a repo :-)

Liberty BASIC looks pretty awesome! I've come across it a few times, but haven't had a chance to try it since I'm on macOS. As I've been building out BASIC, I'm rediscovering the joys of BASIC programming. It's had a bad rap for a long time, but it's also so approachable and makes iteration super fast and easy.

It's always cool to see when people start building their own systems after they've experienced yours. :-)

Carl Gundel picture

Thanks for the reply. You can use Liberty BASIC v5.0 alpha releases on MacOS, and also Windows, x86 Linux and also Raspberry Pi. It's coming along pretty well. Alpha test announcement

Kerri Shotts picture

Oh, that's awesome! Checking it out!

Carl Gundel picture

Let me know what you think. It would be great to see you in our community forum!

Zsolt Tarczali picture

Nice post & project! Thanks

Rhett Trickett picture

Very cool project, Kerri! Came across the online demo in your repo and it looks awesome. Thanks for sharing!