Building a BASIC Interpreter, '80s Style, Part 2

If you recall from part one of the series, we only really delved into Retroputer BASIC's first phase of parsing: converting the entered line to uppercase and checking that quotes were properly matched. (Note: we could do additional checks there too, like matching parentheses, but we don't at the moment.)

I also mentioned that I wanted to address how much we take even fairly limited standard libraries for granted today (especially in relation to JavaScript). Nothing described here is magic, but sometimes we take our tools for granted, and I find it useful to understand what's going on underneath the hood.

Tokenizing User Input

As mentioned in the last post, a common tactic used by BASICs of the era was that of tokenization. In order to save space and increase execution speed, keywords were "crunched" into single-byte tokens (or two-byte tokens if you needed more keywords).

Let’s imagine that you have a simple line of code that clears the screen and prints the now common greeting, like the following.

CLS: PRINT "Hello, World"

While this doesn’t take up a lot of memory on its own, if you write a long program, a lot of the words in that program will be keywords. If we crunched them (tokenized them), we could save a decent amount of space. For example, after tokenizing the line above, you’d see something more like this in memory:

8A E3 20 B5 FB Hello, World 00 00

It doesn’t take too much effort to map this back to our original statement:

Byte               Keyword          Notes
8A                 CLS
E3                 ":"              End of Statement
20                 " "              We store up to one space
B5                 PRINT
FB                 String-in-code   String literal follows
Hello, World, 00                    Strings are NUL-terminated
00                 End of Line      Lines are also NUL-terminated

This may seem like a lot of work. In this case, our crunched result is 19 bytes, while the original text is 26 bytes (counting a terminating NUL). That's not a huge savings on its own, but you should be able to imagine how quickly it adds up over a long program.

Saving some space becomes important if you’re a BASIC interpreter running on a machine with very limited RAM. Some BASICs had less than a kilobyte of RAM for user programs, so this crunching was indeed very important, and is something that would have been appealing even if it came with no additional benefits.

OK — so just how do we go about tokenizing something like this? It initially seems fairly trivial in JavaScript: given an array of strings, you could easily do a quick replace of each keyword with the corresponding token.

Sounds like a job for String#replace, right? Naively, we might try the following:

const tokens = ["CLS", "PRINT", ":"];
function tokenize(inStr) {
    let newStr = inStr;
    tokens.forEach((token, idx) => {
        newStr = newStr.replace(
            new RegExp(token, "g"), String.fromCharCode(128 + idx)
        );
    });
    return newStr;
}

Unfortunately, it all comes crashing down when you realize that we can’t go about replacing tokens inside of string literals. This means we need to proceed character by character, keeping the context in mind, in order to avoid crunching things that aren’t actually keywords.

This new algorithm much more closely matches that of the assembly language code in Retroputer — but JS still makes things a lot easier. Here’s the start of that JS code (we’ll fill it in throughout the post):

const tokens = ["CLS", "PRINT", ":" ];

function tokenize(inStr) {
    let out = [];                    // return value is "crunched" array
    let idx = 0;                     // index into inStr
    let ch = "";                     // current character
    while (idx < inStr.length) {
        ch = inStr.charCodeAt(idx);  // get the current character code
        
        // our logic is going to go here

        out.push(ch);                // for now push the character
        idx++;                       // and keep going (next char)
    }
    out.push(0);                     // we end with NUL
    return out;
}

We’re starting with a very simplified token list, as no one wants to see the entire table in this format (it’s long, and Retroputer actually builds its token tables from a JS file!), but this should be enough for our purposes here. The idea is that when we see a keyword, we’ll record its index in the output array.

We now have a function that just converts a string into its equivalent character codes. It doesn’t do anything useful yet, but it serves as scaffolding for everything that follows.

Our assembly language version is pretty similar to the above as well.

  do {
    call _get-source-index     # get next character index (in y)
    dl := <BP+source>,y        # read next char from our source string (ch)
    call _adv-source-index     # advance our source index
    call _get-target-index     # get position in target   ("out" in JS)
    <BP+target>,y := dl        # store to target          (out[y] = ch)
    call _adv-target-index     # move it along
    cmp dl, 0                  # was char 0? if so, break
  } while !z

I’ve not included _get-source-index or the other functions above because they do what they say on the tin, and are simply getting, setting, or advancing our source and target index. One thing of note is that unlike in JS, we don’t have dynamically allocated arrays in assembly language, and so this algorithm pre-allocates a reasonably-sized buffer. As we advance through the input string, we have to know where to write the next token to the target buffer, and that’s what the target index is doing above. Each one of the functions we’re calling above returns the index in Y. For example, the _adv-target-index function advances the target index by one (equivalent to y++).
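
If it helps to see the shape of this in JS terms, here’s a rough analogue of what those index helpers are doing (the function name, buffer size, and variable names are mine, not Retroputer’s):

function copyUntilNul(inStr) {
    const target = new Uint8Array(256);  // pre-allocated, like the assembly
    let sourceIndex = 0;                 // tracked by _get/_adv-source-index
    let targetIndex = 0;                 // tracked by _get/_adv-target-index
    let ch;
    do {
        ch = inStr.charCodeAt(sourceIndex) || 0;  // NaN past the end becomes 0
        sourceIndex++;                            // _adv-source-index
        target[targetIndex] = ch;                 // <BP+target>,y := dl
        targetIndex++;                            // _adv-target-index
    } while (ch !== 0);
    return target.subarray(0, targetIndex);
}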

Careful with Literals

One thing that we should be careful of is that string literals could confuse our tokenizer — we can’t just replace every string of characters that matches a keyword with a token value. If we see a string literal with “CLS” in it, we shouldn’t try to tokenize that. It’s not intended to be executable, and if we print it… well, we’d print the byte corresponding to the token instead. Not likely what the developer wanted.

Number literals, on the other hand, shouldn’t have the same problem since BASIC doesn’t have any numbers as keywords. Even so, there’s zero point in doing a keyword search if we’re looking at a number — why waste time?

Tokenizing String Literals

So, first, let’s check if we’re starting a string — that’s not too hard to do in JS:

if (ch === 34) {
  out.push(0xFB);              // string-in-code token
  idx++;
  ch = inStr.charCodeAt(idx);  // get next character after the quote
  while (ch !== 34 && idx < inStr.length) {
    out.push(ch);
    idx++;
    ch = inStr.charCodeAt(idx);
  }
  idx++;                       // go past the quote
  out.push(0);                 // end of string
  continue;
}

A double quote is represented as character code 34. Other programming languages recognize more quote styles (JS, for example, also accepts single quotes and backticks), but BASIC is simple here: double quotes or bust.

Once we see that we’re starting a string, we can simply consume the remaining characters and add them to our return array until we see another quote.

When we’ve read in the whole string, we need to add a NUL byte, since Retroputer BASIC uses C-style strings. If we wanted to use Pascal-style strings, we could have maintained a counter and made sure to insert the length of the string literal. Not a big deal either way. The only reason I went with NUL-terminated strings is because that’s super easy to deal with in assembly language since we can just compare against the NUL byte instead of maintaining a counter.
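
For the curious, the Pascal-style version would only be a small change to the loop above. Something like this (a sketch only; this isn’t what Retroputer does):

// Pascal-style alternative: reserve a slot for the length byte, then
// backfill it once we know how many characters we consumed.
const lengthSlot = out.length;
out.push(0);                           // placeholder for the length
// ... consume the string characters into out, as before ...
out[lengthSlot] = out.length - lengthSlot - 1;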

Ok, so that JavaScript wasn’t too hard. Most of us, though, would probably reach for something a bit more abstract, like a regular expression, which makes the intent a little more obvious.

if (ch === 34) {
  out.push(0xFB);              // string-in-code token
  const stringLiteral = inStr.substr(idx+1).match(/^[^"]*/)[0];
  idx += stringLiteral.length+1;
  out.push(...Array.from(stringLiteral, ch => ch.charCodeAt(0)));
  idx++;                       // go past the quote
  out.push(0);                 // end of string
  continue;
}

The above does very much the same thing — but instead of us having to do the work of checking character by character, we let JS do it instead with match. We know we’ll get a match (we’re in a string), so we don’t really even need to bother with checking if we get a successful match. Then we just increment our index past the length of the string and copy the characters into our return array. To me, anyway, this is much easier to follow (but it does assume you understand regular expressions, along with some ES6 syntax like spreads and arrow functions).

Of course, in assembly language we have to do the work that JS is doing for us. This yields a very similar result to our first JS attempt:

  cmp dl, constants.QUOTE         # is dl a quote?
  brs !Z not-a-string             # nope; not a string

  call _get-target-index          # get next position in crunch target
  dl := brodata.TOK_CODE_STRING   # "String" token
  <bp+target>,y := dl             # Store it into the crunch target
  call _adv-target-index

still-a-string:
  call _get-source-index
  dl := <bp+source>,y             # get string character
  call _adv-source-index
  cmp dl, constants.QUOTE         # if it's a quote, we can zero it
  if Z { 
    dl := 0 
  }
  call _get-target-index
  <bp+target>,y := dl             # write to target
  call _adv-target-index
  cmp dl, 0                       # are we done?
  brs !Z still-a-string           # no, keep going
  continue                        # next!
not-a-string:

One thing to note about Retroputer’s assembly language parser — it has some very basic support for higher-level constructs, like blocks and loops. So if Z {…} will execute the contents inside the block if the zero flag is set, and continue can be used to branch back to the top of a block (break also works to exit the block). This is translated to various compare and branch instructions by the assembler, so the CPU doesn’t see any of the high level bits. But it makes writing code just a little easier.

Tokenizing Numbers

There’s also no point in trying to search for our numbers in our list of keywords, and so we may as well skip past those. Most BASICs would do something very similar to the string routine above — as long as the character it read was a digit, it would be concatenated to the output, and the cruncher would carry on.

Retroputer BASIC (and a few other BASICs, like Atari BASIC) goes one step further: it converts the number into the corresponding binary format. This is very easy to do — if you see a digit, multiply an accumulator by 10 and add the digit, and repeat that as long as you see a digit. (I should note, however, that Retroputer BASIC is currently an integer-only affair. Adding floating point is on the todo list, though.)

(I should note that Retroputer BASIC currently does nothing when it comes to integer overflows, signed or otherwise. Once I add floating point, Retroputer will convert to a floating point representation instead.)
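
In JS, that multiply-by-ten accumulator is only a few lines. This is a sketch of the idea rather than Retroputer’s actual code path (and, like the real thing, it ignores overflow):

function parseDecimal(str) {
    let value = 0;
    for (const ch of str) {
        value = value * 10 + (ch.charCodeAt(0) - 48);  // 48 is "0"
    }
    return value;
}

parseDecimal("1234");  // 1234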

Retroputer BASIC also goes another step further: it’ll recognize hexadecimal numbers as well and convert them to their binary equivalent. It uses 0x (just like JS) as the signifier, and has some additional logic to make sure that specifying multiple x characters is considered an error.

In assembly language, checking if a character is a digit isn’t hard, but it is a little verbose, involving a couple of comparisons to see if the character code is between 0x30 and 0x39. (These are the character codes for “0” and “9”, respectively.)
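
The JS equivalent of those two comparisons is a one-liner:

const isDigit = (code) => code >= 0x30 && code <= 0x39;  // "0" through "9"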

Once we have a digit character, we can take advantage of another nicety of the character set. 0x30 is the character code for “0”, 0x31 is the code for “1”, and so on. We could subtract 0x30 if we wanted to, but we have an easier way: just drop the top four bits:

and dl, 0b0000_1111

Unfortunately this doesn’t work if we need to handle hexadecimal numbers. For that, we can subtract 0x30 and then compare the result to 10; if the result is 10 or greater, we adjust down by 7 (assuming the hexadecimal digits are uppercase “A”-”F”).
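
Expressed in JS, that trick looks like this (again assuming uppercase hex digits):

function hexDigitValue(code) {
    let value = code - 0x30;    // "0"-"9" become 0-9; "A"-"F" become 17-22
    if (value > 9) value -= 7;  // pull "A"-"F" down to 10-15
    return value;
}

hexDigitValue("F".charCodeAt(0));  // 15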

In JS, we can use regular expressions again to see if we are looking at a number, and then once we have the matched number, we can call Number(), which also gives us another perk: hexadecimal numbers are trivial to convert as well, since Number() will do it automatically if the number starts with 0x.

So what does this look like in JavaScript?

if (ch >= 48 && ch <= 57) {
  out.push(0xFD);            // number token
  const numberLiteralStr = inStr.substr(idx).match(/^[\dA-Fa-fxX]+/)[0];
  idx += numberLiteralStr.length;
  const numberLiteral = Number(numberLiteralStr);
  const bytes = new Uint8Array(new Uint16Array([numberLiteral]).buffer);
  out.push(...bytes);
  continue;
}

The regular expression allows any combination of digits or hexadecimal digits (plus an x to get us into hexadecimal mode). This expression isn’t perfect — it’ll allow multiple xs, for example, but it’s good enough for now.

The bit in the above that raises some eyebrows is converting the number into bytes. Number() did the hard work of turning a string of digits into a number that JavaScript can work with, but now we need it to be represented as a series of bytes. We could do the conversion using some math:

const hiByte = (numberLiteral & 0xFF00) >> 8;
const loByte = (numberLiteral & 0x00FF);

… and for an integer, that’s easy to do. But by using JS’s typed arrays, we can skip the math, and also set ourselves up for handling floating point numbers in the future (we’d just swap Uint16Array for Float64Array).
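
For example, here’s what that conversion could look like in a future floating-point version; everything else stays the same:

// 8 bytes per number instead of 2
const bytes = new Uint8Array(new Float64Array([3.14159]).buffer);
console.log(bytes.length);  // 8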

The assembly language code for accomplishing this is a bit longer, but it’s also doing a little bit more work. Retroputer makes another optimization: if the number is small, it gets stored in a smaller format. This means that 0-255 can be stored in a single byte, while larger numbers take up two.
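
Here’s what that optimization might look like grafted onto our JS version. Note that the 0xFE “small number” token below is made up for illustration; only the 0xFD number token appears in this post’s examples:

if (numberLiteral >= 0 && numberLiteral <= 255) {
    out.push(0xFE, numberLiteral);  // hypothetical one-byte number token
} else {
    out.push(0xFD, numberLiteral & 0xFF, (numberLiteral >> 8) & 0xFF);
}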

Looking up Keywords

Okay, so we’ve done a lot of work and we’ve still not actually crunched a keyword. Well, with the number and string literals out of the way, we can be pretty sure that whatever we’re looking at is either a keyword or a variable name. (It could also be a space, but that’s easy to check.)

In BASIC, keywords don’t always start with an alphabetical character; operators and separators also count as tokens. Variables, though, do start with an alphabetical character, just like most keywords, so we can’t immediately tell whether what we’re about to crunch is a keyword or a variable. That’s OK: if we don’t find a match in our token list, we can assume it’s a variable instead.

So how do we actually check to see if our potential keyword is really a keyword? If we were writing JavaScript, we’d probably reach for the Array#findIndex method.

const tokenMatch = tokens.findIndex(token => 
  inStr.substr(idx).startsWith(token));
if (tokenMatch > -1) {
  out.push(tokenMatch + 128);
  idx += tokens[tokenMatch].length;
  continue;
}

The Array#findIndex method will iterate over the tokens array, and we can test to see if inStr (at the current idx) starts with the token we’re checking. With our simplified token list, the iteration looks something like this (let’s imagine inStr.substr(idx) === "PRINT"):

token      .startsWith(token)?    Index
CLS        false                  0
PRINT      true                   1

Note: Like indexOf in JS, if nothing is found, we’d get a -1 for our troubles.

If we find a match, we can store the index in our return array. But how can we later know the difference between a token and a character? Easy: turn the high bit on, which we can do by adding 128 to the token’s index.

(Note: if we needed more than 128 tokens, we’d have to use two bytes for some tokens. This makes things a little more complicated, but not by much. Some BASICs, like various flavors of Microsoft BASIC, do this.)

So, we’ve done the work in JavaScript, but how in the world can we do this in assembly language?

Well, it turns out, we do it pretty much the same way, but it gets a lot more wordy.

search-keywords:
  bl := [d, x]                 # get first character of current token
  cmp bl, constants.NUL        # if it's NUL, we've exhausted the list
  brs Z exit-keyword-search    # so we're clearly not a keyword
  clr Z                        # compare strings, but with partial equality
  call [vectors.STRCMP]        # so that our source doesn't need NUL between
                               # tokens; c will now be how many chars got 
                               # compared
  if !Z {
    do {                       # no match; advance past our token in the list
      inc x                    # character by character
      bl := [d, x]             # tokens are terminated with NULs
      cmp bl, constants.NUL    # so if we see it, we can stop
    } while !z
    inc x                      # go past the token #
    inc x                      # in the lookup table
    brs search-keywords        # and keep looking for a match
  }
  clr c                        # found the match!
  add x, c                     # c is the length of the match
  inc x                        # past the NUL
  bl := [d, x]                 # bl is now the token #
  call _get-target-index
  <bp+target>,y := bl          # write the token #
  call _adv-target-index
  call _get-source-index       # advance past the token in the source
  clr c
  add y, c                     # by adding the character count
  dec y                        # back off by one (we'll advance again later)
  call _set-source-index
  continue

Okay, that doesn’t look too bad. It’s pretty much the exact same algorithm except that our token table in assembly language is structured a little differently. Namely, it looks something like this:

CLS   00 80
PRINT 00 81
:     00 82

Each keyword is NUL-terminated, and then is followed by the token’s number.
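
Since Retroputer builds its token tables from a JS file anyway, generating this layout from our simplified tokens array is straightforward. A sketch:

const table = [];
tokens.forEach((token, idx) => {
    // the keyword's characters, a NUL terminator, then the token number
    table.push(...Array.from(token, ch => ch.charCodeAt(0)), 0, 128 + idx);
});
table.push(0);  // a NUL where a keyword should start ends the whole list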

We’re leaving something important out here, though — that is, how in the world did we do the string comparison? Retroputer’s kernel has a STRCMP routine which we can use, but what does that look like?

strcmp: {
  enter 0x00
  push a
  push b
  push d
  push y
  pushf
  if Z {
    bl := 0x00                  # Checking for full equality
  } else {
    bl := 0x01                  # only checking for partial equality
  }
_main:
  y := 0                        # start of string
top:
  cl := [d, x, y]               # character in string A
  al := <bp+4>,y                # character in string B
  cmp bl, 0x01                  # check if we're doing full equality
  if Z {
    cmp cl, 0                   # we're not, so check for an early nul
                                # in string b
    brs Z strings-are-equal     # if it's NUL, we call them equal
  }
  cmp cl, al                    # check character
  if Z {
    cmp cl, 0                   # equal, but check for NUL
    brs Z strings-are-equal     # NUL reached, strings are equal
    inc y                       # next character
    brs top                     # not NUL, so keep going...
  }
 # if here, the strings aren't equal
  if N {
    popf                        # string is less than
    set N
    clr Z
    brs _out
  } else {
    popf                        # string is greater than
    clr N
    clr Z
    brs _out
  }
strings-are-equal:
  popf
  clr N                         # Not less than
  set Z                         # but Equal
_out:
  c := y                        # make sure we know how many chars
                                # were compared
  pop y
  pop d
  pop b
  pop a
  exit 0x00
  ret
}

I don’t know about you, but I’m loving JS’s String#startsWith method more and more. It does the same thing, granted, but I didn’t have to write its internal logic!

Handling Variables

We could be done at this point — the work of crunching our keywords is complete. Retroputer BASIC does one more step, which is to tokenize the variables. I believe very few BASICs of the ‘80s and ‘90s did this, because it could actually hurt in limited memory conditions. But if you have lots of memory, tokenizing your variables can help with performance.

Here’s what Retroputer BASIC does:

  • It reads up to the first two characters of the variable name. (Two significant characters was a common limitation of BASICs of the time due to memory constraints.)
  • From these two characters, it determines a variable index. “A” is variable zero, “A0” is variable 53, etc. The equation isn’t difficult, but it’s not really the point here.
  • BASIC continues to scan for the variable’s type sigil. In BASIC, $ signifies a string variable, for example. The variable type is stored in a couple of bits high in the variable index.
  • BASIC then writes the type and index to the output, followed by the variable name’s length and the variable name itself. This is where we lose our space savings! (A sketch follows below.)
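
In JS terms, the shape of that output might look something like this. The 0xFC variable token and the bit layout here are hypothetical (and this post deliberately skips the real index equation), but it shows where the extra four bytes come from:

// Hypothetical layout: token, two index bytes with the type folded into
// the high bits, then the name's length and the name itself.
out.push(0xFC);                                    // made-up "variable" token
out.push(index & 0xFF, ((index >> 8) & 0x3F) | typeBits);
out.push(name.length, ...Array.from(name, ch => ch.charCodeAt(0)));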

(Note: when Retroputer BASIC can allocate variables dynamically, the index will be replaced with a pointer to the variable instead. Another thing on the todo list!)

This makes variable lookup during execution extremely fast: you don’t have to parse the variable name and calculate the index each time you encounter the variable. In a tight loop, that savings can add up. But it also comes at a huge cost: we have to store the index and the variable name together, and so we tack on an extra four bytes in the output for every variable we see.

It’s up to you to decide if that’s worth it.

Regardless, determining if what’s left in our character stream is indeed a variable is also easy in JavaScript:

const variableMatch = inStr.substr(idx).match(/^[A-Z]+[A-Z0-9]*[\$]*/);
if (variableMatch) {
  const variableName = variableMatch[0];
  idx += variableName.length;
  // tokenize it (omitted; nothing new here)
  continue;
}

I’m not going to go into the code Retroputer does at this point to tokenize variables — it’s very wordy, but not particularly interesting… yet. When I add dynamic variable allocation, I’ll revisit this.

Tokenization Complete

Now, if you test our JS code, you’ll get an array of bytes similar to what Retroputer BASIC uses internally:

console.log(toHex(tokenize(`CLS: PRINT "Hello, World"`)));
// 80 82 20 81 20 FB 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 00 00 
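
(toHex is just a display helper, not part of the tokenizer; a minimal version might look like this:)

const toHex = (bytes) =>
    bytes.map(b => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");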

Wow — so that was a lot of work for a few bytes’ worth of savings. But if you’ve only got a couple kilobytes of free memory, it’s worth it! And that’s not the only benefit we get from crunching the user’s input: we get faster execution times as well.

To explain that, we’re going to need to look at how Retroputer executes code. I’m not going into all the details yet, but the short story is that we can do the following to execute code:

  • Get a byte from memory
  • If the byte has the high bit set, it’s a token — otherwise it’s a syntax error (or NUL — in which case, we’re done with the program line)
  • Look the token’s handler up in an array — this array has pointers to the actual functions that operate on the token
  • Call the handler (the handler is responsible for consuming arguments and the like)
  • Repeat

This is the other reason why we tokenize keywords — it becomes trivial to execute the logic for the keyword simply by looking up an address in an array and calling it.

We’ll cover the execution aspect in more detail in a future post, but I wanted to note that although tokenization is important for space savings, it also improves execution speed.

For example, in JS, an execution loop could be something like this:

const handlers = [cls, print, endOfStmt];  // indexed by (token - 128)
bytes.forEach(byte =>
  (byte & 0x80) && handlers[byte & 0x7f] && handlers[byte & 0x7f]());

Of course it’s not quite that simple — there’s a lot more that goes into it, but that’s for another post!


Resources

Revisions

  • August 11, 2020: Added a link to the redux to part 1 of this series, which has more resources that I’ve been using.
  • August 9, 2020: Added some links to resources



kdgarris:

Hi Kerri,

I enjoyed reading your entries re: building a BASIC interpreter, as I am working on one myself (6502 processor for the Atari 7800). I have a question: regarding tokenization, how do you distinguish your converted integers that have a value of zero vs a NUL to end your line?

Looking forward to Part 4! Very much appreciating the single steps you are going through. It's filled in a couple of comprehension holes for me.