If you recall from part one of the series, we only really delved into Retroputer BASIC’s first phase of parsing — namely, converting the entered line into uppercase and checking if quotes were properly matched. (Note: it should be mentioned that we could do additional checks there too — like matching parentheses, but we don’t at the moment.)
I also mentioned that I wanted to address how much we take even fairly limited standard libraries for granted today (especially in relationship to JavaScript). Nothing described here is magic, but sometimes we take our tools for granted, and I find it useful to understand what’s going underneath the hood.
Tokenizing User Input
As mentioned in the last post, a common tactic used by BASICs of the era was that of tokenization. In order to save space and increase execution speed, keywords were “crunched” into single byte token (or two bytes if you needed more keywords).
Let’s imagine that you have a simple line of code that clears the screen and prints the now common greeting, like the following.
CLS: PRINT "Hello, World"
While this doesn’t take up a lot of memory on its own, if you write a long program, a lot of the words in that program will be keywords. If we crunched them (tokenized them), we could save a decent amount of space. For example, after tokenizing the line above, you’d see something more like this in memory:
8A E3 B5 FC Hello, World 00 00
It doesn’t take too much effort to map this back to our original statement:
Byte |
Keyword |
Notes |
---|---|---|
8A |
CLS |
|
E3 |
“:” |
End of Statement |
32 |
“ “ |
We store up to one space |
B5 |
|
|
FB |
String-in-code |
String literal follows |
Hello, World, 00 |
Strings are NUL-terminated |
|
00 |
End Of Line |
Lines are also NUL-terminated |
This may seem like a lot of work, but the space savings can be significant. It’s not a lot here, but even so, you should be able to imagine how it could quickly add up. In this case, our crunched result is 19 bytes. The original text is 26 bytes (counting a terminating NUL).
Saving some space becomes important if you’re a BASIC interpreter running on a machine with very limited RAM. Some BASICs had less than a kilobyte of RAM for user programs, so this crunching was indeed very important, and is something that would have been appealing even if it came with no additional benefits.
OK — so just how do we go about tokenizing something like this? It initially seems fairly trivial in JavaScript. Given an array of strings, you could easily do a quick replace of each keyword with the corresponding token. Right?
Sounds like a job for String#replace
, right? Naively, we might try the following:
const tokens = ["CLS", "PRINT", ":" ];
function tokenize (inStr) {
const newStr = inStr;
tokens.forEach((token, idx) => newStr = newStr.replace(
new RegExp(token, "g"), String.fromCharCode(128+idx)
);
return newStr;
}
Unfortunately, it all comes crashing down when you realize that we can’t go about replacing tokens inside of string literals. This means we need to proceed character by character, keeping the context in mind, in order to avoid crunching things that aren’t actually keywords.
This new algorithm much more closely matches that of the assembly language code in Retroputer — but JS still makes things a lot easier. Here’s the start of that JS code (we’ll fill it in throughout the post):
const tokens = ["CLS", "PRINT", ":" ];
function tokenize(inStr) {
let out = []; // return value is "crunched" array
let idx = 0; // index into inStr
let ch = ""; // current character
while (idx < inStr.length) {
ch = inStr.charCodeAt(idx); // get the current character code
// our logic is going to go here
out.push(ch); // for now push the character
idx++; // and keep going (next char)
}
out.push(0); // we end with NUL
return out;
}
We’re starting with a very simplified token list, as no one wants to see the entire table in this format (it’s long, and Retroputer actually builds its token tables from a JS file!), but this should be enough for our purposes here. The idea here is that when we see a token, we’ll record its index in an output array.
At this point we have a function that, for now, just converts a string into its equivalent character codes. It doesn’t do much useful at this point, but can serve as useful scaffolding.
Our assembly language version is pretty similar to the above as well.
do {
call _get-source-index # get next character index (in y)
dl := <BP+source>,y # read next char from our source string (ch)
call _adv-source-index # advance our source index
call _get-target-index # get position in target ("out" in JS)
<BP+target>,y := dl # store to target (out[y] = ch)
call _adv-target-index # move it along
cmp dl, 0 # was char 0? if so, break
} while !z
I’ve not included _get-source-index
or the other functions above because they do what they say on the tin, and are simply getting, setting, or advancing our source and target index. One thing of note is that unlike in JS, we don’t have dynamically allocated arrays in assembly language, and so this algorithm pre-allocates a reasonably-sized buffer. As we advance through the input string, we have to know where to write the next token to the target buffer, and that’s what the target index is doing above. Each one of the functions we’re calling above returns the index in Y
. For example, The _adv-target-index
function advances the target index by one (equivalent to y++
).
Careful with Literals
One thing that we should be careful of is that string literals could confuse our tokenizer — we can’t just replace every string of characters that matches a keyword with a token value. If we see a string literal with “CLS” in it, we shouldn’t try to tokenize that. It’s not intended to be executable, and if we print it… well, we’d print the byte corresponding with the token instead. Not likely what the developer wanted.
Number literals, on the other hand, shouldn’t have the same problem since BASIC doesn’t have any numbers as keywords. Even so, there’s zero point in doing a keyword search if we’re looking at a number — why waste time?
Tokenizing String Literals
So, first, let’s check if we’re starting a string — that’s not too hard to do in JS:
if (ch === 34) {
out.push (0xFB); // string-in-code token
idx++;
ch = inStr.charCodeAt(idx); // get next character after the quote
while (ch !== 34 && idx < inStr.length) {
out.push(ch);
idx++;
ch = inStr.charCodeAt(idx);
};
idx++; // go past the quote
out.push(0); // end of string
continue;
}
A double quote is represented as character code 34. Other programming languages recognize many more quote styles (such as JS or C), but BASIC is simple here: double quotes or bust.
Once we see that we’re starting a string, we can simply consume the remaining characters and add them to our return array until we see another quote.
When we’ve read in the whole string, we need to add a NUL byte, since Retroputer BASIC uses C-style strings. If we wanted to use Pascal-style strings, we could have maintained a counter and made sure to insert the length of the string literal. Not a big deal either way. The only reason I went with NUL-terminated strings is because that’s super easy to deal with in assembly language since we can just compare against the NUL byte instead of maintaining a counter.
Ok, so that JavaScript wasn’t too hard. Most of us, I think, would reach for something a bit more abstract, like a regular expression. It makes the intent a little more obvious, I think.
if (ch === 34) {
out.push (0xFB); // string-in-code token
const stringLiteral = inStr.substr(idx+1).match(/^[^"]*/)[0];
idx += stringLiteral.length+1;
out.push(...Array.from(stringLiteral, ch => ch.charCodeAt(0)));
idx++; // go past the quote
out.push(0); // end of string
continue;
}
The above does very much the same thing — but instead of us having to do the work of checking character by character, we let JS do it instead with match
. We know we’ll get a match (we’re in a string), so we don’t really even need to bother with checking if we get a successful match. Then we just increment our index past the length of the string and copy the characters into our return array. To me, anyway, this is much easier to follow (but does assume you understand regular expressions along with some ES6 syntax like …
and arrow functions).
Of course, in assembly language we have to do the work that JS is doing for us. This yields a very similar result to our first JS attempt:
cmp dl, constants.QUOTE # is dl a quote?
brs !Z not-a-string # nope; not a string
call _get-target-index # get next position in crunch target
dl := brodata.TOK_CODE_STRING # "String" token
<bp+target>,y := dl # Store it into the crunch target
call _adv-target-index
still-a-string:
call _get-source-index
dl := <bp+source>,y # get string character
call _adv-source-index
cmp dl, constants.QUOTE # if it's a quote, we can zero it
if Z {
dl := 0
}
call _get-target-index
<bp+target>,y := dl # write to target
call _adv-target-index
cmp dl, 0 # are we done?
brs !Z still-a-string # no, keep going
continue # next!
not-a-string:
One thing to note about Retroputer’s assembly language parser — it has some very basic support for higher-level constructs, like blocks and loops. So if Z {…}
will execute the contents inside the block if the zero flag is set, and continue
can be used to branch back to the top of a block (break
also works to exit the block). This is translated to various compare and branch instructions by the assembler, so the CPU doesn’t see any of the high level bits. But it makes writing code just a little easier.
Tokenizing Numbers
There’s also no point in trying to search for our numbers in our list of keywords, and so we may as well skip past those. Most BASICs would do something very similar to the string routine above — as long as the character it read was a digit, it would be concatenated to the output, and the cruncher would carry on.
Retroputer BASIC (and a few other BASICs, like Atari BASIC) goes one step further: it converts the number into the corresponding binary format. This is very easy to do — if you see a digit, multiply an accumulator by 10 and add the digit, and repeat that as long as you see a digit. (I should note, however, that Retroputer BASIC is currently an integer-only affair. Adding floating point is on the todo list, though.)
(I should note that Retroputer BASIC currently does nothing when it comes to integer overflows, signed or otherwise. Once I add floating point, Retroputer will convert to a floating point representation instead.)
Retroputer BASIC also goes another step further: it’ll recognize hexadecimal numbers as well and convert them to their binary equivalent. It uses 0x
(just like JS) as the signifier, and has some additional logic to make sure that specifying multiple x
characters is considered an error.
In assembly language, checking if a character is a digit isn’t hard, but it is a little verbose, involving a couple of comparisons to see if the character code is between 0x30
and 0x39
. (These are the character codes for “0” and “9”, respectively.)
Once we have a digit charachter, we can take advantage of another nicety of the character set. 0x30
is the character code for “0”, 0x31
is the code for “1”, and so on. We could subtract 0x30
if we wanted to, but we have an easier way: just drop the top four bits:
and dl, 0b0000_1111
Unfortunately this doesn’t work if we need to handle hexadecimal numbers. For that, we can subtract and then compare to 10, and then adjust down by 7 again if we were greater than 10 (assuming the hexadecimal digits are uppercase “A”-”Z”).
In JS, we can use regular expressions again to see if we are looking at a number, and then once we have the matched number, we can call Number()
, which also gives us another perk: hexadecimal numbers are trivial to convert as well, since Number()
will do it automatically if the number starts with 0x
.
So what does this look like in JavaScript?
if (ch >= 48 && ch <= 57) {
out.push (0xFD); // number token
const numberLiteralStr = inStr.substr(idx).match(/^[\d|A-F|a-f|x|X]+/)[0];
idx += numberLiteralStr.length;
const numberLiteral = Number(numberLiteralStr);
const bytes = new Uint8Array(new Uint16Array([numberLiteral]).buffer);
out.push(...bytes)
continue;
}
The regular expression allows any combination of digits or hexadecimal digits (plus an x
to get us into hexadecimal mode). This expression isn’t perfect — it’ll allow multiple x
s, for example, but it’s good enough for now.
The bit in the above that raises some eyebrows is converting the number into bytes. Number()
did the hard work of turning a string of digits into a number that JavaScript can work with, but now we need it to be represented as a series of bytes. We could do the conversion using some math:
const hiByte = (numberLiteral & 0xFF00) >> 8;
const loByte = (numberLiteral & 0x00FF);
… and for an integer, that’s easy to do. But by using JS’s typed arrays, we can skip the math, and also set ourselves up to handling floating point numbers in the future (we’d just swap Uint16Array
for Float64Array
.
The assembly language code for accomplishing this is a bit longer, but it’s also doing a little bit more work. Retroputer takes another optimization: if the number is small, it gets stored in a smaller format. This means that 0-255 can be stored in a single byte, while larger numbers take up two.
Looking up Keywords
Okay, so we’ve done a lot of work and we’ve still not actually crunched a keyword. Well, with the number and string literals out of the way, we can be pretty sure that whatever we’re looking at is either a keyword or a variable name. (It could also be a space, but that’s easy to check.)
In BASIC, keywords don’t always start with an alphabetical character — operators and separators also count as tokens. But variables also start with an alphabetical character. So we can’t immediately tell if what we’re about to crunch is a keyword or a variable. That’s OK—if we don’t find a match in our token list, we can assume it’s a variable instead.
So how do we actually check to see if our potential keyword is really a keyword? If we were writing JavaScript, we’d probably reach for the Array#findIndex
method.
const tokenMatch = tokens.findIndex(token =>
inStr.substr(idx).startsWith(token));
if (tokenMatch > -1) {
out.push(tokenMatch + 128);
idx += tokens[tokenMatch].length;
continue;
}
The Array#findIndex
method will iterate over the tokens
array, and we can test to see if inStr
(at the current idx
) starts with the token we’re checking. With our simplified token list, we’ll do something like this (let’s imagine inStr.substr(idx)===”PRINT”
:
token |
.startsWith(token)? |
Index |
---|---|---|
CLS |
false |
0 |
|
true |
1 |
Note: Like indexOf
in JS, if nothing is found, we’d get a -1
for our troubles.
If we find a match, we can store the index in our return array. But how can we later know the difference between a token and a character? Easy: turn the high bit on, and we can do this by adding 128 to the token value.
(Note: if we needed to have more than 128 tokens, we’d have to use two bytes for some tokens. This makes things a little bit more complicated, but not by much. Some BASICs do this—various flavors of Microsoft BASIC, for example)
So, we’ve done the work in JavaScript, but how in the world can we do this in assembly language?
Well, it turns out, we do it pretty much the same way, but it gets a lot more wordy.
search-keywords:
bl := [d, x] # get first character of current token
cmp bl, constants.NUL # if it's NUL, we've exhausted the list
brs Z exit-keyword-search # so we're clearly not a keyword
clr Z # compare strings, but with partial equality
call [vectors.STRCMP] # so that our source doesn't need NUL between
# tokens; c will now be how many chars got
# compared
if !Z {
do { # no match; advance past our token in the list
inc x # character by character
bl := [d, x] # tokens are terminated with NULs
cmp bl, constants.NUL # so if we see it, we can stop
} while !z
inc x # go past the token #
inc x # in the lookup table
brs search-keywords # and keep looking for a match
}
clr c # found the match!
add x, c # c is the length of the match
inc x # past the NUL
bl := [d, x] # bl is now the token #
call _get-target-index
<bp+target>,y := bl # write the token #
call _adv-target-index
call _get-source-index # advance past the token in the source
clr c
add y, c # by adding the character count
dec y # back off by one (we'll advance again later)
call _set-source-index
continue
Okay, that doesn’t look too bad. It’s pretty much the exact same algorithm except that our token table in assembly language is structured a little differently. Namely, it looks something like this:
CLS 00 80
PRINT 00 81
: 00 82
Each keyword is NUL-terminated, and then is followed by the token’s number.
We’re leaving something important out here, though — that is, how in the world did we do the string comparison? Retroputer’s kernel has a STRCMP
routine which we can use, but what does that look like?
strcmp: {
enter 0x00
push a
push b
push d
push y
pushf
if Z {
bl := 0x00 # Checking for full equality
} else {
bl := 0x01 # only checking for partial equality
}
_main:
y := 0 # start of string
top:
cl := [d, x, y] # character in string A
al := <bp+4>,y # character in string B
cmp bl, 0x01 # check if we're doing full equality
if Z {
cmp cl, 0 # we're not, so check for an early nul
# in string b
brs Z strings-are-equal # if it's NUL, we calling them equal
}
cmp cl, al # check character
if Z {
cmp cl, 0 # equal, but check for NUL
brs Z strings-are-equal # NUL reached, strings are equal
inc y # next character
brs top # not NUL, so keep going...
}
# if here, the strings aren't equal
if N {
popf # string is less than
set N
clr Z
brs _out
} else {
popf # string is greater than
clr N
clr Z
brs _out
}
strings-are-equal:
popf
clr N # Not less than
set Z # but Equal
_out:
c := y # make sure we know how many chars
# were compared
pop y
pop d
pop b
pop a
exit 0x00
ret
}
I don’t know about you, but I’m loving JS’s String#startsWith
method more and more. It does the same thing, granted, but I didn’t have to write its internal logic!
Handling Variables
We could be done at this point — the work of crunching our keywords is complete. Retroputer BASIC does one more step, which is to tokenize the variables. I believe only very few BASICs of the ‘80s and ‘90s did this because it could actually hurt in limited memory conditions. But if you have lots of memory, tokenizing your variables can help with performance.
Here’s what Retroputer BASIC does:
- It reads up to the first two characters of the variable name. (This was a common affectation of BASICs of the time due to memory constraints.)
- From these two characters, it determines a variable index. “A” is variable zero, “A0” is variable 53, etc. The equation isn’t difficult, but not really the point here.
- BASIC continues to scan for the variable’s type sigil. In BASIC,
$
signifies a string variable, for example. The variable type is stored in a couple of bits high in the variable index. - BASIC then writes the type and index to the output, and then writes the variable name’s length in addition to the variable name itself. This is where we lose space savings!
(Note: when Retroputer BASIC can allocate variables dynamically, the index will be replaced with a pointer to the variable instead. Another thing on the todo list!)
This makes variable lookup during execution extremely fast: you don’t have to parse the variable name and calculate the index each time you encounter the variable. In a tight loop, that savings can add up. But it also comes at a huge cost: we have to store the pointer and the variable name together, and so we tack on an extra four bytes in the output for every variable we see.
It’s up to you to decide if that’s worth it.
Regardless, determining if what’s left is indeed a variable in our character stream is also easy in JavaScript:
const variableMatch = inStr.substr(idx).match(/^[A-Z]+[A-Z0-9]*[\$]*/);
if (variableMatch) {
const variableName = variableMatch[0];
idx += variableName.length;
// tokenize it (omitted; nothing new here)
continue;
}
I’m not going to go into the code Retroputer does at this point to tokenize variables — it’s very wordy, but not particularly interesting… yet. When I add dynamic variable allocation, I’ll revisit this.
Tokenization Complete
Now, if you test our JS code, you’ll get an array of bytes similar to what Retroputer BASIC uses internally:
console.log(toHex(tokenize(`CLS: PRINT "Hello, World"`)));
// 80 82 20 81 20 FB 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 00 00
Wow — so that was a lot of work for a few bytes worth of savings. But if you’ve only got a couple kilobytes of free memory, it’s worth it! But that’s not the only benefit we get from crunching the user’s input. We get fast execution times as well.
To explain that, we’re going to need to look at how Retroputer executes code. I’m not going into all the details yet, but the short story is that we can do the following to execute code:
- Get a byte from memory
- If the byte has the high bit set, it’s a token — otherwise it’s a syntax error (or NUL — in which case, we’re done with the program line)
- Look the token’s handler up in an array — this array has pointers to the actual functions that operate on the token
- Call the handler (the handler is responsible for consuming arguments and the like)
- Repeat
This is the other reason why we tokenize keywords — it becomes trivial to execute the logic for the keyword simply by looking up an address in an array and calling it.
We’ll cover the execution aspect in more details, but I wanted to note that although tokenization is important for space savings, it also improves the execution speed.
For example, in JS, an execution loop could be something like this:
const handlers = [ cls, print, endOfStmt ];
bytes.forEach(byte => handlers[byte] && handlers[byte]());
Of course it’s not quite that simple — there’s a lot more that goes into it, but that’s for another post!
Resources
- Full JavaScript code for this post
- Retroputer Source (at this point in history)
- List of resources I’ve been using for this project (from the redux to part 1 of this series)
Revisions
- August 11, 2020: Added a link to the redux to part 1 of this series, which has more resources that I’ve been using.
- August 9, 2020: Added some links to resources
My cousin was telling me about Mostbet, and I got curious, so I wanted to find out more. That’s when I found https://casinos-india.net/mostbet/ . This site goes in-depth about Mostbet’s casino section, explaining everything from bonuses to game variety. I appreciated the part where they break down how to claim the welcome bonus and use different payment options. It's super helpful, especially if you're new to online casinos.
Hello, I want to share with you my opinion on Ovechkin's incredible achievement. Mostbet online is a platform where you can easily follow such outstanding events in the world of sports. When I found out that Ovechkin had scored 850 goals for one NHL club, it just blew me away. This record shows how talented and persistent he is. It's hard to imagine that even the legendary Gretzky couldn't achieve this result. This achievement is inspiring and shows that hard work and passion for your work can lead to incredible heights. I highly recommend this article!
Hi Kerri,
I enjoyed reading your entries re: building a BASIC interpreter, as I am working on one myself (6502 processor for the Atari 7800). I have a question: regarding tokenization, how do you distinguish your converted integers that have a value of zero vs a NUL to end your line?
Looking forward to Part 4?. Very much appreciating the single steps you are going through. It's filled in a couple of comprehension holes for me.