I wrote my own lexer

April 21, 2023

For my handlebars-ng project, I was looking for a lexer, and I found the moo-lexer. The simplicity of its configuration and the fact that it has zero dependencies convinced me to use it.

Creating a tokenizer is really simple:

import moo from "moo"

const lexer = moo.compile({ ALPHA: /[A-Za-z]/, DIGIT: /\d/, COLON: ":" })
lexer.reset("ab1:2")
const tokens = [...lexer]

/*
tokens = [
  { type: "ALPHA", value: "a", text: "a", offset: 0, lineBreaks: 0, line: 1, col: 1 },
  { type: "ALPHA", value: "b", text: "b", offset: 1, lineBreaks: 0, line: 1, col: 2 },
  { type: "DIGIT", value: "1", text: "1", offset: 2, lineBreaks: 0, line: 1, col: 3 },
  { type: "COLON", value: ":", text: ":", offset: 3, lineBreaks: 0, line: 1, col: 4 },
  { type: "DIGIT", value: "2", text: "2", offset: 4, lineBreaks: 0, line: 1, col: 5 }
]
*/

You can have more complicated rules as well, and multiple states.

const lexer = moo.states({
  main: {
    OPEN: { match: "(", push: "footnote" },
    CONTENT: { fallback: true },
  },
  footnote: {
    CLOSE: { match: ")", pop: 1 },
    DIGIT: /\d/,
  },
})
lexer.reset(
  "There are 4 sides of the history (1): My side, your side, the truth and what really happened. (2)"
)
const tokens = [...lexer]

Only the rules of the current state apply, and the fallback rule matches everything between matches of other rules. In the above example, the 4 is not recognized as a DIGIT, because it is not enclosed in parentheses.
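
Roughly, the example produces the following sequence of tokens (shown here with type and text only):

/*
CONTENT "There are 4 sides of the history "
OPEN    "("
DIGIT   "1"
CLOSE   ")"
CONTENT ": My side, your side, the truth and what really happened. "
OPEN    "("
DIGIT   "2"
CLOSE   ")"
*/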

All this suited my Handlebars rewrite well, so I started using moo and I was happy…

… until I got to the point where I needed something that moo didn’t offer

Look-ahead

Some tokenizer rules of the original Handlebars project have a look-ahead parameter. For example,

const rule = { regex: /ab/, lookahead: /c/ }

means that the rule should match ab, but only if it is followed by c, and that the c should not be part of the match. There is no lookahead option in moo.

Spoiler: moo does not need a look-ahead option, because you can use assertion syntax in regular expressions: ab(?=c) would have solved my problem, and I would never have stepped into the following venture. But I only found out about that when I tried to implement lookahead in my own lexer.
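
Just for illustration, a minimal sketch of how the assertion would look in a moo rule:

const lexer = moo.compile({
  AB: /ab(?=c)/, // matches "ab", but only if it is followed by "c"
  C: "c",
})
lexer.reset("abc")
const tokens = [...lexer]
// => AB ("ab") at offset 0, then C ("c") at offset 2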

So I looked at moo’s code, and I didn’t like it too much…

One big file

At the time of this writing, the moo-lexer consists of a single JavaScript file which is 642 lines long. It starts with utility functions, followed by functions that compile the config into a lexer, followed by the Lexer class itself at line 399.

It contains some very large functions: the compileRule function is 110 lines long.

The _token function is not as long as compileRule, but it does a variety of things.

Having seen a lot of talks about Clean Code, SOLID principles and code architecture lately, I thought: “Maybe I should just try and write my own moo, use it to practice test-driven development, and see if I can apply all these principles?” More importantly, I wanted to find out how the focus on clean and readable code would affect performance and bundle size.

One difference between my project and moo certainly was that I didn’t have to care about legacy browsers. I wanted to use the tooling that I usually use: TypeScript, vite and vitest.

I also didn’t implement everything that moo is able to do, so the size comparison may be a bit flawed.

How does moo work?

Let’s start with the simple configuration:

moo.compile({ ALPHA: /[A-Za-z]/, DIGIT: /\d/, COLON: ":", COMA: "," })

Regex matching

What moo essentially does here is combine all rules into one large regex, similar to

const regex = /([A-Za-z])|(\d)|(:)|(,)/y

The y (sticky) flag makes sure that regex.exec() does not match anywhere later in the string, but only exactly at regex.lastIndex.
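
A quick illustration:

const sticky = /\d/y

sticky.lastIndex = 1
sticky.exec("a1") // matches "1", because a digit sits exactly at index 1

sticky.lastIndex = 0
sticky.exec("a1") // null: there is no digit at index 0, and "y" does not search ahead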

It then inspects the capturing groups of the result to find out which of the rules actually matched.

If the regex matches, for example, a 0, the result of regex.exec() is ["0", undefined, "0", undefined, undefined].

  • Whole match: 0
  • First group ([A-Za-z]): undefined
  • Second group (\d): 0

Ah, so the match was due to the DIGIT rule.
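
A minimal sketch of this group-inspection idea (my own illustration, not moo’s actual code):

const ruleNames = ["ALPHA", "DIGIT", "COLON", "COMA"]
const regex = /([A-Za-z])|(\d)|(:)|(,)/y

function matchAt(input: string, offset: number) {
  regex.lastIndex = offset
  const result = regex.exec(input)
  if (result == null) return null
  // The first non-empty capturing group tells us which rule matched.
  const groupIndex = result.findIndex((group, index) => index > 0 && group != null)
  return { type: ruleNames[groupIndex - 1], text: result[0], offset }
}

matchAt("ab1:2", 2) // => { type: "DIGIT", text: "1", offset: 2 }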

Fallback rules

If a fallback rule is configured, the regex gets the g flag instead of y, which also finds matches further ahead in the string. If there is a “gap” between the last and the current match, a fallback token is created for the gap; the current match is queued and returned by the following call to next().
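
The mechanism can be sketched like this (an illustration, not moo’s actual code):

// Three normal rules, compiled into one regex with the "g" flag:
const regex = /(\()|(\))|(\d)/g

function* tokenize(input: string) {
  let lastIndex = 0
  let result: RegExpExecArray | null
  while ((result = regex.exec(input)) != null) {
    if (result.index > lastIndex) {
      // The gap between the previous and the current match
      // becomes a fallback token.
      yield { type: "CONTENT", text: input.slice(lastIndex, result.index) }
    }
    yield { type: "MATCH", text: result[0] }
    lastIndex = regex.lastIndex
  }
  if (lastIndex < input.length) {
    yield { type: "CONTENT", text: input.slice(lastIndex) }
  }
}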

Fast matching

There is a second strategy for matching tokens: the rules for COLON and COMA each consist of a single character, and there is no fallback rule in our configuration, so moo can use a much simpler and faster algorithm before running the regex:

At compile time, it creates an array where each rule is stored at the index of its string’s char code:

const fastLookup = []
fastLookup[":".charCodeAt(0)] = colonRule
fastLookup[",".charCodeAt(0)] = comaRule

In order to find the rule matching the current char, we just need to look into the array:

const matchingRule = fastLookup[string.charCodeAt(offset)]

This resolves all single-char rules with a single array-lookup and is very fast.

Position tracking

For each match, moo inspects the matched text, counts line breaks and measures the length of the last line. This allows it to keep track of the current position. The line-break count and the current offset become part of the token.
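
A sketch of that bookkeeping (an illustration, not moo’s actual code):

interface Position {
  line: number
  column: number
}

function advance(position: Position, text: string): Position {
  const lineBreaks = text.split("\n").length - 1
  if (lineBreaks === 0) {
    return { line: position.line, column: position.column + text.length }
  }
  // After a line break, the column restarts behind the last "\n".
  const lastLineLength = text.length - text.lastIndexOf("\n") - 1
  return { line: position.line + lineBreaks, column: lastLineLength + 1 }
}

advance({ line: 1, column: 1 }, "ab\ncd") // => { line: 2, column: 3 }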

At this point, I should mention that moo’s token format was not exactly what I needed: I needed the “start” and “end” positions of a token, not the number of line breaks. So I decided that my lexer would return my required token format instead of moo’s.

My own lexer

Since moo obviously comes from the cow, I decided to sheepishly name my lexer baa-lexer. I actually wrote the lexer three times as an exercise. Internally, it now works with the following data structures:

  • BaaToken is my token format. It contains { type, original, value, start, end }, where start and end both contain line and column. It resembles the format used by Handlebars.
  • We include the token type in the Rule object, so that at “runtime” the code only has to deal with rules of the form { tokenType, match?, push?, pop?, next?, value? } and not with their shorthand forms.
  • In addition to the “token” itself, we introduce a lightweight Match object, which contains the fields: { rule, text, offset }. That way, we can separate the code that finds a Match from the code that actually creates the token and performs the location tracking.
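
As a rough sketch (paraphrased from the description above, not copied from the baa-lexer source), the shapes look like this:

interface Location {
  line: number
  column: number
}

interface BaaToken {
  type: string
  original: string
  value: string
  start: Location
  end: Location
}

interface Rule {
  tokenType: string
  match?: RegExp | string
  push?: string
  pop?: number
  next?: string
  value?: (text: string) => string // assumption: a value transform, as in moo
}

interface Match {
  rule: Rule
  text: string
  offset: number
}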

The core of baa consists of interfaces for various components:

  • Matcher has a match method that looks for a match in a given string at a given offset and returns just the offset within the string, the matching rule and the text. The default implementation is a composition of the RegexMatcher and the StickySingleCharMatcher, which implement the two matching strategies described above.
  • StateProcessor processes matches in a given State. There is a default implementation that uses a Matcher to find the next match. Between last and current Match it creates a fallback match if necessary, queuing the current match as “pending”.
  • TokenFactory keeps track of the current location (line, column) and has a method to create a token from a match. There is a default implementation for that as well, and it creates BaaTokens.
  • Lexer has a lex(string) function that returns an IterableIterator<BaaToken>. The default implementation uses TokenFactory and StateProcessor instances to tokenize the text. Its main responsibility is to keep track of state changes and call the correct StateProcessor.
  • The baa function creates all necessary instances of those classes and connects them.
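
Paraphrased as TypeScript interfaces (a sketch under my reading of the description, not the actual source):

interface Matcher {
  match(input: string, offset: number): Match | undefined
}

interface StateProcessor {
  process(input: string, offset: number): Match
}

interface TokenFactory {
  createToken(match: Match): BaaToken
}

interface Lexer {
  lex(input: string): IterableIterator<BaaToken>
}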

Notice how I always write: “There is a default implementation”. In theory, there can be custom implementations of each of those modules. You want to use a custom algorithm for finding matches? Write your own matcher. You want to process a state differently? Write your own state processor.

It is not yet possible to inject a custom TokenFactory, but that is because the TypeScript typings would become a lot more complicated, and I just don’t need this feature. It could be done, though, without affecting e.g. the matching algorithm.

My source code now consists of about 12 files, each with its own responsibility. In my opinion, it is easy to understand and to change.

Bundle size and benchmarks

You might ask: So many classes, what about the bundle size? What about performance?

Since I didn’t implement all of moo’s features, the bundle-size question may be a bit unfair. moo claims to be “4KB minified + gzipped”; baa is currently at about 2.2KB… I don’t think that implementing the remaining features would take up another 1.8KB, so I think this is fine.

Performance measurements also show that my implementation is slightly faster than moo in most cases, but this is not a conclusive test either.

 BENCH  Summary

  moo - performance/moo-baa.bench.ts > moo-baa test: './tests/abab.ts' (+0)
    1.07x faster than baa

  baa - performance/moo-baa.bench.ts > moo-baa test: './tests/fallback.ts' (+0)
    1.19x faster than moo

  baa - performance/moo-baa.bench.ts > moo-baa test: './tests/handlears-ng.ts' (+0)
    1.50x faster than moo

  baa - performance/moo-baa.bench.ts > moo-baa test: './tests/handlears-ng.ts' (1)
    1.25x faster than moo

  baa - performance/moo-baa.bench.ts > moo-baa test: './tests/handlears-ng.ts' (2)
    1.19x faster than moo

  baa - performance/moo-baa.bench.ts > moo-baa test: './tests/json-regex.ts' (+0)
    1.15x faster than moo

  moo - performance/moo-baa.bench.ts > moo-baa test: './tests/json.ts' (+0)
    1.04x faster than baa

TypeScript typings

What I really like is getting support from the TypeScript compiler. This may not be perfect yet, but with the baa-lexer typings, you actually get auto-completion and validation for token types and state names in many places.
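
As an illustration of the idea (an assumption, not the actual baa-lexer typings): string-literal unions let the compiler validate token types and state names.

type TokenType = "OPEN" | "CLOSE" | "DIGIT" | "CONTENT"
type StateName = "main" | "footnote"

interface TypedRule<S extends string> {
  match?: RegExp | string
  push?: S
  pop?: number
  fallback?: boolean
}

type LexerConfig<T extends string, S extends string> = {
  [state in S]: { [token in T]?: TypedRule<S> }
}

const config: LexerConfig<TokenType, StateName> = {
  main: {
    OPEN: { match: "(", push: "footnote" }, // "footnote" must be a known state
    CONTENT: { fallback: true },
  },
  footnote: {
    CLOSE: { match: ")", pop: 1 },
    DIGIT: { match: /\d/ },
  },
}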

The good thing is that typings don’t add to the bundle size for browsers.

Conclusion

I have watched a lot of videos lately about SOLID principles, clean architecture and design, and TDD. I tried to apply those principles when implementing baa-lexer, and I think that went quite well.

In the next posts I will talk a bit about what exactly I did to improve performance and reduce the bundle size of the project.


Written by Nils Knappmeier who has been a full-time web-developer since 2006, programming since he was 10 years old.