Writing Search Query Lexer and Parser Using State Machine

I can't believe I wrote this thing without Regex...

By Muhammad Rizqi Ardiansyah on 2024-08-02

So I have been developing a product in my own spare time lately, it's called Tagbox. It's a bookmarking application similar to Zotero, Raindrop.io, or Pocket.

As the name of the application implies a bit, the application can categorizes each links using tags. This way, the user can search for links easily. Yeah, search functionality. That's when I thought: how do users search using tags?

Of course, by typing the tags as search query. But then, my brain goes further: what if user want some logic in the search. For example, user wants to search for links that have the tags video and music but want to exclude the tag article? Hmmm.

That's when I thought: why not I make my own syntax for searching?

This is what I came up with:

Syntax of the search query

Demo Using WASM

Here is the demo of the parser if you're curious.

So let's start implementing it. I will use Go for the language, as that is the language I'm using for Tagbox.

Rob Pike's talk about lexer

I will be following along a talk by Rob Pike's from more than a decade ago. Here is the video and here is the slides.

Setting some constants

Using the Iota feature from Go and additionally some type aliasing, we can define some constants that act like an "enum" for the type of value--like text, whitespace, end-of-file (EOF), folder name, etc.

type itemType int

type item struct {
	typ itemType
	val string
}

const (
	itemError itemType = iota

	itemEOF
	itemWhitespace
	itemText
	itemFolderNameMeta
	itemFolderNameIdentifier
	itemContentSearchWords
	itemTagsLeftMeta
	itemTagsDivider
	itemTagOperator
	itemTagName
	itemTagsRightMeta
)

We will also define some constant of characters like slash or bracket:

const whitespace = " "
const folderNameMeta = "/"
const tagsLeftMeta = "["
const tagsDivider = " "
const tagsRightMeta = "]"
const tagNegationOperator = "-"
const tagJoinOperator = "+"
const eof = -1

Lexer

Next, I will define a struct for lexer.

type lexer struct {
	name  string
	input string
	start int
	pos   int
	width int
	items []item
}

The struct contains the input string, position of the lexer is working on, and the result in the field items. Okay, I will admit--the field items was supposed to be a channel to enable concurrency like Rob Pike stated, but... I wasn't able to implement them, so for now I will choose the more naive method. Besides, I'm only lexing for relatively short search query, not designing a template engine for Go.

Methods for lexer

Next we will write some helpful methods for the lexer struct.

func (l *lexer) emit(t itemType) {
	l.items = append(l.items, item{t, l.input[l.start:l.pos]})
	l.start = l.pos
}

emit will append item that contains lexed string and the type to the result slice.

func (l *lexer) next() (r rune) {
	if l.pos >= len(l.input) {
		l.width = 0
		return rune(eof)
	}
	r, l.width = utf8.DecodeRuneInString(l.input[l.pos:])
	l.pos += l.width
	return r
}

next will return the next character as rune type. It will also return eof constant if the position reached the end. This can be great for error handling

State machine

Turns out implementing state machines in Go is quite simple. The states are represented as a function. A lexer struct will be received as the argument of the function and will return the next state. If returns nil, then it's the end of the state machine.

type stateFn func(*lexer) stateFn

Running the state machine

Other method lexer has is run which will start the state machine. This will run a for loop until state machine returns nil.

func (l *lexer) run() {
	for state := lexText; state != nil; {
		state = state(l)
	}
}

State function

Okay, prepare for the spaghetti code.

The state machine starts with lexText state, which means a plain text.

If the current position of the string starts with / (slash) it will mean the next state will be the symbol that starts before a folder name (the lexFolderNameMeta function which will be followed by lexFolderNameIdentifier).

However if the current position starts with [ (open square bracket) it will mean the next state will be tags.

func lexText(l *lexer) stateFn {
	for {
		if strings.HasPrefix(l.input[l.pos:], folderNameMeta) {
			if l.pos > l.start {
				l.emit(itemText)
			}
			return lexFolderNameMeta
		}
		if strings.HasPrefix(l.input[l.pos:], tagsLeftMeta) {
			if l.pos > l.start {
				l.emit(itemText)
			}
			return lexTagsLeftMeta
		}
		if l.next() == eof {
			break
		}
	}
	if l.pos > l.start {
		l.emit(itemText)
	}
	l.emit(itemEOF)
	return nil
}

func lexFolderNameMeta(l *lexer) stateFn {
	l.pos++
	l.emit(itemFolderNameMeta)
	return lexFolderNameIdentifier
}

func lexFolderNameIdentifier(l *lexer) stateFn {
	for {
		if strings.HasPrefix(l.input[l.pos:], whitespace) {
			l.emit(itemFolderNameIdentifier)
			return lexText
		}
		r := l.next()
		if r == eof {
			break
		}
	}

	l.emit(itemFolderNameIdentifier)
	l.emit(itemEOF)
	return nil
}

func lexTagsLeftMeta(l *lexer) stateFn {
	l.pos++
	l.emit(itemTagsLeftMeta)
	return lexTagName
}

func lexTagName(l *lexer) stateFn {
	for {
		if strings.HasPrefix(l.input[l.pos:], tagsDivider) {
			l.emit(itemTagName)
			return lexTagsDivider
		}
		if strings.HasPrefix(l.input[l.pos:], tagsRightMeta) {
			l.emit(itemTagName)
			return lexTagsRightMeta
		}
		r := l.next()
		if r == eof {
			break
		}
	}

	l.emit(itemEOF)
	return nil
}

func lexTagsDivider(l *lexer) stateFn {
	l.pos++
	l.emit(itemTagsDivider)
	return lexTagName
}

func lexTagsRightMeta(l *lexer) stateFn {
	l.pos++
	l.emit(itemTagsRightMeta)
	return nil
}