Extended SQL Parser

This documentation explains the technical decision made in building a SQL parser in Go. It is used to parsed the extended SELECT syntax of SQL that integrates TensorFlow Estimators.

Lexer and Parser Generator

In 2001, when I was in graduate school, I defined an extended SQL syntax for querying distributed relational databases, as part of the course project of Distributed RDBMS by Prof. Li-zhu Zhou. I wrote the parser using bison (a modern version of yacc) and flex (a modern version of lex). yacc and lex generate C code; bison and flex generate C++ code. However, this time, I’d use Go.

I surveyed goyacc, a standard Go tool. The usage is very similar to that of yacc and bison. However, the Go toolchain doesn’t provide a tool like lex/flex.

Google revealed golex, which is out of maintenance.

The Medium post recommends Ragel, which is a C++ program and could generate Go lexer; however, it lacks documents.

Handwritten Lexer and Parser

Some documents, including this one recommends handwriting lexers. However, it doesn’t explain how to write the parser.

GoAcademy always provides high-quality tech blog posts. This one is from the author of InfluxDB. However, I stopped at where it explains wrapping a SQL statement as a string by an io.Reader, because it is obvious that we should keep the string as a string so that that token strings could refer to the same memory storage of the SQL statement.

Following a link in the above GoAcademy post, I found Rob Pike’s excellent talk on how to write a lexer in Go in 2011. Many works after that change Rob’s implementation somehow but always lead to longer and less comprehensible codebases.

The Choice

Therefore, I wrote the lexer and parser both following Rob Pike’s idea. After few days work, I realized that:

  1. I should borrow the idea from Rob to represent SQL statements as strings, but not io.Reader as other work do,
  2. but no need to use channels and goroutines at all, and
  3. it is technically intractable to write a SQL lexer/parser manually.

So, I switched to write a lexer manually, and to generate the parser using goyacc. During my work, I referred to this example and the official yacc manual for details about operator association and precedence.