Query Parser And Converter tool
With big-data gaining traction , companies desire to save every inch of their data in a database of their choice , with the old legacy databases not being able to hold such high volumes of data, moving data from a legacy database to a modern warehousing solution is a common ask .
Applications which are very tightly coupled with their databases often find it difficult to integrate with a different database overnight , one challenge being thousands of queries present in the application code which won’t syntactically work with the new database.
On encountering such a situation in our daily jobs, we thought of coming up with a generic library which could parse any given query written in any language into a structured syntax tree (SST) , this structured representation could then be consumed by other pieces of code /libraries which enforce syntaxes of another language on the SST and thus get the query converted to a different language.
Our use-case was converting queries from the Teradata to Bigquery syntax.
Idea was to develop two modules the Queryparser and the Querybuilder.
Parser had to understand the Teradata ql and prepare the SST , the Builder would then consume the SST and convert the query to Bql.
Apart from query conversion , the SST representation can be used for other purposes like :-
- query mining (extract useful information from a query string).
- syntax checking.
To develop the parser we used the “moving cursor” approach , in this approach the query considered as a string is read from index 0 and is classified and broken down into a series of tokens and each token is classified as an operator /operand/a keyword / table-name/column-name etc based on the regex pattern it matches with , would then be attached to the SST.
Let’s take a small example of the how the parsing/conversion works :
Query : select emp.first_name,emp.last_name,dept.department_name from employees emp inner join departments dept on emp.dept_id=dept.dept_id where emp.emp_id=37
In this case parser starts from index0 “select” is a keyword so parser consumes it and looks ahead for the column names , emp.first_name,emp.last_name,dept.department_name (these tokens match the regular expression for a column type, so these tokens are consumed as columns with column type as entity , other column types defined by the parser are literal,function,expression etc each having their own regex definitions)
On encountering from clause , parser looks for the regex for tablename/nested select expressions in the succeeding tokens which optionally could be followed by some joining conditions and keys, these are then parsed and attached to SST,on encountering where clause , parser looks for regex of conditional expressions and binds them to the SST.
To summarize the parsing logic , a query is looked as a set of keywords followed by group of expressions in a defined order .An expression could be a literal string , a column entity, a function , or a nested selected expression .
At any stage if the sequence of the tokens or the order of expressions don’t make any sense to the parser from a syntactical perview, the parser terminates with a exitcode 1.
Finally , the SST representation of the above query would look like :
So far we have parsed and converted the query to a structured tree , but we aren’t done it ,a value add would be in making use of this converted structure.
The queryBuilder module consumes this SST applies syntactical translation functions which is driven by a syntax mapper and final output is the converted query in Bigquery.
A working prototype of the Teradata to Bigquery conversion tool could be found here: https://github.com/2509soumya/TD2BQ
Follow us on github
Soumya Prateek Raul (https://github.com/2509soumya)
Samarth Sah (https://github.com/Samarth08)