Home > Back-end >  ZetaSQL - Parsing Capabilities and Functionalities
ZetaSQL - Parsing Capabilities and Functionalities

Time:08-24

I am currently working on a lineage system that will be deployed in our google-cloud space, the goal is to extract and parse SQL queries executed from BigQuery using audit logs and create a lineage out of those. I explored a couple of existing products but eventually decided to build my own.

I came across the Google's POC and documentation (https://cloud.google.com/architecture/building-a-bigquery-data-lineage-solution) for data-lineage which led me to explore the usage of zetaSQL.

I was able to parse simple queries and extract referenced tables and output columns from it using Analyzer.extractTableNamesFromStatement(query) and Analyzer.analyzeStatement(query, analyzerOptions, simpleCatalog. However, I was wondering if zetaSQL is capable of achieving these things below, and if so, It would be really helpful if you can give me some examples or path to take.

  • Obtain source columns from ResolvedNodes?
  • if the source table being referenced in the SQL is a derived table, can it also cascade the lineage parsing of it?

additionally, I was also exploring the usage of SimpleCatalog and I was wondering.

  • Why does zetaSQL needs a registration of catalog of tables when the reference tables are already available from the SQL queries? I kind of have a clue it does a lot of checks in background but what are those checks?
  • are those registered catalogs only available at runtime and then wiped out after the job terminates? does it need to register all tables referenced in a SQL or is there a way to cache it somehow?

Thank you.

CodePudding user response:

I'll answer my own question for bullet number 1, it seems it is depending on the type of node. for example, if you are trying to obtain lineage on the functions applied on a column, We could use ResolvedFunctionCall to obtain transformations applied and from there you can do a visit on node ResolvedColumnRef which enables you to get source column for that expression.

  • Related