Spark SQL persistent view over jdbc data source


I want to create a persistent (global) view in Spark SQL that gets its data from an underlying JDBC database connection. It works fine when I use a temporary (session-scoped) view, as shown below, but fails when I try to create a regular (persistent, global) view.

I don't understand why the latter should not work, and I couldn't find any docs or hints on it; all the examples use temporary views. Technically I cannot see why it shouldn't work either: the data is properly retrieved from the JDBC source in the temporary view, so it should not matter if I "store" the query in a persistent view so that every call to the view retrieves the data directly from the JDBC source.

Config.

tbl_in = 'myjdbctable'
tbl_out = 'myview'
db_user = 'myuser'
db_pw = 'mypw'
jdbc_url = 'jdbc:sqlserver://myserver.domain:1433;database=mydb'

This works.

query = f"""
create or replace temporary view {tbl_out}
using jdbc
options(
  dbtable '{tbl_in}',
  user '{db_user}',
  password '{db_pw}',
  url '{jdbc_url}'
  )
"""

spark.sql(query)

> DataFrame[]

This does not work.

query = f"""
create or replace view {tbl_out}
using jdbc
options(
  dbtable '{tbl_in}',
  user '{db_user}',
  password '{db_pw}',
  url '{jdbc_url}'
  )
"""

spark.sql(query)

> ParseException:

Error.

ParseException: 
mismatched input 'using' expecting {'(', 'UP_TO_DATE', 'AS', 'COMMENT', 'PARTITIONED', 'TBLPROPERTIES'}(line 3, pos 0)

== SQL ==

create or replace view myview
using jdbc
^^^
options(
  dbtable 'myjdbctable',
  user 'myuser',
  password '[REDACTED]',
  url 'jdbc:sqlserver://myserver.domain:1433;database=mydb'
  )

CodePudding user response:

TL;DR: A Spark SQL table over a JDBC source behaves like a view and so can be used like one.

It seems my assumptions about JDBC tables in Spark SQL were flawed. It turns out that a SQL table with a JDBC source (i.e. created via using jdbc) is actually a live query against the JDBC source, and not a one-off JDBC query at table-creation time as I had assumed. In my mind it therefore behaves like a view: if the underlying JDBC source changes (e.g. new entries in a column), this is reflected in the Spark SQL table on read (e.g. a select from it) without having to re-create the table.
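For reference, a minimal sketch of what that looks like, reusing the names from the question above; it is the same using jdbc options block, only with create table instead of create or replace temporary view:

query = f"""
create table if not exists {tbl_out}
using jdbc
options(
  dbtable '{tbl_in}',
  user '{db_user}',
  password '{db_pw}',
  url '{jdbc_url}'
  )
"""

spark.sql(query)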

It follows that the Spark SQL table over a JDBC source satisfies my requirement of having an always up-to-date reflection of the underlying table/SQL object in the JDBC source. Usually I would use a view for that. Maybe this is the reason why there is no persistent view over a JDBC source but only temporary views (which of course still make sense, as they are session-scoped). It should be noted, however, that the Spark SQL JDBC table does not behave like a view in every respect, which may be surprising: its schema is fixed at creation time. In particular (see the re-creation sketch after this list):

  • if you add a column to the underlying JDBC table, it will not show up in the Spark SQL table
  • if you remove a column from the underlying JDBC table, an error will occur when the Spark SQL table is accessed (assuming the removed column was present during Spark SQL table creation)
  • if you remove the underlying JDBC table, an error will occur when the Spark SQL table is accessed
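To pick up such schema changes, the Spark SQL table has to be dropped and re-created; a minimal sketch, reusing the create-table query shown above:

# Drop and re-create the table so that the column list is read again
# from the JDBC source (the schema is fixed at creation time).
spark.sql(f"drop table if exists {tbl_out}")
spark.sql(query)  # the create table ... using jdbc query shown above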

CodePudding user response:

The input of spark.sql should be DML (Data Manipulation Language). Its output is a dataframe.

In terms of best practices, you should avoid using DDL (Data Definition Language) with spark.sql. Even if some statements happen to work, the API is not meant to be used that way.
If you want to run DDL, simply connect to your database using a Python package instead.

If you want to create a temp view in Spark, do it using the DataFrame method createTempView, as sketched below.
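A minimal sketch of that approach, reusing the connection details from the question (createOrReplaceTempView is used here instead of createTempView so that re-running the snippet does not fail):

# Read the JDBC table into a DataFrame; the query against the source is
# executed whenever the DataFrame (or the view below) is read.
df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", tbl_in)
    .option("user", db_user)
    .option("password", db_pw)
    .load()
)

# Register the DataFrame as a session-scoped temporary view.
df.createOrReplaceTempView(tbl_out)

spark.sql(f"select * from {tbl_out}").show()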
