Home > Net >  Push a SQL query to a server from JDBC connection which reads from multiple databases within that se
Push a SQL query to a server from JDBC connection which reads from multiple databases within that se

Time:04-27

I'm pushing a query down to a server to read data into Databricks as below:

val jdbcUsername = dbutils.secrets.get(scope = "", key = "")
val jdbcPassword = dbutils.secrets.get(scope = "", key = "")
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")


val jdbcHostname = "" 
val jdbcPort = ...
val jdbcDatabase = ""

// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"

// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()

connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")

val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)


// define a query to be passed to database to display the tables available for a given DB
val query_results = "(SELECT * FROM INFORMATION_SCHEMA.TABLES) as tables"

// push the query down to the server to retrieve the list of available tables
val table_names = spark.read.jdbc(jdbcUrl, query_results, connectionProperties)
table_names.createOrReplaceTempView("table_names")

Running display(table_names) would provide a list of tables for a given defined database. This is no issue, however when trying to read and join tables from multiple databases in the same server I havent yet found a solution that works.

An example would be:

// define a query to be passed to database to display a result across many tables
val report1_results = "(SELECT a.Field1, b.Field2 FROM database_1 as a left join database_2 as b on a.Field4 == b.Field8) as report1"

// push the query down to the server to retrieve the query result
val report1_results = spark.read.jdbc(jdbcUrl, report1_results, connectionProperties)
report1_results .createOrReplaceTempView("report1_results")

Any pointers appreciated wrt to restructuring this code (equivalent in Python would also be super helpful).

CodePudding user response:

SQL Server uses 3-part naming like database.schema.table. This example comes from the SQL Server information_schema docs:

SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLUMN_DEFAULT
FROM AdventureWorks2012.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = N'Product';

To query across databases you need to specify all 3 parts in the query being pushed down to SQL Server.

SELECT a.Field1, b.Field2 
FROM      database_1.schema_1.table_1 as a 
LEFT JOIN database_2.schema_2.table_2 as b 
       on a.Field4 == b.Field8 
  • Related