I'm pushing a query down to a server to read data into Databricks as below:
val jdbcUsername = dbutils.secrets.get(scope = "", key = "")
val jdbcPassword = dbutils.secrets.get(scope = "", key = "")
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
val jdbcHostname = ""
val jdbcPort = ...
val jdbcDatabase = ""
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
// define a query to be passed to database to display the tables available for a given DB
val query_results = "(SELECT * FROM INFORMATION_SCHEMA.TABLES) as tables"
// push the query down to the server to retrieve the list of available tables
val table_names = spark.read.jdbc(jdbcUrl, query_results, connectionProperties)
table_names.createOrReplaceTempView("table_names")
Running display(table_names) would provide a list of tables for a given defined database. This is no issue, however when trying to read and join tables from multiple databases in the same server I havent yet found a solution that works.
An example would be:
// define a query to be passed to database to display a result across many tables
val report1_results = "(SELECT a.Field1, b.Field2 FROM database_1 as a left join database_2 as b on a.Field4 == b.Field8) as report1"
// push the query down to the server to retrieve the query result
val report1_results = spark.read.jdbc(jdbcUrl, report1_results, connectionProperties)
report1_results .createOrReplaceTempView("report1_results")
Any pointers appreciated wrt to restructuring this code (equivalent in Python would also be super helpful).
CodePudding user response:
SQL Server uses 3-part naming like database.schema.table
. This example comes from the SQL Server information_schema
docs:
SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, COLUMN_DEFAULT
FROM AdventureWorks2012.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = N'Product';
To query across databases you need to specify all 3 parts in the query being pushed down to SQL Server.
SELECT a.Field1, b.Field2
FROM database_1.schema_1.table_1 as a
LEFT JOIN database_2.schema_2.table_2 as b
on a.Field4 == b.Field8