Is there a way to just truncate a remote mssql table from databricks?


I'm trying to figure out how, or whether it's even possible, to execute a TRUNCATE TABLE command on a remote SQL Server from Databricks. I'm using Databricks for an ETL script, but it loads into a remote MS SQL Server.

The original script truncates the table and then appends to it repeatedly. It truncates it like this:

engine.execution_options(autocommit=True).execute("TRUNCATE TABLE my_table;")

I don't know how to replicate that using PySpark. I'm trying to avoid doing something like:

first_iteration = True
for item in items_to_query:
  df = f(...)
  if first_iteration:
    df.write.mode("overwrite")....
    first_iteration = False
  else:
    df.write.mode("append")...

it would be nicer if I could have something like

truncate_remote_table("table","database")

for item in items_to_query:
  df = f(...)
  df.write.option("mode","append")....

I hope I'm explaining this well. If you want to recommend doing it completely differently, that's fine. It's just that I work with a lot of people who are (rightfully) easily spooked about moving the script to Databricks, so I'd really like to change as little as possible at each step. Rock and a hard place.

I've searched on Google, but the results always seem to start with an existing dataframe and then use mode="overwrite" to truncate the table. Nothing is just a simple "TRUNCATE TABLE" command.

CodePudding user response:

I'll preface this by saying I know nothing about any of the technologies you've mentioned other than SQL Server itself. But in each database on a SQL Server instance there is a system stored procedure called sp_executesql, which allows you to construct a SQL string and execute it.

If you're able to call a stored procedure on the remote SQL Server, then you can run EXEC sp_executesql N'USE DatabaseName; TRUNCATE TABLE my_table;' to truncate whatever table you want. Because the database is selected with a USE statement inside the dynamic SQL, you can call sp_executesql from any database on the server.
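From the Databricks side, a minimal sketch of that call using the Spark JVM gateway (the same mechanism shown in a later answer below; DatabaseName, my_table, and the mssql_* connection variables are placeholders):

driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(mssql_url, mssql_user, mssql_pass)
# sp_executesql expects an nvarchar statement, hence the N'...' literal
connection.prepareCall("EXEC sp_executesql N'USE DatabaseName; TRUNCATE TABLE my_table;'").execute()
connection.close()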

CodePudding user response:

You have a few options...

Option 01: Truncate Table through the JDBC Connection.

You could create a JDBC connection to the target SQL Server table. As per the link below, you should be able to pass the TRUNCATE statement through the JDBC connection; replace the DELETE query in the example with your TRUNCATE statement. Be sure to commit even though it is a truncate statement.

https://www.tutorialspoint.com/jdbc/jdbc-delete-records.htm
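A rough sketch of that pattern from PySpark, again through the JVM gateway (mssql_url, mssql_user, and mssql_pass are placeholder connection details):

jvm = spark._sc._gateway.jvm
connection = jvm.java.sql.DriverManager.getConnection(mssql_url, mssql_user, mssql_pass)
connection.setAutoCommit(False)  # so the explicit commit below is legal
statement = connection.createStatement()
statement.executeUpdate("TRUNCATE TABLE my_table")
connection.commit()  # commit explicitly, as recommended above
statement.close()
connection.close()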

Option 02: Push down the truncate query

As per the answer from Peter Pan in the link below, you could use the same logic to push the truncate statement down to the SQL Server table, aliasing the statement as the dbtable.

https://stackoverflow.com/a/58629994/13280838
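For reference, the pushdown pattern from that answer looks roughly like the sketch below. One caution: Spark wraps whatever you pass as dbtable in SELECT * FROM (...) alias, so the statement has to be something SQL Server accepts inside a derived table; I have not verified that a TRUNCATE batch survives that wrapping, so test it on a throwaway table first. The connection details are placeholders:

pushdown_query = "(SELECT 1 AS ok) q"  # substitute the statement to push down
df = (spark.read
      .format("jdbc")
      .option("url", mssql_url)
      .option("dbtable", pushdown_query)
      .option("user", mssql_user)
      .option("password", mssql_pass)
      .load())
df.show()  # forces the statement to actually run on the server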

Option 03: Use an empty dataframe to truncate

As per the link below (How to truncate a table in PySpark?), you could create an empty dataframe and then use it to truncate the table. Please note that there are some caveats; truncate has not always worked as expected for me. Also, as far as my limited knowledge goes, using overwrite without the truncate option drops the table and the indexes are not recreated, so please be careful while using it.

As always, please be sure to try and test using a throwaway temp table before implementing anything, even in an actual DEV table. Hope it helps.

CodePudding user response:

As mentioned in the other answers, there are a few approaches to this:

  • Use JDBC directly to execute your code. In PySpark, you will need to go through the JVM gateway with something like this (from this answer):
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager  # java.sql.DriverManager via the Py4J gateway
connection = driver_manager.getConnection(mssql_url, mssql_user, mssql_pass)
# run the statement directly on the server, outside of Spark's read/write path
connection.prepareCall("TRUNCATE TABLE my_table").execute()
connection.close()
  • Truncate using an empty dataframe in overwrite mode, as pointed out by @rainingdistros. The only caveat is that by default it uses a drop-and-recreate approach, so you lose the indexes. But this is controlled by setting the truncate option to true instead of the default false (see docs):
# get a dataframe with table schema
df = spark.read.jdbc(....)
# truncate the table
df.limit(0).write.mode("overwrite").option("truncate", "true").jdbc(...)
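Putting the JDBC approach together with the loop from the question, the whole flow could look like the sketch below (f, items_to_query, my_table, and the mssql_*/connection_properties values are placeholders carried over from the question):

# truncate once up front, directly on the server
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(mssql_url, mssql_user, mssql_pass)
connection.prepareCall("TRUNCATE TABLE my_table").execute()
connection.close()

# then every write in the loop is a plain append
for item in items_to_query:
    df = f(item)
    df.write.mode("append").jdbc(mssql_url, "my_table", properties=connection_properties)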