I am trying to write spark dataframe into an existing delta table.
I do have multiple scenarios where I could save data into different tables as shown below.
SCENARIO-01:
I have an existing delta table and I have to write dataframe into that table with option mergeSchema
since the schema may change for each load.
I am doing the same with below command by providing delta table path
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").save(finalDF01DestFolderPath)
Just want to know whether this can be done by providing exisiting delta TABLE NAME instead of delta PATH.
This has been resolved by updating data write command as below.
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").saveAsTable(finalDF01DestTableName)
- Is this the correct way ?
SCENARIO 02:
I have to update the existing table if the record already exists and if not insert a new record. For this I am currently doing as shown below.
spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled = true")
DeltaTable.forPath(DestFolderPath)
.as("t")
.merge(
finalDataFrame.as("s"),
"t.id = s.id AND t.name= s.name")
.whenMatched().updateAll()
.whenNotMatched().insertAll()
.execute()
I tried with below script.
destMasterTable.as("t")
.merge(
vehMasterDf.as("s"),
"t.id = s.id")
.whenNotMatched().insertAll()
.execute()
but getting below error(even with alias
instead of as
).
error: value as is not a member of String
destMasterTable.as("t")
- Here also I am using delta table path as destination, Is there any way so that we could provide delta TABLE NAME instead of TABLE PATH?
It will be good to provide TABLE NAME instead of TABLE PATH, In case if we chage the table path later will not affect the code.
I have not seen anywhere in databricks documentation providing table name along with mergeSchema
and autoMerge
.
Is it possible to do so?
CodePudding user response:
To use existing data as a table instead of path you either were need to use saveAsTable
from the beginning, or just register existing data in the Hive metastore using the SQL command CREATE TABLE USING, like this (syntax could be slightly different depending on if you're running on Databricks, or OSS Spark, and depending on the version of Spark):
CREATE TABLE IF NOT EXISTS my_table
USING delta
LOCATION 'path_to_existing_data'
after that, you can use saveAsTable
.
For the second question - it looks like destMasterTable
is just a String
. To refer to existing table, you need to use function forName
from the DeltaTable
object (doc):
DeltaTable.forName(destMasterTable)
.as("t")
...