I have a table (TableA) in a SQL database stored on a server, accessible via Microsoft SQL Server Management Studio.
Then I have a Databricks notebook which creates a table (TableB) that is then appended to the one stored on the server (TableA).
To append TableB to TableA I use Spark:
df_tableB.write.format("jdbc") \
.mode('append') \
.option("url", db_jdbc_url) \
.option("driver", driver) \
.option("dbtable", table_name) \
.option("user", db_user) \
.option("password", db_password) \
.save()
This works perfectly if the schemas of TableA and TableB are the same. However, I found out that TableB could now have a slightly different schema; in particular, there could be additional columns.
Therefore I wonder if there is a mode to append the tables so that all the columns in common are appended as they are right now, and the new ones are appended as well, displaying maybe "None". Would you be able to propose a smart and elegant way to achieve my goal?
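To make the mismatch concrete, here is a minimal sketch with made-up column names (id, value and extra_info are just placeholders, not my real columns):
# TableA (SQL Server):  id INT, value FLOAT
# TableB (Databricks):  id INT, value FLOAT, extra_info VARCHAR  <- new column
df_tableB = spark.createDataFrame(
    [(1, 10.0, "a"), (2, 20.0, "b")],
    ["id", "value", "extra_info"],
)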
CodePudding user response:
Read the schema of TableA and select only those columns from TableB:
from pyspark.sql import functions as F

df_tableA = spark.read.format("jdbc").option(...)...load(...)

# For each TableA column: take it from TableB if it exists there, otherwise use a
# typed NULL (a bare lit(None) is NullType, which the JDBC writer cannot map).
columns = [
    F.col(f.name) if f.name in df_tableB.columns
    else F.lit(None).cast(f.dataType).alias(f.name)
    for f in df_tableA.schema.fields
]
df_tableB.select(columns).write.format("jdbc") \
.mode('append') \
.option("url", db_jdbc_url) \
.option("driver", driver) \
.option("dbtable", table_name) \
.option("user", db_user) \
.option("password", db_password) \
.save()
This way only the columns that are present in TableA are selected (any extra columns in TableB are dropped), columns missing from TableB are written as NULL, and the column order matches TableA.
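As a quick sanity check before the write (reusing the hypothetical columns from the question), you can compare the aligned schema against TableA:
# The projected TableB should now have exactly TableA's columns, in TableA's order
df_tableB_aligned = df_tableB.select(columns)
print(df_tableA.schema.simpleString())
print(df_tableB_aligned.schema.simpleString())
# e.g. the extra_info column from TableB is simply dropped by the select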