How to Left Join a Dataframe into an Existing Table in Databricks-CodePudding

I have a delta table in Databricks that I am loading into a dataframe to perform a specific numpy operation:

import pyspark.pandas as ps
df = spark.sql("SELECT Id,Field_to_Transform FROM bronze.Table_A where Field_to_Transform is not null").toPandas()

The operation I perform is to remove special characters:

df['Transformed_Field'] = [''.join(e for e in x if e.isalnum()) for x in df['Field_to_Transform ']]
df.drop(['Field_to_Transform '], axis=1)

So this leaves me with the dataframe "df" which has just the id and the Transformed_Field in it:

Id	Transformed_Field
00A1	12345
00A2	123456
00A3	1234
00A4	1234568

Now I want to left join the df back to bronze.Table_A in databricks by simply joining back on the id field.

What is the most effecient way to join df back to bronze.Table_A? Things I have tried so far:

Saved the entire bronze.Table_A in a new dataframe df2, used df.merge to put them together, and then created a brand new table. This worked but it was way too excessive and I do not need a new table, just the transformed column joined back.
Tried to use spark.sql to perform the left join in a SQL query but it seems I cannot access a df inside a spark.sql query.

Any help is much appreciated, thank you.

CodePudding user response：

Option 1 - DataFrame API way

The first option is a modification of your first bullet point:

Saved the entire bronze.Table_A in a new dataframe df2, used df.merge to put them together, and then created a brand new table. This worked but it was way too excessive and I do not need a new table, just the transformed column joined back.

The merge operation is a Pandas method and not PySpark, hence the performance might not be optimal - docs here. Loading the whole table into DataFrame is the correct approach, that just needs the built-in join method to the required left-join, like this:

table_df = spark.read.table("bronze.Table_A")

# Join
merged_df = table_df.join(df, on="Id", how="left")

Option 2 - SQL way

The second option builds on your second bullet point:

Tried to use spark.sql to perform the left join in a SQL query but it seems I cannot access a df inside a spark.sql query.

You can temporarily register DataFrame as a view, and then query it using plain SQL, whether in spark.sql method or any other way. Try doing this:

df.createOrReplaceTempView("transformed_df")

# Then join like this
spark.sql("SELECT * FROM bronze.Table_A ta LEFT JOIN transformed_df tdf ON ta.Id = tdf.Id")