I have a delta table in Databricks that I am loading into a dataframe to perform a specific numpy operation:
import pyspark.pandas as ps
df = spark.sql("SELECT Id,Field_to_Transform FROM bronze.Table_A where Field_to_Transform is not null").toPandas()
The operation I perform is to remove special characters:
df['Transformed_Field'] = [''.join(e for e in x if e.isalnum()) for x in df['Field_to_Transform ']]
df.drop(['Field_to_Transform '], axis=1)
So this leaves me with the dataframe "df" which has just the id
and the Transformed_Field
in it:
Id | Transformed_Field |
---|---|
00A1 | 12345 |
00A2 | 123456 |
00A3 | 1234 |
00A4 | 1234568 |
Now I want to left join the df back to bronze.Table_A
in databricks by simply joining back on the id
field.
What is the most effecient way to join df
back to bronze.Table_A
?
Things I have tried so far:
- Saved the entire
bronze.Table_A
in a new dataframedf2
, used df.merge to put them together, and then created a brand new table. This worked but it was way too excessive and I do not need a new table, just the transformed column joined back. - Tried to use spark.sql to perform the left join in a SQL query but it seems I cannot access a
df
inside a spark.sql query.
Any help is much appreciated, thank you.
CodePudding user response:
Option 1 - DataFrame API way
The first option is a modification of your first bullet point:
Saved the entire bronze.Table_A in a new dataframe df2, used df.merge to put them together, and then created a brand new table. This worked but it was way too excessive and I do not need a new table, just the transformed column joined back.
The merge
operation is a Pandas method and not PySpark, hence the performance might not be optimal - docs here. Loading the whole table into DataFrame is the correct approach, that just needs the built-in join
method to the required left-join, like this:
table_df = spark.read.table("bronze.Table_A")
# Join
merged_df = table_df.join(df, on="Id", how="left")
Option 2 - SQL way
The second option builds on your second bullet point:
Tried to use spark.sql to perform the left join in a SQL query but it seems I cannot access a df inside a spark.sql query.
You can temporarily register DataFrame as a view, and then query it using plain SQL, whether in spark.sql
method or any other way. Try doing this:
df.createOrReplaceTempView("transformed_df")
# Then join like this
spark.sql("SELECT * FROM bronze.Table_A ta LEFT JOIN transformed_df tdf ON ta.Id = tdf.Id")