Overwrite rows containing NULL in a DataFrame with values from another DataFrame in PySpark


I have two dataframes, df and df1. Some columns in df contain NULL, but df1 has non-null values for those columns. I just need to overwrite the rows where the NULLs exist.

The df is below:

+------------+--------------------+-------+---------------+--------------------+----------+------------+
|          Id|                Name|Country|           City|             Address|  Latitude|   Longitude|
+------------+--------------------+-------+---------------+--------------------+----------+------------+
| 42949672960|Americana Resort ...|     US|         Dillon|         135 Main St|      null|        null|
| 42949672965|Comfort Inn Delan...|     US|         Deland|400 E Internation...| 29.054737|  -81.297208|
| 60129542147|Ubaa Old Crawford...|     US|    Des Plaines|     5460 N River Rd|      null|        null|
+------------+--------------------+-------+---------------+--------------------+----------+------------+

The df1 is below:

+-------------+--------------------+-------+------------+--------------------+----------+------------+
|           Id|                Name|Country|        City|             Address|  Latitude|   Longitude|
+-------------+--------------------+-------+------------+--------------------+----------+------------+
|  42949672960|Americana Resort ...|     US|      Dillon|         135 Main St|39.6286685|-106.0451009|
|  60129542147|Ubaa Old Crawford...|     US| Des Plaines|     5460 N River Rd|42.0654049| -87.8916252|
+-------------+--------------------+-------+------------+--------------------+----------+------------+

I want this result:

+------------+--------------------+-------+---------------+--------------------+----------+------------+
|          Id|                Name|Country|           City|             Address|  Latitude|   Longitude|
+------------+--------------------+-------+---------------+--------------------+----------+------------+
| 42949672960|Americana Resort ...|     US|         Dillon|         135 Main St|39.6286685|-106.0451009|
| 42949672965|Comfort Inn Delan...|     US|         Deland|400 E Internation...| 29.054737|  -81.297208|
...
...

CodePudding user response:

You can left join them on id and then use coalesce to pick the first non-null lat/lon for each row (an inner join would also work, but only if every id in df1 is present in df2; otherwise rows such as 42949672965 would be dropped).

df1
+-----------+---------+----------+
|         id|      lat|       lon|
+-----------+---------+----------+
|42949672960|     null|      null|
|42949672965|29.054737|-81.297208|
|60129542147|     null|      null|
+-----------+---------+----------+

df2
+-----------+----------+------------+
|         id|       lat|         lon|
+-----------+----------+------------+
|42949672960|39.6286685|-106.0451009|
|60129542147|42.0654049| -87.8916252|
+-----------+----------+------------+
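
For reference, the two frames above could be created with something like this (just a sketch; it assumes an active SparkSession named spark, with the schemas taken from the show() output):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# values copied from the tables above; Python None shows up as null in Spark
df1 = spark.createDataFrame(
    [(42949672960, None, None),
     (42949672965, 29.054737, -81.297208),
     (60129542147, None, None)],
    schema='id long, lat double, lon double'
)
df2 = spark.createDataFrame(
    [(42949672960, 39.6286685, -106.0451009),
     (60129542147, 42.0654049, -87.8916252)],
    schema='id long, lat double, lon double'
)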

Join them together

from pyspark.sql import functions as F

(df1
    .join(df2, on=['id'], how='left')   # keep every id from df1
    .select(
        F.col('id'),
        # take df1's value when it is not null, otherwise fall back to df2's
        F.coalesce(df1['lat'], df2['lat']).alias('lat'),
        F.coalesce(df1['lon'], df2['lon']).alias('lon')
    )
    .show()
)
# +-----------+----------+------------+
# |         id|       lat|         lon|
# +-----------+----------+------------+
# |42949672965| 29.054737|  -81.297208|
# |60129542147|42.0654049| -87.8916252|
# |42949672960|39.6286685|-106.0451009|
# +-----------+----------+------------+
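
Applied to the original df and df1 from the question, the same idea would look roughly like this (only a sketch: it assumes df and df1 have the schema shown in the question and that Id is unique in df1; Latitude/Longitude from df1 are renamed to lat2/lon2 just to avoid ambiguous column names after the join):

from pyspark.sql import functions as F

fixed = (df
    .join(
        df1.select(
            'Id',
            F.col('Latitude').alias('lat2'),
            F.col('Longitude').alias('lon2')
        ),
        on='Id', how='left'
    )
    .select(
        'Id', 'Name', 'Country', 'City', 'Address',
        # prefer df's own coordinates, fill the nulls from df1
        F.coalesce(F.col('Latitude'), F.col('lat2')).alias('Latitude'),
        F.coalesce(F.col('Longitude'), F.col('lon2')).alias('Longitude')
    )
)
fixed.show()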