I have two dataframes, df and df1. Some columns in df contain NULL, but df1 has non-null values for those columns. I just need to overwrite the rows where the NULL exists.
The df is below:
+------------+--------------------+-------+---------------+--------------------+----------+------------+
|          Id|                Name|Country|           City|             Address|  Latitude|   Longitude|
+------------+--------------------+-------+---------------+--------------------+----------+------------+
| 42949672960|Americana Resort ...|     US|         Dillon|         135 Main St|      null|        null|
| 42949672965|Comfort Inn Delan...|     US|         Deland|400 E Internation...| 29.054737|  -81.297208|
| 60129542147|Ubaa Old Crawford...|     US|    Des Plaines|     5460 N River Rd|      null|        null|
+------------+--------------------+-------+---------------+--------------------+----------+------------+
The df1 is below:
+-------------+--------------------+-------+------------+--------------------+----------+------------+
|           Id|                Name|Country|        City|             Address|  Latitude|   Longitude|
+-------------+--------------------+-------+------------+--------------------+----------+------------+
|  42949672960|Americana Resort ...|     US|      Dillon|         135 Main St|39.6286685|-106.0451009|
|  60129542147|Ubaa Old Crawford...|     US| Des Plaines|     5460 N River Rd|42.0654049| -87.8916252|
+-------------+--------------------+-------+------------+--------------------+----------+------------+
I want this result:
+------------+--------------------+-------+---------------+--------------------+----------+------------+
|          Id|                Name|Country|           City|             Address|  Latitude|   Longitude|
+------------+--------------------+-------+---------------+--------------------+----------+------------+
| 42949672960|Americana Resort ...|     US|         Dillon|         135 Main St|39.6286685|-106.0451009|
| 42949672965|Comfort Inn Delan...|     US|         Deland|400 E Internation...| 29.054737|  -81.297208|
...
...
CodePudding user response:
You can left join them (or inner join, if every id appears in both dataframes), then use coalesce to pick the first non-null lat/lon value.
df1
+-----------+---------+----------+
|         id|      lat|       lon|
+-----------+---------+----------+
|42949672960|     null|      null|
|42949672965|29.054737|-81.297208|
|60129542147|     null|      null|
+-----------+---------+----------+
df2
+-----------+----------+------------+
|         id|       lat|         lon|
+-----------+----------+------------+
|42949672960|39.6286685|-106.0451009|
|60129542147|42.0654049| -87.8916252|
+-----------+----------+------------+
Join them together:
from pyspark.sql import functions as F

(df1
    .join(df2, on=['id'], how='left')
    .select(
        F.col('id'),
        # coalesce returns the first non-null argument, so df1's value
        # wins when present and df2's value fills the gaps
        F.coalesce(df1['lat'], df2['lat']).alias('lat'),
        F.coalesce(df1['lon'], df2['lon']).alias('lon')
    )
    .show()
)
# +-----------+----------+------------+
# |         id|       lat|       lon  |
# +-----------+----------+------------+
# |42949672965| 29.054737| -81.297208 |
# |60129542147|42.0654049| -87.8916252|
# |42949672960|39.6286685|-106.0451009|
# +-----------+----------+------------+