I have two data frames:
data1 = [('Andy', 'male'), ('Julie', 'female'), ('Danny', 'male')]
columns1 = ['name', 'gender']
df1 = spark.createDataFrame(data=data1, schema=columns1)
data2 = [('male', 1), ('female', 2)]
columns2 = ['gender', 'enum']
df2 = spark.createDataFrame(data=data2, schema=columns2)
+-----+------+
| name|gender|
+-----+------+
| Andy|  male|
|Julie|female|
|Danny|  male|
+-----+------+

+------+----+
|gender|enum|
+------+----+
|  male|   1|
|female|   2|
+------+----+
I am looking to replace column gender in df1 with the enum values from df2. I could do this by:
new_df = df1.join(df2, on='gender', how='inner')
And then drop column gender and rename column enum in new_df to gender. This is cumbersome and depends on the column being named gender in both df1 and df2.
Is there a way to directly replace the values without these intermediate steps?
CodePudding user response:
Since df2 does not contain more than a few thousand elements, you can collect all the data and write a udf like this:
from pyspark.sql import functions as f

sc = spark.sparkContext
df2_list = df2.collect()
d = sc.broadcast(dict([(c[0], c[1]) for c in df2_list]))
replace = f.udf(lambda x: d.value[x])
# then you can use replace on any dataframe like this:
df1.withColumn("gender", replace("gender")).show()
+-----+------+
| name|gender|
+-----+------+
| Andy|     1|
|Julie|     2|
|Danny|     1|
+-----+------+
I am not sure it is simpler, but it is another way to go about it.
NB: the broadcast is not mandatory, but it allows the dictionary to be sent to each executor only once instead of once per task.
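One caveat worth noting: the lambda in the udf above raises a KeyError for any gender value missing from the dict. A safer variant (an assumption on my part, not in the original answer) uses dict.get, which this stand-alone sketch demonstrates with a plain dict in place of the broadcast value:

```python
# Stand-in for d.value as seen on an executor.
d_value = {'male': 1, 'female': 2}

# d_value[x] would raise KeyError for unknown genders;
# .get returns None instead, which the udf turns into a null.
safe_lookup = lambda x: d_value.get(x)
print(safe_lookup('male'))   # 1
print(safe_lookup('other'))  # None
```

In the udf itself this becomes replace = f.udf(lambda x: d.value.get(x)).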
CodePudding user response:
Without a join, you would need to provide all the possible mappings yourself:
df = df.replace({'male':'1', 'female':'2'}, subset='gender')
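If you'd rather not hard-code the mapping, one option (a sketch, not part of the original answer) is to build the dict from df2.collect() first; since replace on a string column needs string replacement values, the enums are cast to str. The rows collect() returns behave like (gender, enum) tuples, so the dict-building step looks like this:

```python
# Stand-in for df2.collect(): Row objects unpack like (gender, enum) tuples.
df2_rows = [('male', 1), ('female', 2)]

# df.replace needs replacement values matching the column type,
# so cast the integer enums to strings.
mapping = {gender: str(enum) for gender, enum in df2_rows}
print(mapping)  # {'male': '1', 'female': '2'}

# then: df = df.replace(mapping, subset='gender')
```

This keeps df2 as the single source of truth for the mapping, at the cost of a driver-side collect.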