Replace values in PySpark column directly without joining


I have two data frames:

data1 = [('Andy', 'male'), ('Julie', 'female'), ('Danny', 'male')]
columns1 = ['name', 'gender']
df1 = spark.createDataFrame(data=data1, schema=columns1)


data2 = [('male', 1), ('female', 2)]
columns2 = ['gender', 'enum']
df2 = spark.createDataFrame(data=data2, schema=columns2)
+-----+------+
| name|gender|
+-----+------+
| Andy|  male|
|Julie|female|
|Danny|  male|
+-----+------+

+------+----+
|gender|enum|
+------+----+
|  male|   1|
|female|   2|
+------+----+

I am looking to replace column gender in df1 with the enum values from df2. I could do this by:

new_df = df1.join(df2, on='gender', how='inner')

And then drop column gender and rename column enum in new_df to gender, as in the sketch below. This is cumbersome and depends on the gender column having the same name in both df1 and df2.
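
For reference, a rough sketch of the full join-based workaround described above (just one way to chain the drop and rename steps):

# join on the shared column, drop it, then rename enum back to gender
new_df = (df1.join(df2, on='gender', how='inner')
             .drop('gender')
             .withColumnRenamed('enum', 'gender'))
new_df.show()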

Is there a way to directly replace the values without these intermediate steps?

CodePudding user response:

Since df2 contains no more than a few thousand rows, you can collect all of its data to the driver and write a udf like this:

from pyspark.sql import functions as f

# collect the small lookup table to the driver and broadcast it as a dict
df2_list = df2.collect()
d = spark.sparkContext.broadcast(dict([(c[0], c[1]) for c in df2_list]))
replace = f.udf(lambda x: d.value[x])

# then you can use replace on any dataframe like this:
df1.withColumn("gender", replace("gender")).show()
+-----+------+
| name|gender|
+-----+------+
| Andy|     1|
|Julie|     2|
|Danny|     1|
+-----+------+

I am not sure it is simpler, but it is another way to do it.

NB: the broadcast is not mandatory, but it allows the dictionary to be sent to each executor only once instead of once with every single task.
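
For comparison, a minimal sketch of the non-broadcast variant (assuming the same df2 and the f alias from above): the plain dictionary is simply captured in the udf's closure.

# without broadcast: the dict is serialized into the udf closure
# and re-sent with every task instead of once per executor
plain_d = dict([(c[0], c[1]) for c in df2.collect()])
replace_plain = f.udf(lambda x: plain_d[x])
df1.withColumn("gender", replace_plain("gender")).show()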

CodePudding user response:

Without a join you would need to provide all the possible mappings yourself:

df1 = df1.replace({'male': '1', 'female': '2'}, subset='gender')
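
A small caveat (the cast below assumes the f functions alias imported in the first answer): DataFrame.replace keeps the column's original type, so the mapping values must be strings here and gender stays a string column; cast afterwards if integer enums are needed.

# the replaced column still holds strings ('1', '2'); cast to int if required
df1 = df1.withColumn('gender', f.col('gender').cast('int'))
df1.show()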