Replace values in PySpark column directly without joining


I have two data frames:

data1 = [('Andy', 'male'), ('Julie', 'female'), ('Danny', 'male')]
columns1 = ['name', 'gender']
df1 = spark.createDataFrame(data=data1, schema=columns1)


data2 = [('male', 1), ('female', 2)]
columns2 = ['gender', 'enum']
df2 = spark.createDataFrame(data=data2, schema=columns2)
+-----+------+
| name|gender|
+-----+------+
| Andy|  male|
|Julie|female|
|Danny|  male|
+-----+------+

+------+----+
|gender|enum|
+------+----+
|  male|   1|
|female|   2|
+------+----+

I am looking to replace column gender in df1 with the enum values from df2. I could do this by:

new_df = df1.join(df2, on='gender', how='inner')

And then drop column gender and rename column enum in new_df to gender, as in the sketch below. This is cumbersome and depends on the gender column having the same name in both df1 and df2.
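
For reference, a rough sketch of the full join-based workaround described above (just one way to chain the drop and rename steps):

# join on the shared column, drop it, then rename enum back to gender
new_df = (df1.join(df2, on='gender', how='inner')
             .drop('gender')
             .withColumnRenamed('enum', 'gender'))
new_df.show()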

Is there a way to directly replace the values without these intermediate steps?

CodePudding user response:

Since df2 contains no more than a few thousand rows, you can collect all of its data to the driver and write a udf like this:

from pyspark.sql import functions as f

# collect the small lookup table to the driver and broadcast it as a dict
df2_list = df2.collect()
d = spark.sparkContext.broadcast(dict([(c[0], c[1]) for c in df2_list]))
replace = f.udf(lambda x: d.value[x])

# then you can use replace on any dataframe like this:
df1.withColumn("gender", replace("gender")).show()
+-----+------+
| name|gender|
+-----+------+
| Andy|     1|
|Julie|     2|
|Danny|     1|
+-----+------+

I am not sure it is simpler, but it is another way to do it.

NB: the broadcast is not mandatory, but it allows the dictionary to be sent to each executor only once instead of once with every single task.
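
For comparison, a minimal sketch of the non-broadcast variant (assuming the same df2 and the f alias from above): the plain dictionary is simply captured in the udf's closure.

# without broadcast: the dict is serialized into the udf closure
# and re-sent with every task instead of once per executor
plain_d = dict([(c[0], c[1]) for c in df2.collect()])
replace_plain = f.udf(lambda x: plain_d[x])
df1.withColumn("gender", replace_plain("gender")).show()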

CodePudding user response:

Without a join you would need to provide all the possible mappings yourself:

df1 = df1.replace({'male': '1', 'female': '2'}, subset='gender')
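
A small caveat (the cast below assumes the f functions alias imported in the first answer): DataFrame.replace keeps the column's original type, so the mapping values must be strings here and gender stays a string column; cast afterwards if integer enums are needed.

# the replaced column still holds strings ('1', '2'); cast to int if required
df1 = df1.withColumn('gender', f.col('gender').cast('int'))
df1.show()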