I'd like to create a column to use as the join key inside of the join like:
df1.join(df2
.withColumn('NewDF2Column', SOME_OPERATION)),
df1['key'] = df2['NewDF2Column'], how = 'left'))
PySpark can never find the NewDF2Column to use as the join key. It works if I create it first in another dataframe, but not dynamically like this. Is it possible? Thank you!
CodePudding user response:
Dataframes are immutable, which means that you need to reassign everytime your variable to get the result from it. In this case, you are creating your NewDF2Column
on the first parameter of join
operation, but the second parameter where you references NewDF2Column
again can't see the changes made before. How to solve it?
First option
# Creating before joining
df2 = df2.withColumn('NewDF2Column', SOME_OPERATION)
output_df = df1.join(df2, df1['key'] = df2['NewDF2Column'], how='left')
Second option
# Creating a column to join with the same name as df1
output_df = df1.join(df2.withColumn('key', SOME_OPERATION), on='key', how='left')