Create column and use as join key inside of JOIN in PySpark


I'd like to create a column and use it as the join key inside the join itself, like this:

df1.join(df2.withColumn('NewDF2Column', SOME_OPERATION),
         df1['key'] == df2['NewDF2Column'], how='left')

PySpark can never find NewDF2Column to use as the join key. It works if I create the column first in another dataframe, but not dynamically like this. Is it possible? Thank you!

CodePudding user response:

DataFrames are immutable, so withColumn does not modify df2; it returns a new DataFrame, and you must reassign your variable to keep the result. In your snippet, NewDF2Column is created on the DataFrame passed as the first argument to join, but the join condition still references the original df2, which has no such column. How to solve it?

First option

# Creating before joining
df2 = df2.withColumn('NewDF2Column', SOME_OPERATION)

output_df = df1.join(df2, df1['key'] == df2['NewDF2Column'], how='left')

Second option

# Creating a column to join with the same name as df1
output_df = df1.join(df2.withColumn('key', SOME_OPERATION), on='key', how='left')