How to avoid a unique key appearing twice in a PySpark left join


df_1 columns -> |id|pym_cat|sub_status|year|month|day|

df_2 columns -> |id|loc_provinsi|loc_kabupaten|loc_kecamatan|

Here's my code:

df_join = df_1.join(df_2, df_1.id == df_2.id, "left")

and the error message:

AnalysisException: "Reference 'id' is ambiguous, could be: b.id, id.;"

CodePudding user response:

Use withColumnRenamed to rename the overlapping column to a different name, then drop it at the end:

import pyspark.sql.functions as F

df_join = df_1.alias("df_1") \
              .join(df_2.withColumnRenamed("id", "id2").alias("df_2"), 
                    F.col("df_1.id") == F.col("df_2.id2"), 
                    "left") \
              .drop(F.col("df_2.id2"))

Or keep the duplicate column names and disambiguate them via the dataframe aliases:

import pyspark.sql.functions as F

df_join = df_1.alias("df_1") \
              .join(df_2.alias("df_2"), 
                    F.col("df_1.id") == F.col("df_2.id"), 
                    "left") \
              .drop(F.col("df_2.id"))

CodePudding user response:

Alias one of the dataframes. You can then drop columns selectively by prefixing the column with the alias:

df_join = df_1.join(df_2.alias("df_2"), on=df_1.id == df_2.id, how="left").drop(df_2.id)