df_1
columns -> |id|pym_cat|sub_status|year|month|day|
df_2
columns -> |id|loc_provinsi|loc_kabupaten|loc_kecamatan|
Here's my code:
df_join = df_1.join(df_2, df_1.id == df_2.id, "left")
The error message:
AnalysisException: "Reference 'id' is ambiguous, could be: b.id, id.;"
CodePudding user response:
Use withColumnRenamed to rename the overlapping column to a different name, then drop it at the end:
import pyspark.sql.functions as F

df_join = df_1.alias("df_1") \
    .join(df_2.withColumnRenamed("id", "id2").alias("df_2"),
          F.col("df_1.id") == F.col("df_2.id2"),
          "left") \
    .drop(F.col("df_2.id2"))
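As a quick sanity check, here's a minimal, self-contained sketch of that approach (the local SparkSession and toy sample values are assumptions, not from the question):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Toy stand-ins for the question's dataframes (values are made up)
df_1 = spark.createDataFrame([(1, "qris"), (2, "va")], ["id", "pym_cat"])
df_2 = spark.createDataFrame([(1, "Jawa Barat")], ["id", "loc_provinsi"])

df_join = df_1.alias("df_1") \
    .join(df_2.withColumnRenamed("id", "id2").alias("df_2"),
          F.col("df_1.id") == F.col("df_2.id2"),
          "left") \
    .drop(F.col("df_2.id2"))

print(df_join.columns)  # ['id', 'pym_cat', 'loc_provinsi'] -- a single unambiguous id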
Or keep the duplicate column names, but access them via the dataframe aliases:
import pyspark.sql.functions as F

df_join = df_1.alias("df_1") \
    .join(df_2.alias("df_2"),
          F.col("df_1.id") == F.col("df_2.id"),
          "left") \
    .drop(F.col("df_2.id"))
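The aliases also let you keep both id columns and disambiguate at selection time instead of dropping one right away; a sketch reusing df_1 and df_2 from the question:

import pyspark.sql.functions as F

df_join = df_1.alias("df_1") \
    .join(df_2.alias("df_2"),
          F.col("df_1.id") == F.col("df_2.id"),
          "left")

# Qualify the ambiguous column with its dataframe alias when selecting;
# the other columns exist in only one dataframe, so they need no prefix
result = df_join.select(F.col("df_1.id").alias("id"), "pym_cat", "loc_provinsi")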
CodePudding user response:
Alias one of the dataframes. You can then drop the duplicate column unambiguously by referencing it through that dataframe:
df.join(df1.alias('df1'), how='left', on=df.id == df1.id).drop(df1.id)
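Applied to the dataframes in the question, the same pattern would look something like this (a sketch, untested against your data):

# drop(df_2.id) resolves even after alias() because the Column object
# from df_2 keeps the same internal reference in the joined plan
df_join = df_1.join(df_2.alias('df_2'), on=df_1.id == df_2.id, how='left').drop(df_2.id)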