pyspark.sql.utils.AnalysisException: Column ambiguous but no duplicate column names

I'm getting an ambiguous column exception when joining on the id column of a DataFrame, but there are no duplicate columns in the DataFrame. What could be causing this error?

Join operation, where a and input have been processed by other functions:

import pyspark.sql.functions as F

b = (
    input
    .where(F.col('st').like('%VALUE%'))
    .select('id', 'sii')
)
a.join(b, b['id'] == a['item'])

Dataframes:

(Pdb) a.explain()
== Physical Plan ==
*(1) Scan ExistingRDD[item#25280L,sii#24665L]

(Pdb) b.explain()
== Physical Plan ==
*(1) Project [id#23711L, sii#24665L]
+- *(1) Filter (isnotnull(st#25022) AND st#25022 LIKE %VALUE%)
   +- *(1) Scan ExistingRDD[id#23711L,st#25022,sii#24665L]

Exception:

pyspark.sql.utils.AnalysisException: Column id#23711L are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via Dataset.as before joining them, and specify the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > $"b.id"). You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.;

If I recreate the dataframe using the same schema, I do not get any errors:

b_clean = spark_session.createDataFrame([], b.schema)
a.join(b_clean, b_clean['id'] == a['item'])

What can I look at to troubleshoot what happened with the original dataframes that would cause the ambiguous column error?

CodePudding user response:

This error, together with the fact that the sii column has the same expression ID in both plans (sii#24665L), tells us that both the a and b DataFrames were derived from the same source. In essence, that makes your join a self-join, which is exactly what the error message says. In such cases it's recommended to alias the DataFrames. Try this:

a.alias('a').join(b.alias('b'), F.col('b.id') == F.col('a.item'))
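To see why the aliases matter, here is a minimal sketch of the kind of shared lineage that triggers this check. The data and the names source, df1, and df2 are made up for illustration; the point is that both join sides trace back to the same DataFrame, so a column referenced as df['col'] carries an expression ID that exists on both sides:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source; df1 and df2 both derive from it.
source = spark.createDataFrame(
    [(1, 'A VALUE', 10), (2, 'B', 20)], ['id', 'st', 'sii'])
df1 = source
df2 = source.where(F.col('st').like('%VALUE%'))

# May raise the same AnalysisException: df1['id'] and df2['id'] resolve to
# the same expression ID, so Spark cannot tell which join side is meant.
# df1.join(df2, df1['id'] == df2['id'])

# With aliases and qualified names, the references are unambiguous.
df1.alias('l').join(df2.alias('r'), F.col('l.id') == F.col('r.id')).show()

This also explains why your b_clean experiment works: createDataFrame builds a fresh plan with new expression IDs, so the shared lineage (and with it the ambiguity) disappears. The spark.sql.analyzer.failAmbiguousSelfJoin setting mentioned in the error message disables the check entirely, but aliasing is the cleaner fix.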

Note that the resulting DataFrame will have two sii columns, so in some systems you may not be able to save the result. I would recommend explicitly selecting only the columns you need. If you decide you need both duplicate columns, renaming one of them with alias also helps. E.g.:

df = (
    a.alias('a').join(b.alias('b'), F.col('b.id') == F.col('a.item'))
    .select('item',
            'id',
            F.col('a.sii').alias('a_sii')
    )
)
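As a sketch of that renaming approach, reusing the a and b from the question (b_renamed is a hypothetical name): renaming the overlapping column on one side before the join means the result has no duplicate names at all, while the aliases still keep the join condition unambiguous:

# Rename b's sii up front so the joined result has unique column names.
b_renamed = b.withColumnRenamed('sii', 'b_sii')

df = (
    a.alias('a')
    .join(b_renamed.alias('b'), F.col('b.id') == F.col('a.item'))
)
# df columns: item, sii (from a), id, b_sii -- nothing left to disambiguate.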