AnalysisException using when in pyspark


I have two dataframes:

df = spark.createDataFrame([("joe", 34), ("luisa", 22)], ["name", "age"])
df2 = spark.createDataFrame([("joe", 88), ("luisa", 99)], ["name", "age"])

I want to update the age when the names match. So I thought using a when() would work.

df.withColumn("age", F.when(df.name == df2.name, df2.age).otherwise(df.age))

but this results in this error:

AnalysisException: Resolved attribute(s) name#181,age#182L missing from name#177,age#178L in operator !Project [name#177, CASE WHEN (name#177 = name#181) THEN age#182L END AS age#724L]. Attribute(s) with the same name appear in the operation: name,age. Please check if the right attribute(s) are used.;

How do I resolve this? When I print the when expression, I see:

Column<'CASE WHEN (name = name) THEN age ELSE age END'>

CodePudding user response:

You need a join:

df.join(df2, how="left", on="name").withColumn("age", F.coalesce(df2.age, df.age))

CodePudding user response:

By "update age", I assume you want latest/greatest age:

df \
  .join(df2, how="inner", on="name") \
  .withColumn("updated_age", F.greatest(df2.age, df.age)) \
  .select("name", F.col("updated_age").alias("age"))

[Out]:
 ----- --- 
| name|age|
 ----- --- 
|  joe| 88|
|luisa| 99|
 ----- --- 