I have two DataFrames:
df = spark.createDataFrame([("joe", 34), ("luisa", 22)], ["name", "age"])
df2 = spark.createDataFrame([("joe", 88), ("luisa", 99)], ["name", "age"])
I want to update `age` where the names match, so I thought a `when()` would work:
df.withColumn("age", F.when(df.name == df2.name, df2.age).otherwise(df.age))
but this results in this error:
AnalysisException: Resolved attribute(s) name#181,age#182L missing from name#177,age#178L in operator !Project [name#177, CASE WHEN (name#177 = name#181) THEN age#182L END AS age#724L]. Attribute(s) with the same name appear in the operation: name,age. Please check if the right attribute(s) are used.;
How do I resolve this? When I print the `when` expression I see:
Column<'CASE WHEN (name = name) THEN age ELSE age END'>
CodePudding user response:
You need a join. After joining on `name`, select the coalesced age (rather than adding a column) so you don't end up with two ambiguous `age` columns:
df.join(df2, on="name", how="left").select("name", F.coalesce(df2.age, df.age).alias("age"))
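To see why the join works where the standalone `when()` fails (the two DataFrames must be combined into one plan before columns from both can be referenced), here is the same left-join-then-coalesce logic sketched in plain Python, with no Spark session needed. The function name and row layout are illustrative, not part of any Spark API:

```python
def left_join_coalesce(left, right):
    """Update ages from `right` where names match; keep the old age otherwise.

    `left` and `right` are lists of (name, age) tuples, mimicking the two
    DataFrames. This mirrors coalesce(df2.age, df.age) after a left join:
    prefer the new age, fall back to the existing one.
    """
    right_ages = dict(right)  # name -> new age
    return [(name, right_ages.get(name, age)) for name, age in left]

df_rows = [("joe", 34), ("luisa", 22)]
df2_rows = [("joe", 88), ("luisa", 99)]
print(left_join_coalesce(df_rows, df2_rows))  # [('joe', 88), ('luisa', 99)]
```

Rows in `left` with no match in `right` keep their original age, just as the nulls from the left join fall through `coalesce`.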
CodePudding user response:
By "update age", I assume you want the latest/greatest age:
df \
.join(df2, how="inner", on="name") \
.withColumn("updated_age", F.greatest(df2.age, df.age)) \
.select("name", F.col("updated_age").alias("age"))
[Out]:
+-----+---+
| name|age|
+-----+---+
|  joe| 88|
|luisa| 99|
+-----+---+
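The inner-join-plus-`greatest` variant can be sketched the same way in plain Python (illustrative names only, not a Spark API): only names present in both inputs survive the inner join, and the larger of the two ages wins.

```python
def inner_join_greatest(left, right):
    """Keep only names present in both inputs, taking the larger age.

    Mirrors df.join(df2, on="name", how="inner") followed by
    F.greatest(df2.age, df.age) on lists of (name, age) tuples.
    """
    right_ages = dict(right)  # name -> age from the second input
    return [(name, max(age, right_ages[name]))
            for name, age in left if name in right_ages]

print(inner_join_greatest([("joe", 34), ("luisa", 22)],
                          [("joe", 88), ("luisa", 99)]))
# [('joe', 88), ('luisa', 99)]
```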