AnalysisException: ambiguous reference error when trying to replace withColumn() with select()-CodePudding

I get a StackOverflowException error when I use multiple times withColumn() to update the values of a column in Pyspark.

My code with I got the StackOverflowException was:

df = df.withColumn("element", when(df["element"] == 1,"first").otherwise(df["element"]))
df = df.withColumn("element", when(df["element"] == 2,"second").otherwise(df["element"]))
df = df.withColumn("element", when(df["element"] == 3,"third").otherwise(df["element"]))
df = df.withColumn("element", when(df["element"] == 4,"fourth").otherwise(df["element"]))

The Spark documentation suggests to use the select() function. So I tried:

df = df.select("*", (when(df["element"] == 1,"first")).alias("element"))
df = df.select("*", (when(df["element"] == 2,"second")).alias("element"))
df = df.select("*", (when(df["element"] == 3,"third")).alias("element"))
df = df.select("*", (when(df["element"] == 4,"fourth")).alias("element"))

But I recieve an error because of the column "element" isn't updated, another column with the same name is created. The error is this:

Py4JJavaError: An error occurred while calling o3723.apply.
: org.apache.spark.sql.AnalysisException: Reference 'element' is ambiguous, could be: element, element.;
    at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
    at org.apache.spark.sql.Dataset.resolve(Dataset.scala:229)
    at org.apache.spark.sql.Dataset.col(Dataset.scala:1282)
    at org.apache.spark.sql.Dataset.apply(Dataset.scala:1249)
    at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)

How could I do it?

Thank you in advance!

CodePudding user response：

I think you can use .when multiple times, then .otherwise. Also you should name the new column something different so that you don't get an ambiguous column error:

df = df.withColumn("element_new", when(df["element"] == 1,"first").when(df["element"] == 2,"second").when(df["element"] == 3,"third").when(df["element"] == 4,"fourth").otherwise(df["element"]))

Using .select:

df = df.select("*",when(df["element"] == 1,"first").when(df["element"] == 2,"second").when(df["element"] == 3,"third").when(df["element"] == 4,"fourth").otherwise(df["element"]).alias("element_new"))

Example output:

 ------- ----------- 
|element|element_new|
 ------- ----------- 
|      1|      first|
|      2|     second|
|      3|      third|
|      4|     fourth|
|      5|          5|
 ------- -----------