I get a StackOverflowException error when I use multiple times withColumn() to update the values of a column in Pyspark.
My code with I got the StackOverflowException was:
df = df.withColumn("element", when(df["element"] == 1,"first").otherwise(df["element"]))
df = df.withColumn("element", when(df["element"] == 2,"second").otherwise(df["element"]))
df = df.withColumn("element", when(df["element"] == 3,"third").otherwise(df["element"]))
df = df.withColumn("element", when(df["element"] == 4,"fourth").otherwise(df["element"]))
The Spark documentation suggests to use the select() function. So I tried:
df = df.select("*", (when(df["element"] == 1,"first")).alias("element"))
df = df.select("*", (when(df["element"] == 2,"second")).alias("element"))
df = df.select("*", (when(df["element"] == 3,"third")).alias("element"))
df = df.select("*", (when(df["element"] == 4,"fourth")).alias("element"))
But I recieve an error because of the column "element" isn't updated, another column with the same name is created. The error is this:
Py4JJavaError: An error occurred while calling o3723.apply.
: org.apache.spark.sql.AnalysisException: Reference 'element' is ambiguous, could be: element, element.;
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
at org.apache.spark.sql.Dataset.resolve(Dataset.scala:229)
at org.apache.spark.sql.Dataset.col(Dataset.scala:1282)
at org.apache.spark.sql.Dataset.apply(Dataset.scala:1249)
at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
How could I do it?
Thank you in advance!
CodePudding user response:
I think you can use .when
multiple times, then .otherwise
. Also you should name the new column something different so that you don't get an ambiguous column error:
df = df.withColumn("element_new", when(df["element"] == 1,"first").when(df["element"] == 2,"second").when(df["element"] == 3,"third").when(df["element"] == 4,"fourth").otherwise(df["element"]))
Using .select
:
df = df.select("*",when(df["element"] == 1,"first").when(df["element"] == 2,"second").when(df["element"] == 3,"third").when(df["element"] == 4,"fourth").otherwise(df["element"]).alias("element_new"))
Example output:
------- -----------
|element|element_new|
------- -----------
| 1| first|
| 2| second|
| 3| third|
| 4| fourth|
| 5| 5|
------- -----------