pyspark write failed with StackOverflowError


I am converting fixed-width files to Parquet in AWS Glue. My data has around 1,600 columns and around 3,000 rows. When I try to write the Spark DataFrame as Parquet, I get a StackOverflowError.
The issue also appears when I call count(), show(), etc. I tried calling cache() and repartition(), but the error persists.

The code works if I reduce the number of columns to 500.

Please help.

Below is my code:

    import pandas as pd

    # Read the raw fixed-width file; every record lands in a single "value" column
    data_df = spark.read.text(input_path)

    # The JSON schema describes each field's name, start offset, and length
    schema_df = pd.read_json(schema_path)
    df = data_df

    # Carve one column out of the fixed-width line for each schema entry
    for r in schema_df.itertuples():
        df = df.withColumn(
            str(r.name), df.value.substr(int(r.start), int(r.length))
        )
    df = df.drop("value")

    df.write.mode("overwrite").option("compression", "gzip").parquet(output_path)  # FAILING HERE

Stack trace below:

    2021-11-10 05:00:13,542 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:Traceback (most recent call last):
      File "/tmp/conv_fw_2_pq.py", line 148, in <module>
        partition_ts=parsed_args.partition_timestamp,
      File "/tmp/conv_fw_2_pq.py", line 125, in process_file
        df.write.mode("overwrite").option("compression", "gzip").parquet(output_path)
      File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 839, in parquet
        self._jwrite.parquet(path)
      File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
        answer, self.gateway_client, self.target_id, self.name)
      File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
        return f(*a, **kw)
      File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
        format(target_id, ".", name), value)
    py4j.protocol.Py4JJavaError: An error occurred while calling o7066.parquet.
    : java.lang.StackOverflowError
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.immutable.List.flatMap(List.scala:355)
    at org.apache.spark.sql.catalyst.expressions.Expression.references(Expression.scala:88)
    at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$references$1.apply(Expression.scala:88)
    at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$references$1.apply(Expression.scala:88)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.immutable.List.flatMap(List.scala:355)
    at org.apache.spark.sql.catalyst.expressions.Expression.references(Expression.scala:88)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$references$1.apply(QueryPlan.scala:45)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$references$1.apply(QueryPlan.scala:45)
    at scala.collection.immutable.Stream$$anonfun$flatMap$1.apply(Stream.scala:497)
    at scala.collection.immutable.Stream$$anonfun$flatMap$1.apply(Stream.scala:497)

CodePudding user response:

The official Spark documentation describes **withColumn** as follows: "This method introduces a projection internally. Therefore, calling it multiple times, for instance via loops in order to add multiple columns, can generate big plans which can cause performance issues and even a StackOverflowException. To avoid this, use **select()** with multiple columns at once."

It is recommended that you first build the full list of column expressions, then call select once to construct the new DataFrame, as sketched below.
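A minimal sketch of that approach, assuming the same `spark` session and the `input_path`, `schema_path`, and `output_path` variables from the question: build one `substr` expression per schema entry and pass them all to a single `select` call, so the query plan contains one projection instead of ~1,600 nested ones.

    import pandas as pd
    from pyspark.sql import functions as F

    data_df = spark.read.text(input_path)
    schema_df = pd.read_json(schema_path)

    # Build one substring expression per field instead of calling withColumn in a loop
    columns = [
        F.col("value").substr(int(r.start), int(r.length)).alias(str(r.name))
        for r in schema_df.itertuples()
    ]

    # A single select adds a single projection to the plan
    df = data_df.select(*columns)

    df.write.mode("overwrite").option("compression", "gzip").parquet(output_path)

Because every output column is derived directly from the original `value` column, there is no need to drop `value` afterwards; the single `select` simply leaves it out.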
