I need to check a dataframe for double quotes. I am iterating through all the columns for this check, but it takes a lot of time. I am using Azure Databricks for this.
from pyspark.sql.functions import col, lit, when

for column in columns_list:
    column_name = "`" + column + "`"  # backtick-escape the column name
    # Flag rows where this column contains a double quote
    df_reject = source_data.withColumn("flag_quotes", when(source_data[column_name].rlike("[\"\"]"), lit("Yes")).otherwise(lit("No")))
    df_quo_rejected_df = df_reject.filter(col("flag_quotes") == "Yes")
    df_quo_rejected_df = df_quo_rejected_df.withColumn("Error", lit(err))
    df_quo_rejected_df.coalesce(1).write.mode("append").option("header", "true")\
        .option("delimiter", delimiter)\
        .format("com.databricks.spark.csv")\
        .save(filelocwrite)
I have around 500 columns and 40 million records. I tried unioning the dataframes on every iteration, but that operation OOMs after some time, so instead I save the dataframe and append to it on every iteration. Please help me find a way to optimize the running time.
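For reference, the unioning variant I tried looked roughly like this (a rough sketch reusing the same variables as above), before I switched to the append-per-iteration write:

from functools import reduce
from pyspark.sql.functions import col, lit, when

rejected = []
for column in columns_list:
    column_name = "`" + column + "`"
    # Same per-column check as above, collected instead of written out
    rejected.append(
        source_data
        .withColumn("flag_quotes", when(source_data[column_name].rlike("[\"\"]"), lit("Yes")).otherwise(lit("No")))
        .filter(col("flag_quotes") == "Yes")
        .withColumn("Error", lit(err))
    )

# Unioning ~500 single-column checks builds one huge plan, which is what OOMs
df_all_rejected = reduce(lambda a, b: a.unionByName(b), rejected)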
CodePudding user response:
Instead of looping through the columns, you can check all of their values in a single pass using exists.
from pyspark.sql import functions as F

# Backtick-escape the column names and build one array column from all of them
columns_list = [f"`{c}`" for c in columns_list]

# Keep only rows where at least one column contains a double quote
df_reject = source_data.filter(F.exists(F.array(*columns_list), lambda x: x.rlike("[\"\"]")))
df_cols_add = df_reject.select('*', F.lit('Yes').alias('flag_quotes'), F.lit(err).alias('Error'))
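Then you can write the result out once instead of appending inside a loop. A minimal sketch, reusing the delimiter, err and filelocwrite from your question (note that exists requires Spark 3.1+):

(df_cols_add
    .coalesce(1)
    .write.mode("append")
    .option("header", "true")
    .option("delimiter", delimiter)
    .format("csv")  # built-in CSV source; com.databricks.spark.csv also works
    .save(filelocwrite))

This scans the 40 million rows once instead of roughly 500 times, which is where the runtime saving comes from.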