I need to check a dataframe for double quotes. I am iterating through all the columns for this check, but it takes a lot of time. I am using Azure Databricks for this.
from pyspark.sql.functions import col, lit, when

for column in columns_list:
    column_name = "`" + column + "`"  # backtick-escape the column name
    # Flag rows where this column contains a double quote
    df_reject = source_data.withColumn("flag_quotes", when(source_data[column_name].rlike("[\"\"]"), lit("Yes")).otherwise(lit("No")))
    df_quo_rejected_df = df_reject.filter(col("flag_quotes") == "Yes")
    df_quo_rejected_df = df_quo_rejected_df.withColumn("Error", lit(err))
    df_quo_rejected_df.coalesce(1).write.mode("append").option("header", "true")\
        .option("delimiter", delimiter)\
        .format("com.databricks.spark.csv")\
        .save(filelocwrite)
I have around 500 columns and 40 million records. I tried unioning the dataframes on every iteration, but that operation OOMs after some time, so instead I save the dataframe and append to it on every iteration. Please help me find a way to optimize the running time.
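For reference, the unioning variant I tried looked roughly like this (a rough sketch reusing the same variables as above), before I switched to the append-per-iteration write:

from functools import reduce
from pyspark.sql.functions import col, lit, when

rejected = []
for column in columns_list:
    column_name = "`" + column + "`"
    # Same per-column check as above, collected instead of written out
    rejected.append(
        source_data
        .withColumn("flag_quotes", when(source_data[column_name].rlike("[\"\"]"), lit("Yes")).otherwise(lit("No")))
        .filter(col("flag_quotes") == "Yes")
        .withColumn("Error", lit(err))
    )

# Unioning ~500 single-column checks builds one huge plan, which is what OOMs
df_all_rejected = reduce(lambda a, b: a.unionByName(b), rejected)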
CodePudding user response:
Instead of looping through the columns, you can check all of their values in a single pass using exists.
from pyspark.sql import functions as F

# Backtick-escape the column names and build one array column from all of them
columns_list = [f"`{c}`" for c in columns_list]

# Keep only rows where at least one column contains a double quote
df_reject = source_data.filter(F.exists(F.array(*columns_list), lambda x: x.rlike("[\"\"]")))
df_cols_add = df_reject.select('*', F.lit('Yes').alias('flag_quotes'), F.lit(err).alias('Error'))
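Then you can write the result out once instead of appending inside a loop. A minimal sketch, reusing the delimiter, err and filelocwrite from your question (note that exists requires Spark 3.1+):

(df_cols_add
    .coalesce(1)
    .write.mode("append")
    .option("header", "true")
    .option("delimiter", delimiter)
    .format("csv")  # built-in CSV source; com.databricks.spark.csv also works
    .save(filelocwrite))

This scans the 40 million rows once instead of roughly 500 times, which is where the runtime saving comes from.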