Loop is not working as expected on the pyspark dataframe


I have a function that takes two parameters: a PySpark DataFrame and a list of variable names from a config file. I am trying to loop over the list, check whether each of those variables is null in the DataFrame, and append a flag column whose name carries the suffix "_NullCheck". Right now only the last variable in my list shows up in the output DataFrame. Can someone explain what I am doing wrong?

Here is my code so far.

from pyspark.sql.functions import when

def nullCheck(df, configfile2):
    nullList = getNullList(configfile2)

    for nullCol in nullList:
        subset_df = df.withColumn(f"{nullCol}_NullCheck",
                        when(df[f"{nullCol}"].isNull(), "Y")
                        .otherwise("N"))

    return subset_df

CodePudding user response:

from pyspark.sql.functions import when

def nullCheck(df, configfile2):
    nullList = getNullList(configfile2)
    subset_df = []
    # Collect one DataFrame per checked column.
    for nullCol in nullList:
        subset_df.append(df.withColumn(f"{nullCol}_NullCheck",
                             when(df[f"{nullCol}"].isNull(), "Y")
                             .otherwise("N")))

    return subset_df
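
Note that this returns a list of DataFrames, one per checked column, rather than a single DataFrame. The root cause of the original problem is that each loop iteration builds subset_df from the unmodified df, discarding the column added in the previous pass. If the goal is one DataFrame carrying all the flag columns, the usual fix is to reassign on every iteration so the withColumn calls chain — a minimal sketch, assuming getNullList returns a list of column-name strings:

from pyspark.sql.functions import col, when

def nullCheck(df, configfile2):
    nullList = getNullList(configfile2)

    # Reassign df each pass so every withColumn builds on the previous
    # result instead of starting over from the original DataFrame.
    for nullCol in nullList:
        df = df.withColumn(f"{nullCol}_NullCheck",
                           when(col(nullCol).isNull(), "Y").otherwise("N"))

    return df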

CodePudding user response:

Multiple withColumn() calls are generally a bad idea when there are many columns (like, a lot!), since each call adds another projection to the query plan. Try a list comprehension inside a single select() instead.

import pyspark.sql.functions as func

def nullCheck(df, configfile2):
    nullList = getNullList(configfile2)

    # A single select() adds all the check columns in one projection.
    subset_df = df. \
        select('*',
               *[func.when(func.col(nullCol).isNull(), func.lit('Y')).
                 otherwise(func.lit('N')).alias(nullCol + '_NullCheck')
                 for nullCol in nullList]
               )

    return subset_df
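
For a quick local check, here is a minimal sketch of how the select() version behaves, assuming a local SparkSession and a hardcoded column list standing in for getNullList(configfile2):

from pyspark.sql import SparkSession
import pyspark.sql.functions as func

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Toy data: "age" is null in the second row, "name" never is.
df = spark.createDataFrame([("a", 1), ("b", None)], ["name", "age"])
nullList = ["name", "age"]  # stands in for getNullList(configfile2)

result = df.select(
    '*',
    *[func.when(func.col(c).isNull(), func.lit('Y'))
          .otherwise(func.lit('N')).alias(c + '_NullCheck')
      for c in nullList]
)
result.show()
# Expected: name_NullCheck is N for both rows; age_NullCheck is N, then Y.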