I have a function that takes two parameters: a PySpark DataFrame and a list of variable names from a config file. I am trying to loop over the list and check whether each of those variables is null in the DataFrame, then append a column with the suffix "_NullCheck". Right now only the last variable in my list shows up in the output DataFrame. Can someone explain what I am doing wrong?
Here is my code so far.
def nullCheck(df, configfile2):
    nullList = getNullList(configfile2)
    for nullCol in nullList:
        subset_df = df.withColumn(f"{nullCol}_NullCheck",
                                  when(df[f"{nullCol}"].isNull(), "Y")
                                  .otherwise("N"))
    return subset_df
CodePudding user response:
Every pass through your loop builds subset_df from the original df, so each iteration throws away the column added by the previous one; only the last iteration's result survives. Reassign the result back on every iteration instead:

def nullCheck(df, configfile2):
    nullList = getNullList(configfile2)
    for nullCol in nullList:
        df = df.withColumn(f"{nullCol}_NullCheck",
                           when(df[nullCol].isNull(), "Y")
                           .otherwise("N"))
    return df
CodePudding user response:
Multiple withColumn() calls are generally bad if there are a lot of columns (like, a lot!). Try a list comprehension within select() instead.
from pyspark.sql import functions as func

def nullCheck(df, configfile2):
    nullList = getNullList(configfile2)
    subset_df = df.select(
        '*',
        *[func.when(func.col(nullCol).isNull(), func.lit('Y'))
              .otherwise(func.lit('N'))
              .alias(nullCol + '_NullCheck')
          for nullCol in nullList]
    )
    return subset_df