PySpark: parallelize column-wise operations in Python


I have a dataframe whose columns contain string values.

col1|col2
---------
aaa |bbb
ccc |ddd
aaa |ddd
eee |fff

I need to get the number of allowed values ({aaa, ddd}) present in each column.

import pyspark.sql.functions as F

cond = "`col1` = 'aaa' OR `col1` = 'ddd'"
dataframe.where(F.expr(cond)).count()

This gives the required count for one column. We loop through all the columns and run this operation on each column, roughly as in the sketch below.
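A simplified sketch of that loop (the counts dict is just for illustration, not our exact code):

import pyspark.sql.functions as F

counts = {}
for c in dataframe.columns:
    cond = f"`{c}` = 'aaa' OR `{c}` = 'ddd'"
    # each .count() triggers a separate Spark job, so 2000 columns mean 2000 full scans
    counts[c] = dataframe.where(F.expr(cond)).count()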

This approach takes hours when the number of columns grows to 2000.

Is there a better, faster approach that processes all the columns in parallel?

CodePudding user response:

One alternative is to use a Python list comprehension to apply the same condition to all columns of the dataframe in a single select:

import pyspark.sql.functions as F

ok_values = ['aaa', 'ddd']
dataframe = dataframe.select(
  # isin() yields a boolean; cast it to int and sum to get the per-column match count
  *[F.sum((F.col(c).isin(ok_values)).cast('integer')).alias(c) for c in dataframe.columns]
)

dataframe.show()
+----+----+
|col1|col2|
+----+----+
|   2|   2|
+----+----+
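Because the single select aggregates every column in one pass over the data (instead of launching one job per column), this should be much faster. If the per-column counts are then needed on the driver as a plain Python dict, the one-row result can be collected with first()/asDict(), e.g. (a small follow-up sketch, variable names as above):

counts = dataframe.first().asDict()
print(counts)  # {'col1': 2, 'col2': 2}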