Using a loop over the column list is rather slow (a rough sketch of that loop follows the example below). Is there a way to do it with a single 'table scan'?
Example source table:
FRUIT  COLOR
Apple  Red
Apple  Red
Apple  Yellow
Pear   Yellow
Pear   Yellow
Desired output:
COLUMN  LEVEL   COUNT
FRUIT   Apple   3
FRUIT   Pear    2
COLOR   Red     2
COLOR   Yellow  3
Again, the goal is to get the 1-way frequencies of multiple categorical variables in an efficient manner.
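Roughly, the per-column loop I am trying to avoid looks like this (simplified; df is the source table above):
import pyspark.sql.functions as F

# One groupBy/count per column, then union the per-column results
parts = []
for c in df.columns:
    parts.append(
        df.groupBy(c).count()
          .select(F.lit(c).alias("COLUMN"), F.col(c).alias("LEVEL"), "count")
    )

freq = parts[0]
for p in parts[1:]:
    freq = freq.unionByName(p)
freq.show()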
CodePudding user response:
leleogere's answer achieves what I was after. I am posting a generalization for reference:
import pyspark.sql.functions as F

data = [
    ('Apple',  'Red',    'small'),
    ('Apple',  'Red',    'small'),
    ('Apple',  'Yellow', 'large'),
    ('Pear',   'Yellow', 'large'),
    ('Pear',   'Yellow', 'large'),
    ('Orange', 'Orange', 'large')]
varlist = ['FRUIT', 'COLOR', 'SIZE']

df = spark.createDataFrame(data, varlist)

# Build the stack() argument list: a quoted column-name literal followed by the column itself
stack_string = ', '.join([f"'{v}', {v}" for v in varlist])

df.select(
    F.expr(f"stack({len(varlist)}, {stack_string}) as (COLUMN, LEVEL)")
).groupBy("COLUMN", "LEVEL").count().sort("COLUMN", "LEVEL").show()
CodePudding user response:
This can be done using the stack SQL function, which pairs each quoted column-name literal with the corresponding column value and emits one (COLUMN, LEVEL) row per pair:
import pyspark.sql.functions as F

df.select(
    F.expr("stack(2, 'FRUIT', FRUIT, 'COLOR', COLOR) as (COLUMN, LEVEL)")
).groupBy("COLUMN", "LEVEL").count().show()
# +------+------+-----+
# |COLUMN| LEVEL|count|
# +------+------+-----+
# | FRUIT|  Pear|    2|
# | COLOR|Yellow|    3|
# | COLOR|   Red|    2|
# | FRUIT| Apple|    3|
# +------+------+-----+
CodePudding user response:
# Reshape with the pandas-on-Spark API (df is the Spark DataFrame from the question)
df.to_pandas_on_spark()\
  .stack()\
  .reset_index(level=1)\
  .rename(columns={"level_1": "column", 0: "level"})\
  .groupby(["column", "level"]).size()
column  level
FRUIT   Pear      2
COLOR   Yellow    3
        Red       2
FRUIT   Apple     3
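If a plain Spark DataFrame with an explicit count column is preferred (to match the other answers), the pandas-on-Spark result can presumably be converted back along these lines (untested sketch; freqs is just an illustrative name for the Series produced above):
# `freqs` = the pandas-on-Spark Series returned by the chain above (illustrative name)
freqs.rename("count").reset_index().to_spark().show()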