Using a loop over the column list is rather slow (a rough sketch of that loop follows the example below). Is there a way to do it with a single 'table scan'?
Example source table:
FRUIT  COLOR
Apple  Red
Apple  Red
Apple  Yellow
Pear   Yellow
Pear   Yellow
Desired output:
COLUMN  LEVEL   COUNT
FRUIT   Apple   3
FRUIT   Pear    2
COLOR   Red     2
COLOR   Yellow  3
Again, the goal is to get the 1-way frequencies of multiple categorical variables in an efficient manner.
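Roughly, the per-column loop I am trying to avoid looks like this (simplified; df is the source table above):
import pyspark.sql.functions as F

# One groupBy/count per column, then union the per-column results
parts = []
for c in df.columns:
    parts.append(
        df.groupBy(c).count()
          .select(F.lit(c).alias("COLUMN"), F.col(c).alias("LEVEL"), "count")
    )

freq = parts[0]
for p in parts[1:]:
    freq = freq.unionByName(p)
freq.show()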
CodePudding user response:
leleogere's answer achieves what I was after. I am posting a generalization for reference:
import pyspark.sql.functions as F

data = [
    ('Apple',  'Red',    'small'),
    ('Apple',  'Red',    'small'),
    ('Apple',  'Yellow', 'large'),
    ('Pear',   'Yellow', 'large'),
    ('Pear',   'Yellow', 'large'),
    ('Orange', 'Orange', 'large')]
varlist = ['FRUIT', 'COLOR', 'SIZE']

df = spark.createDataFrame(data, varlist)

# Build the stack() argument list: a quoted column-name literal followed by the column itself
stack_string = ', '.join([f"'{v}', {v}" for v in varlist])

df.select(
    F.expr(f"stack({len(varlist)}, {stack_string}) as (COLUMN, LEVEL)")
).groupBy("COLUMN", "LEVEL").count().sort("COLUMN", "LEVEL").show()
CodePudding user response:
This can be done using the stack SQL function, which pairs each quoted column-name literal with the corresponding column value and emits one (COLUMN, LEVEL) row per pair:
import pyspark.sql.functions as F

df.select(
    F.expr("stack(2, 'FRUIT', FRUIT, 'COLOR', COLOR) as (COLUMN, LEVEL)")
).groupBy("COLUMN", "LEVEL").count().show()
# +------+------+-----+
# |COLUMN| LEVEL|count|
# +------+------+-----+
# | FRUIT|  Pear|    2|
# | COLOR|Yellow|    3|
# | COLOR|   Red|    2|
# | FRUIT| Apple|    3|
# +------+------+-----+
CodePudding user response:
# Reshape with the pandas-on-Spark API (df is the Spark DataFrame from the question)
df.to_pandas_on_spark()\
  .stack()\
  .reset_index(level=1)\
  .rename(columns={"level_1": "column", 0: "level"})\
  .groupby(["column", "level"]).size()
column  level
FRUIT   Pear      2
COLOR   Yellow    3
        Red       2
FRUIT   Apple     3
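If a plain Spark DataFrame with an explicit count column is preferred (to match the other answers), the pandas-on-Spark result can presumably be converted back along these lines (untested sketch; freqs is just an illustrative name for the Series produced above):
# `freqs` = the pandas-on-Spark Series returned by the chain above (illustrative name)
freqs.rename("count").reset_index().to_spark().show()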