I'm using pyspark, and I have data like this:
col1 | col2 | col3 |
---|---|---|
1 | 0 | 1 |
1 | 1 | 0 |
1 | 1 | 0 |
1 | 0 | 0 |
My desired output is:
col | sum |
---|---|
col1 | 4 |
col2 | 2 |
col3 | 1 |
My first thought was to put the column names in a list, loop through it, and each time sum that column and union the results to a new df.
Then I thought, maybe it's possible to do multiple aggregations, e.g.:
df.agg(sum('col1'), sum('col2'))
... and then figure out a way to unpivot.
Is there an easier way?
CodePudding user response:
You can use the `stack` SQL function to unpivot a dataframe. With `input` as your input dataframe, your code becomes:
from pyspark.sql import functions as F

output = input.agg(
    F.sum("col1").alias("col1"),
    F.sum("col2").alias("col2"),
    F.sum("col3").alias("col3")
).select(
    F.expr("stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (col, sum)")
)
If you have the following `input` dataframe:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|1   |0   |1   |
|1   |1   |0   |
|1   |1   |0   |
|1   |0   |0   |
+----+----+----+
You will get the following output dataframe:

+----+---+
|col |sum|
+----+---+
|col1|4  |
|col2|2  |
|col3|1  |
+----+---+
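
If there are more columns, the stack expression can be built programmatically instead of spelled out by hand. Here is a sketch (the helper name `build_stack_expr` is mine, not part of Spark):

```python
def build_stack_expr(cols, name_col="col", value_col="sum"):
    """Build the stack() SQL expression that unpivots one row of
    aggregated values into (name, value) pairs, one row per column."""
    # stack(n, 'name1', val1, 'name2', val2, ...) emits one row per pair
    pairs = ", ".join("'{0}', {0}".format(c) for c in cols)
    return "stack({}, {}) as ({}, {})".format(len(cols), pairs, name_col, value_col)

expr = build_stack_expr(["col1", "col2", "col3"])
# expr == "stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (col, sum)"
```

You would then run `input.agg(*[F.sum(c).alias(c) for c in input.columns]).select(F.expr(build_stack_expr(input.columns)))` to get the same output for any number of columns.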
CodePudding user response:
There is no easier way as far as I know. You can unpivot after aggregating, either by converting to a Pandas dataframe and calling `transpose` on it, or by building a map and exploding the map to get the result as `col` and `sum` columns.
from pyspark.sql import functions as F

# Assuming the initial dataframe is df
aggDF = df.agg(*[F.sum(F.col(col_name)).alias(col_name) for col_name in df.columns])

# Using pandas
aggDF.toPandas().transpose().reset_index().rename({'index': 'col', 0: 'sum'}, axis=1)

# Going Spark all the way
aggDF.withColumn(
    "col",
    F.create_map([e for col in aggDF.columns for e in (F.lit(col), F.col(col))])
).selectExpr("explode(col) as (col, sum)").show()
# Both return
"""
+----+---+
| col|sum|
+----+---+
|col1|  4|
|col2|  2|
|col3|  1|
+----+---+
"""
This works for more than 3 columns, if required.
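
The comprehension passed to `create_map` above just alternates each column name with its value expression. In plain Python terms (with strings standing in for the `F.lit`/`F.col` expressions, purely for illustration):

```python
def interleave_name_value(cols):
    """Mirror the create_map argument list: alternate each column
    name (the F.lit part) with its value reference (the F.col part)."""
    return [item for c in cols for item in ("lit(" + c + ")", "col(" + c + ")")]

interleave_name_value(["col1", "col2"])
# -> ['lit(col1)', 'col(col1)', 'lit(col2)', 'col(col2)']
```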
CodePudding user response:
You can first sum each column:
// input
val df = List((1,0,1),(1,1,0),(1,1,0),(1,0,0)).toDF("col1", "col2", "col3")
df.show
// sum each column
val sums = df.agg(sum("col1").as("col1"), sum("col2").as("col2"),
  sum("col3").as("col3"))
sums.show
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   4|   2|   1|
+----+----+----+
This gives you a dataset with one row and three columns, which you can easily collect. If that's what you want, create a new dataset with:
val sumRow = sums.first
val sumDS = List(
  "col1" -> sumRow.getAs[Long]("col1"),
  "col2" -> sumRow.getAs[Long]("col2"),
  "col3" -> sumRow.getAs[Long]("col3")
).toDF("col", "sum")
sumDS.show
+----+---+
| col|sum|
+----+---+
|col1|  4|
|col2|  2|
|col3|  1|
+----+---+
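
Since the question uses pyspark, here is a sketch of the same collect-and-rebuild idea generalized to any number of columns, so the `getAs` calls don't have to be repeated per column. It assumes `sums.first().asDict()` gives the one-row aggregate as a dict and `spark` is an active SparkSession:

```python
def pairs_from_sums(sum_dict):
    """Turn {'col1': 4, 'col2': 2, ...} -- as returned by
    sums.first().asDict() -- into (col, sum) rows for a new dataframe."""
    return [(name, total) for name, total in sum_dict.items()]

rows = pairs_from_sums({"col1": 4, "col2": 2, "col3": 1})
# rows == [("col1", 4), ("col2", 2), ("col3", 1)]
# In Spark: spark.createDataFrame(rows, ["col", "sum"]).show()
```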