I'm using pyspark, and I have data like this:
col1 | col2 | col3 |
---|---|---|
1 | 0 | 1 |
1 | 1 | 0 |
1 | 1 | 0 |
1 | 0 | 0 |
My desired output is:
col | sum |
---|---|
col1 | 4 |
col2 | 2 |
col3 | 1 |
My first thought was to put the column names in a list, loop through it, and each time sum that column and union the results to a new df.
Then I thought, maybe it's possible to do multiple aggregations, e.g.:
df.agg(sum('col1'), sum('col2'))
... and then figure out a way to unpivot.
Is there an easier way?
CodePudding user response:
You can use the `stack` SQL function to unpivot a dataframe. With `input` as your input dataframe, your code becomes:
from pyspark.sql import functions as F

output = input.agg(
    F.sum("col1").alias("col1"),
    F.sum("col2").alias("col2"),
    F.sum("col3").alias("col3")
).select(
    F.expr("stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (col, sum)")
)
If you have the following `input` dataframe:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|1   |0   |1   |
|1   |1   |0   |
|1   |1   |0   |
|1   |0   |0   |
+----+----+----+
You will get the following output dataframe:

+----+---+
|col |sum|
+----+---+
|col1|4  |
|col2|2  |
|col3|1  |
+----+---+
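
If there are more columns, the stack expression can be built programmatically instead of spelled out by hand. Here is a sketch (the helper name `build_stack_expr` is mine, not part of Spark):

```python
def build_stack_expr(cols, name_col="col", value_col="sum"):
    """Build the stack() SQL expression that unpivots one row of
    aggregated values into (name, value) pairs, one row per column."""
    # stack(n, 'name1', val1, 'name2', val2, ...) emits one row per pair
    pairs = ", ".join("'{0}', {0}".format(c) for c in cols)
    return "stack({}, {}) as ({}, {})".format(len(cols), pairs, name_col, value_col)

expr = build_stack_expr(["col1", "col2", "col3"])
# expr == "stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (col, sum)"
```

You would then run `input.agg(*[F.sum(c).alias(c) for c in input.columns]).select(F.expr(build_stack_expr(input.columns)))` to get the same output for any number of columns.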
CodePudding user response:
There is no easier way as far as I know. You can unpivot after aggregating, either by converting to a Pandas dataframe and calling `transpose` on it, or by building a map and exploding the map to get the result as `col` and `sum` columns.
from pyspark.sql import functions as F

# Assuming the initial dataframe is df
aggDF = df.agg(*[F.sum(F.col(col_name)).alias(col_name) for col_name in df.columns])

# Using pandas
aggDF.toPandas().transpose().reset_index().rename({'index': 'col', 0: 'sum'}, axis=1)

# Going Spark all the way
aggDF.withColumn(
    "col",
    F.create_map([e for col in aggDF.columns for e in (F.lit(col), F.col(col))])
).selectExpr("explode(col) as (col, sum)").show()
# Both return
"""
+----+---+
| col|sum|
+----+---+
|col1|  4|
|col2|  2|
|col3|  1|
+----+---+
"""
This works for more than 3 columns, if required.
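
The comprehension passed to `create_map` above just alternates each column name with its value expression. In plain Python terms (with strings standing in for the `F.lit`/`F.col` expressions, purely for illustration):

```python
def interleave_name_value(cols):
    """Mirror the create_map argument list: alternate each column
    name (the F.lit part) with its value reference (the F.col part)."""
    return [item for c in cols for item in ("lit(" + c + ")", "col(" + c + ")")]

interleave_name_value(["col1", "col2"])
# -> ['lit(col1)', 'col(col1)', 'lit(col2)', 'col(col2)']
```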
CodePudding user response:
You can first sum each column:
// input
val df = List((1,0,1),(1,1,0),(1,1,0),(1,0,0)).toDF("col1", "col2", "col3")
df.show
// sum each column
val sums = df.agg(sum("col1").as("col1"), sum("col2").as("col2"),
  sum("col3").as("col3"))
sums.show
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   4|   2|   1|
+----+----+----+
This gives you a dataset with one row and three columns, which you can easily collect. If that's what you want, create a new dataset with:
val sumRow = sums.first
val sumDS = List(
  "col1" -> sumRow.getAs[Long]("col1"),
  "col2" -> sumRow.getAs[Long]("col2"),
  "col3" -> sumRow.getAs[Long]("col3")
).toDF("col", "sum")
sumDS.show
+----+---+
| col|sum|
+----+---+
|col1|  4|
|col2|  2|
|col3|  1|
+----+---+
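
Since the question uses pyspark, here is a sketch of the same collect-and-rebuild idea generalized to any number of columns, so the `getAs` calls don't have to be repeated per column. It assumes `sums.first().asDict()` gives the one-row aggregate as a dict and `spark` is an active SparkSession:

```python
def pairs_from_sums(sum_dict):
    """Turn {'col1': 4, 'col2': 2, ...} -- as returned by
    sums.first().asDict() -- into (col, sum) rows for a new dataframe."""
    return [(name, total) for name, total in sum_dict.items()]

rows = pairs_from_sums({"col1": 4, "col2": 2, "col3": 1})
# rows == [("col1", 4), ("col2", 2), ("col3", 1)]
# In Spark: spark.createDataFrame(rows, ["col", "sum"]).show()
```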