How can I group 20 columns by month on pyspark?


I have a dataframe in PySpark like the one below, but with a lot more columns, and I need to group it by month.

date col_1 col_2 col_3 col_4 col_5
2021-01-01 68.045,60 45.455,64 12.258,66 77.885,69 11.111,69
2021-01-02 98.745,65 88.971,25 23.533,69 9.698,56 45.885,74
2021-03-03 12.336,93 24.669,61 12.458.69 23.456.78 99.478,78

First, I did this:

from pyspark.sql.functions import col, date_format, date_trunc

df = df.withColumn('month', date_format(date_trunc('month', col("date")), 'yyyy-MM-dd'))

Now I have the month column, which can be used to group the data, but I don't know how to aggregate all my columns (there are about 20 of them).

date col_1 col_2 col_3 col_4 col_5 month
2021-01-01 68.045,60 45.455,64 12.258,66 77.885,69 11.111,69 2021-01-01
2021-01-02 98.745,65 88.971,25 23.533,69 9.698,56 45.885,74 2021-01-01
2021-03-03 12.336,93 24.669,61 12.458.69 23.456.78 99.478,78 2021-03-01

Can you guys help me, please?

Thank you!

CodePudding user response:

You could aggregate the columns with a selectExpr like the one below.

Just recreating the dataset.

# Databricks utility: write the sample data to DBFS as a space-delimited CSV file
dbutils.fs.put('dbfs:/sample/data.csv', """date col_1 col_2 col_3 col_4 col_5
2021-01-01 68045.60 45455.64 12258.66 77885.69 11111.69
2021-01-02 98745.65 88971.25 23533.69 9698.56 45885.74
2021-03-03 12336.93 24669.61 1245869 2345678 99478.78""", True)

# Read it back with the header row and the space delimiter
df = (spark.read
        .option('header', True)
        .option('delimiter', ' ')
        .csv('dbfs:/sample/data.csv'))
df.show()

+----------+--------+--------+--------+--------+--------+
|      date|   col_1|   col_2|   col_3|   col_4|   col_5|
+----------+--------+--------+--------+--------+--------+
|2021-01-01|68045.60|45455.64|12258.66|77885.69|11111.69|
|2021-01-02|98745.65|88971.25|23533.69| 9698.56|45885.74|
|2021-03-03|12336.93|24669.61| 1245869| 2345678|99478.78|
+----------+--------+--------+--------+--------+--------+
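Note that the CSV is read without a schema, so every column comes back as a string; the sum below still works because Spark implicitly casts the strings to double inside the SQL expression, but you can make the types explicit if you prefer. A small optional sketch (the cols list simply mirrors the columns used below):

from pyspark.sql.functions import col

cols = [f'col_{x}' for x in range(1, 6)]
# Optional: cast the value columns to double so the sums are unambiguously numeric
df = df.select('date', *[col(c).cast('double').alias(c) for c in cols])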

and then

from pyspark.sql.functions import trunc

cols = [f'col_{x}' for x in range(1, 6)]
# Builds a single SQL expression: "col_1 + col_2 + col_3 + col_4 + col_5 as sum"
df_answer = (df.withColumn('first_day_of_month', trunc('date', 'month')).drop('date')
             .selectExpr("first_day_of_month", f"{' + '.join(cols)} as sum")
             )
df_answer.show()

+------------------+------------------+
|first_day_of_month|               sum|
+------------------+------------------+
|        2021-01-01|         214757.28|
|        2021-01-01|         266834.89|
|        2021-03-01|        3728032.32|
+------------------+------------------+

In the end, you can group by first_day_of_month:

df_answer.groupBy('first_day_of_month').sum('sum').show()

+------------------+------------------+
|first_day_of_month|          sum(sum)|
+------------------+------------------+
|        2021-01-01|         481592.17|
|        2021-03-01|        3728032.32|
+------------------+------------------+
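If you prefer to skip the intermediate per-row sum column, the same result can be computed in one pass; here is a minimal sketch that reuses the cols list from above:

from pyspark.sql.functions import expr, trunc
from pyspark.sql.functions import sum as spark_sum

# Group directly on the truncated date and sum the per-row total in a single aggregation
df_one_step = (df.withColumn('first_day_of_month', trunc('date', 'month'))
               .groupBy('first_day_of_month')
               .agg(spark_sum(expr(' + '.join(cols))).alias('sum')))
df_one_step.show()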

CodePudding user response:

You can use a list comprehension to build the aggregation expressions and apply them all in a single agg.

from pyspark.sql import functions as F

data = [
    ('2021-01-01', 68.04560, 45.45564, 12.25866, 77.88569, 11.11169, 0.2, -1.3),
    ('2021-01-02', 98.74565, 88.97125, 23.53369, 9.69856, 45.88574, 0.6, -2.6),
    ('2021-03-03', 12.33693, 24.66961, 12.45869, 23.45678, 99.47878, -3.0, -4.8)
]
schema = ['date', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col6', 'col7']
df = spark.createDataFrame(data, schema)

# Truncate each date to the first day of its month, then sum every column except 'date'
result_df = df.groupBy(F.date_format(F.date_trunc('MM', 'date'), 'yyyy-MM-dd').alias('month')) \
    .agg(*[F.sum(c).alias(c) for c in df.columns if c != 'date'])
result_df.show()
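Note that in the question the values are European-formatted strings such as 68.045,60. If your real columns arrive like that, they have to be converted to numbers before summing; here is a minimal, hypothetical cleanup sketch (assuming every column except date should be converted):

from pyspark.sql import functions as F

value_cols = [c for c in df.columns if c != 'date']
# Drop the thousands separator ('.'), turn the decimal comma into a dot, then cast to double
df_numeric = df.select(
    'date',
    *[F.regexp_replace(F.regexp_replace(F.col(c), '\\.', ''), ',', '.')
        .cast('double').alias(c) for c in value_cols]
)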