I have a DataFrame like the one below in PySpark, but with many more columns, and I need to group it by month.
date | col_1 | col_2 | col_3 | col_4 | col_5 |
---|---|---|---|---|---|
2021-01-01 | 68.045,60 | 45.455,64 | 12.258,66 | 77.885,69 | 11.111,69 |
2021-01-02 | 98.745,65 | 88.971,25 | 23.533,69 | 9.698,56 | 45.885,74 |
2021-03-03 | 12.336,93 | 24.669,61 | 12.458.69 | 23.456.78 | 99.478,78 |
First I did this:
df = df.withColumn('month', date_format(date_trunc('month', col("date")), 'yyyy-MM-dd'))
And now I have the column month, which can be used to group the data, but I don't know how to aggregate all of my columns (there are about 20 of them).
date | col_1 | col_2 | col_3 | col_4 | col_5 | month |
---|---|---|---|---|---|---|
2021-01-01 | 68.045,60 | 45.455,64 | 12.258,66 | 77.885,69 | 11.111,69 | 2021-01-01 |
2021-01-02 | 98.745,65 | 88.971,25 | 23.533,69 | 9.698,56 | 45.885,74 | 2021-01-01 |
2021-03-03 | 12.336,93 | 24.669,61 | 12.458.69 | 23.456.78 | 99.478,78 | 2021-03-01 |
Can you guys help me, please?
Thank you!
CodePudding user response:
You could aggregate the columns with a selectExpr like below.
Just recreating the dataset:
# Databricks utility: write a small space-delimited sample file to DBFS
dbutils.fs.put('dbfs:/sample/data.csv',"""date col_1 col_2 col_3 col_4 col_5
2021-01-01 68045.60 45455.64 12258.66 77885.69 11111.69
2021-01-02 98745.65 88971.25 23533.69 9698.56 45885.74
2021-03-03 12336.93 24669.61 1245869 2345678 99478.78""",True)
df = (spark.read
      .option('header', True)
      .option('delimiter', ' ')
      .csv('dbfs:/sample/data.csv'))
df.show()
+----------+--------+--------+--------+--------+--------+
|      date|   col_1|   col_2|   col_3|   col_4|   col_5|
+----------+--------+--------+--------+--------+--------+
|2021-01-01|68045.60|45455.64|12258.66|77885.69|11111.69|
|2021-01-02|98745.65|88971.25|23533.69| 9698.56|45885.74|
|2021-03-03|12336.93|24669.61| 1245869| 2345678|99478.78|
+----------+--------+--------+--------+--------+--------+
and then:
from pyspark.sql.functions import trunc

cols = [f'col_{x}' for x in range(1, 6)]
df_answer = (df
    .withColumn('first_day_of_month', trunc('date', 'month'))
    .drop('date')
    .selectExpr("first_day_of_month", f"{' + '.join(cols)} as sum"))
df_answer.show()
+------------------+------------------+
|first_day_of_month|               sum|
+------------------+------------------+
|        2021-01-01|         214757.28|
|        2021-01-01|         266834.89|
|        2021-03-01|        3728032.32|
+------------------+------------------+
In the end, you can group by first_day_of_month:
df_answer.groupBy('first_day_of_month').sum('sum').show()
+------------------+------------------+
|first_day_of_month|          sum(sum)|
+------------------+------------------+
|        2021-01-01|         481592.17|
|        2021-03-01|        3728032.32|
+------------------+------------------+
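For completeness, the truncation, row-wise sum, and monthly aggregation can also be chained into a single statement. This is only a sketch, reusing the df and cols defined above:

from pyspark.sql.functions import expr, trunc

# One-pass variant of the steps above: truncate the date, build the row-wise
# sum from the column list, then aggregate per month.
monthly = (df
    .withColumn('first_day_of_month', trunc('date', 'month'))
    .withColumn('row_sum', expr(' + '.join(cols)))
    .groupBy('first_day_of_month')
    .sum('row_sum'))
monthly.show()

The result column is named sum(row_sum) by default and can be renamed with withColumnRenamed if needed.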
CodePudding user response:
You can use a list comprehension to build the aggregation expressions and apply them all in a single agg.
from pyspark.sql import functions as F

data = [
('2021-01-01', 68.04560, 45.45564, 12.25866, 77.88569, 11.11169, 0.2, -1.3),
('2021-01-02', 98.74565, 88.97125, 23.53369, 9.69856, 45.88574, 0.6, -2.6),
('2021-03-03', 12.33693, 24.66961, 12.45869, 23.45678, 99.47878, -3.0, -4.8)
]
schema = ['date', 'col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col6', 'col7']
df = spark.createDataFrame(data, schema)
result_df = df.groupBy(F.date_format(F.date_trunc('MM', 'date'), 'yyyy-MM-dd').alias('month')) \
.agg(*[F.sum(c).alias(c) for c in df.columns if c != 'date'])
result_df.show()
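Note that the columns in the original question look like strings in a European number format ("68.045,60"), while the sample data here already uses numeric types. If the real data arrives as such strings, one possible pre-processing step (the name df_clean is just illustrative) is to normalise them to doubles before the groupBy:

from pyspark.sql import functions as F

# Assumed formatting: "68.045,60" means 68045.60. Drop the thousands dots,
# turn the decimal comma into a dot, then cast to double so F.sum works.
value_cols = [c for c in df.columns if c != 'date']
df_clean = df.select(
    'date',
    *[F.regexp_replace(F.regexp_replace(F.col(c), r'\.', ''), ',', '.')
        .cast('double')
        .alias(c)
      for c in value_cols]
)

The same groupBy / agg shown above can then be applied to df_clean.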