I want to aggregate the following dataframe from minute to 10 minutes. I've been trying to use resample but I realized that I was losing another column. Is it possible to avoid this? I'm guessing this involves some sort of group by as well?
Dataframe looks like this, where date_time is the index, after comes a name column and a count column.
date_time name count
2022-06-03 17:00:00 LoveToKnow 165
2022-06-03 17:00:00 Investing Channel 176
2022-06-03 17:00:00 D17 14
2022-06-03 17:00:00 IFLScience 58
2022-06-03 17:00:00 MotherJones 1
.................
2022-06-03 17:01:00 LoveToKnow 100
2022-06-03 17:01:00 Investing Channel 15
2022-06-03 17:01:00 D17 1
2022-06-03 17:01:00 IFLScience 5
2022-06-03 17:01:00 MotherJones 1
.................
If I were to perform a resample like below, it would group the values and sum them but I would be losing the name column during the process.
df.resample('10min').agg({'count': 'sum'}) or
df.resample('10min').sum()
date_time count
2022-06-03 17:00:00 81816
2022-06-03 17:10:00 77504
2022-06-03 17:20:00 73605
2022-06-03 17:30:00 70994
2022-06-03 17:40:00 68658
The output that I am expecting should be as follows:
date_time name count
2022-06-03 17:00:00 LoveToKnow 1200
2022-06-03 17:00:00 Investing Channel 125
2022-06-03 17:00:00 D17 300
2022-06-03 17:00:00 IFLScience 900
2022-06-03 17:00:00 MotherJones 8
............
2022-06-03 17:10:00 LoveToKnow 700
2022-06-03 17:10:00 Investing Channel 25
2022-06-03 17:10:00 D17 400
2022-06-03 17:10:00 IFLScience 400
2022-06-03 17:10:00 MotherJones 80
.....
CodePudding user response:
I think you need grouping by name
with DataFrame.resample
with aggregate sum
:
df1 = df.groupby('name').resample('10min').sum().swaplevel(1,0).reset_index()
print (df1)
date_time name count
0 2022-06-03 17:00:00 D17 15
1 2022-06-03 17:00:00 IFLScience 63
2 2022-06-03 17:00:00 Investing Channel 191
3 2022-06-03 17:00:00 LoveToKnow 265
4 2022-06-03 17:00:00 MotherJones 2
Or solution with Grouper
:
df1 = df.groupby([pd.Grouper(freq='10Min'), 'name']).sum().reset_index()
print (df1)
date_time name count
0 2022-06-03 17:00:00 D17 15
1 2022-06-03 17:00:00 IFLScience 63
2 2022-06-03 17:00:00 Investing Channel 191
3 2022-06-03 17:00:00 LoveToKnow 265
4 2022-06-03 17:00:00 MotherJones 2
If need only first name
per 10minutes ouput is different:
df2 = df.resample('10min').agg({'name': 'first', 'count': 'sum'})
print (df2)
name count
date_time
2022-06-03 17:00:00 LoveToKnow 536