Home > database >  Pandas - resample timeseries without loss of information
Pandas - resample timeseries without loss of information

Time:06-10

I want to aggregate the following dataframe from minute to 10 minutes. I've been trying to use resample but I realized that I was losing another column. Is it possible to avoid this? I'm guessing this involves some sort of group by as well?

Dataframe looks like this, where date_time is the index, after comes a name column and a count column.

date_time           name        count
2022-06-03 17:00:00 LoveToKnow  165
2022-06-03 17:00:00 Investing Channel   176
2022-06-03 17:00:00 D17 14
2022-06-03 17:00:00 IFLScience  58
2022-06-03 17:00:00 MotherJones 1
.................
2022-06-03 17:01:00 LoveToKnow  100
2022-06-03 17:01:00 Investing Channel   15
2022-06-03 17:01:00 D17 1
2022-06-03 17:01:00 IFLScience  5
2022-06-03 17:01:00 MotherJones 1
.................

If I were to perform a resample like below, it would group the values and sum them but I would be losing the name column during the process.

df.resample('10min').agg({'count': 'sum'}) or
df.resample('10min').sum()


date_time           count
2022-06-03 17:00:00 81816
2022-06-03 17:10:00 77504
2022-06-03 17:20:00 73605
2022-06-03 17:30:00 70994
2022-06-03 17:40:00 68658

The output that I am expecting should be as follows:

date_time           name        count
2022-06-03 17:00:00 LoveToKnow  1200
2022-06-03 17:00:00 Investing Channel   125
2022-06-03 17:00:00 D17 300
2022-06-03 17:00:00 IFLScience  900
2022-06-03 17:00:00 MotherJones 8
............
2022-06-03 17:10:00 LoveToKnow  700
2022-06-03 17:10:00 Investing Channel   25
2022-06-03 17:10:00 D17 400
2022-06-03 17:10:00 IFLScience  400
2022-06-03 17:10:00 MotherJones 80
.....

CodePudding user response:

I think you need grouping by name with DataFrame.resample with aggregate sum:

df1 = df.groupby('name').resample('10min').sum().swaplevel(1,0).reset_index()
print (df1)
            date_time               name  count
0 2022-06-03 17:00:00                D17     15
1 2022-06-03 17:00:00         IFLScience     63
2 2022-06-03 17:00:00  Investing Channel    191
3 2022-06-03 17:00:00         LoveToKnow    265
4 2022-06-03 17:00:00        MotherJones      2

Or solution with Grouper:

df1 = df.groupby([pd.Grouper(freq='10Min'), 'name']).sum().reset_index()
print (df1)
            date_time               name  count
0 2022-06-03 17:00:00                D17     15
1 2022-06-03 17:00:00         IFLScience     63
2 2022-06-03 17:00:00  Investing Channel    191
3 2022-06-03 17:00:00         LoveToKnow    265
4 2022-06-03 17:00:00        MotherJones      2

If need only first name per 10minutes ouput is different:

df2 = df.resample('10min').agg({'name': 'first', 'count': 'sum'})
print (df2)
                           name  count
date_time                             
2022-06-03 17:00:00  LoveToKnow    536
  • Related