Home > front end >  Python Pandas: resample based on just one of the columns
Python Pandas: resample based on just one of the columns

Time:04-10

I have the following data and I'm resampling my data to find out how many bikes arrive at each of the stations every 15 minutes. However, my code is aggregating my stations too, and I only want to aggregate the variable "dtm_end_trip"

Sample data:

id_trip dtm_start_trip dtm_end_trip start_station end_station
1 2018-10-01 10:15:00 2018-10-01 10:17:00 A B
2 2018-10-01 10:17:00 2018-10-01 10:18:00 B A
... ... ... ... ...
999999 2021-12-31 23:58:00 2022-01-01 00:22:00 C A
1000000 2021-12-31 23:59:00 2022-01-01 00:29:00 A D

Trial code:

df2 =  df(['end_station', 'dtm_end_trip']).size().to_frame(name = 'count').reset_index()
df2 = df2.sort_values(by='count', ascending=False)

df2= df2.set_index('dtm_end_trip')

df2 = df2.resample('15T').count()

Output I get:

dtm_end_trip end_station count
2018-10-01 00:15:00 2 2
2018-10-01 00:30:00 0 0
2018-10-01 00:45:00 1 1
2018-10-01 01:00:00 2 2
2018-10-01 01:15:00 1 1

Desired output:

dtm_end_trip end_station count
2018-10-01 00:15:00 A 2
2018-10-01 00:15:00 B 0
2018-10-01 00:15:00 C 1
2018-10-01 00:15:00 D 2
2018-10-01 00:30:00 A 3
2018-10-01 00:30:00 B 2

The count column of the table above was, in this case, constructed with random numbers with the sole purpose of exemplifying the architecture of the desired output.

CodePudding user response:

You can use pd.Grouper like this:

out = df.groupby([
    pd.Grouper(freq='15min', key='dtm_end_trip'),
    'end_station',
]).size()

>>> out
dtm_end_trip         end_station
2018-10-01 10:15:00  A              1
                     B              1
2022-01-01 00:15:00  A              1
                     D              1
dtype: int64

The result is a Series, but you can easily convert it to a DataFrame with the same headings as per your desired output:

>>> out.to_frame('count').reset_index()
         dtm_end_trip end_station  count
0 2018-10-01 10:15:00           A      1
1 2018-10-01 10:15:00           B      1
2 2022-01-01 00:15:00           A      1
3 2022-01-01 00:15:00           D      1

Note: this is the result from the four rows in your sample input data.

  • Related