Python Pandas: resample based on just one of the columns-CodePudding

I have the following data and I'm resampling my data to find out how many bikes arrive at each of the stations every 15 minutes. However, my code is aggregating my stations too, and I only want to aggregate the variable "dtm_end_trip"

Sample data:

id_trip	dtm_start_trip	dtm_end_trip	start_station	end_station
1	2018-10-01 10:15:00	2018-10-01 10:17:00	A	B
2	2018-10-01 10:17:00	2018-10-01 10:18:00	B	A
...	...	...	...	...
999999	2021-12-31 23:58:00	2022-01-01 00:22:00	C	A
1000000	2021-12-31 23:59:00	2022-01-01 00:29:00	A	D

Trial code:

df2 =  df(['end_station', 'dtm_end_trip']).size().to_frame(name = 'count').reset_index()
df2 = df2.sort_values(by='count', ascending=False)

df2= df2.set_index('dtm_end_trip')

df2 = df2.resample('15T').count()

Output I get:

dtm_end_trip	end_station	count
2018-10-01 00:15:00	2	2
2018-10-01 00:30:00	0	0
2018-10-01 00:45:00	1	1
2018-10-01 01:00:00	2	2
2018-10-01 01:15:00	1	1

Desired output:

dtm_end_trip	end_station	count
2018-10-01 00:15:00	A	2
2018-10-01 00:15:00	B	0
2018-10-01 00:15:00	C	1
2018-10-01 00:15:00	D	2
2018-10-01 00:30:00	A	3
2018-10-01 00:30:00	B	2

The count column of the table above was, in this case, constructed with random numbers with the sole purpose of exemplifying the architecture of the desired output.

CodePudding user response：

You can use pd.Grouper like this:

out = df.groupby([
    pd.Grouper(freq='15min', key='dtm_end_trip'),
    'end_station',
]).size()

>>> out
dtm_end_trip         end_station
2018-10-01 10:15:00  A              1
                     B              1
2022-01-01 00:15:00  A              1
                     D              1
dtype: int64

The result is a Series, but you can easily convert it to a DataFrame with the same headings as per your desired output:

>>> out.to_frame('count').reset_index()
         dtm_end_trip end_station  count
0 2018-10-01 10:15:00           A      1
1 2018-10-01 10:15:00           B      1
2 2022-01-01 00:15:00           A      1
3 2022-01-01 00:15:00           D      1

Note: this is the result from the four rows in your sample input data.