I have the following data and I'm resampling my data to find out how many bikes arrive at each of the stations every 15 minutes. However, my code is aggregating my stations too, and I only want to aggregate the variable "dtm_end_trip"
Sample data:
id_trip | dtm_start_trip | dtm_end_trip | start_station | end_station |
---|---|---|---|---|
1 | 2018-10-01 10:15:00 | 2018-10-01 10:17:00 | A | B |
2 | 2018-10-01 10:17:00 | 2018-10-01 10:18:00 | B | A |
... | ... | ... | ... | ... |
999999 | 2021-12-31 23:58:00 | 2022-01-01 00:22:00 | C | A |
1000000 | 2021-12-31 23:59:00 | 2022-01-01 00:29:00 | A | D |
Trial code:
df2 = df(['end_station', 'dtm_end_trip']).size().to_frame(name = 'count').reset_index()
df2 = df2.sort_values(by='count', ascending=False)
df2= df2.set_index('dtm_end_trip')
df2 = df2.resample('15T').count()
Output I get:
dtm_end_trip | end_station | count |
---|---|---|
2018-10-01 00:15:00 | 2 | 2 |
2018-10-01 00:30:00 | 0 | 0 |
2018-10-01 00:45:00 | 1 | 1 |
2018-10-01 01:00:00 | 2 | 2 |
2018-10-01 01:15:00 | 1 | 1 |
Desired output:
dtm_end_trip | end_station | count |
---|---|---|
2018-10-01 00:15:00 | A | 2 |
2018-10-01 00:15:00 | B | 0 |
2018-10-01 00:15:00 | C | 1 |
2018-10-01 00:15:00 | D | 2 |
2018-10-01 00:30:00 | A | 3 |
2018-10-01 00:30:00 | B | 2 |
The count column of the table above was, in this case, constructed with random numbers with the sole purpose of exemplifying the architecture of the desired output.
CodePudding user response:
You can use pd.Grouper
like this:
out = df.groupby([
pd.Grouper(freq='15min', key='dtm_end_trip'),
'end_station',
]).size()
>>> out
dtm_end_trip end_station
2018-10-01 10:15:00 A 1
B 1
2022-01-01 00:15:00 A 1
D 1
dtype: int64
The result is a Series
, but you can easily convert it to a DataFrame
with the same headings as per your desired output:
>>> out.to_frame('count').reset_index()
dtm_end_trip end_station count
0 2018-10-01 10:15:00 A 1
1 2018-10-01 10:15:00 B 1
2 2022-01-01 00:15:00 A 1
3 2022-01-01 00:15:00 D 1
Note: this is the result from the four rows in your sample input data.