I have two pandas DataFrames in Python, df_main and df_aux.
df_main is a table of events, with the datetime at which each event happened and a "Description" column that gives a coded location. It has the following structure:
Date | Description |
---|---|
2022-01-01 13:45:23 | A |
2022-01-01 14:22:00 | C |
2022-01-01 16:15:33 | D |
2022-01-01 16:21:22 | E |
2022-01-02 13:21:56 | B |
2022-01-02 14:45:41 | B |
2022-01-02 15:11:34 | C |
df_aux is a table that gives, at 1-hour granularity, the number of other events (say, people walking by between Initial_Date and Final_Date) happening in each location (A, B, C, D). The structure of df_aux is as follows:
Initial_Date | Final_Date | A | B | C | D |
---|---|---|---|---|---|
2022-01-01 12:00:00 | 2022-01-01 12:59:59 | 2 | 0 | 1 | 2 |
2022-01-01 13:00:00 | 2022-01-01 13:59:59 | 3 | 2 | 4 | 5 |
2022-01-01 14:00:00 | 2022-01-01 14:59:59 | 2 | 2 | 7 | 0 |
2022-01-01 15:00:00 | 2022-01-01 15:59:59 | 5 | 2 | 2 | 0 |
2022-01-02 12:00:00 | 2022-01-02 12:59:59 | 1 | 1 | 0 | 3 |
2022-01-02 13:00:00 | 2022-01-02 13:59:59 | 5 | 5 | 0 | 3 |
2022-01-02 14:00:00 | 2022-01-02 14:59:59 | 2 | 3 | 2 | 1 |
2022-01-02 15:00:00 | 2022-01-02 15:59:59 | 3 | 4 | 1 | 0 |
My problem is that I need to add a new column to df_main with the number of people who walked by in the hour before the event. For example, the first event happens at 13:45:23, so we go to df_aux and look up the time one hour earlier (12:45:23). That falls in the first row, as 12:45:23 is between 12:00:00 and 12:59:59. In that time range, column A has a value of 2, so the new column in df_main, "People_prev_hour", takes the value 2.
Following the same logic, the full df_main would be:
Date | Description | People_prev_hour |
---|---|---|
2022-01-01 13:45:23 | A | 2 |
2022-01-01 14:22:00 | C | 4 |
2022-01-01 16:15:33 | D | 0 |
2022-01-01 16:21:22 | E | NaN |
2022-01-02 13:21:56 | B | 1 |
2022-01-02 14:45:41 | B | 5 |
2022-01-02 15:11:34 | F | NaN |
The datetimes in df_main will always be covered by df_aux, but the Description values may not be. As seen in the full df_main, two rows have the Description values E and F, which are not columns in df_aux. In those cases the new column must be NaN.
I can't think of a way of merging these two dfs into the desired output, as pd.merge uses common columns, and I haven't managed to get anywhere with pd.melt or pd.pivot. Any help is much appreciated!
CodePudding user response:
My first idea is to use merge_asof, because the hourly intervals do not overlap:
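import numpy as np
import pandas as pd

# shift each hourly bucket forward by one hour, so an event matches the previous hour's row
df1 = pd.merge_asof(df_main,
                    df_aux.assign(Initial_Date=df_aux['Initial_Date'] + pd.Timedelta(1, 'hour')),
                    left_on='Date',
                    right_on='Initial_Date')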
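Then use an indexing lookup to pick, for each row, the column named in Description:
# idx: integer code of each row's Description; cols: the unique Description values
idx, cols = pd.factorize(df1['Description'])
# reindex adds an all-NaN column for any value that is not a column of df_aux (such as E)
df_main['People_prev_hour'] = df1.reindex(cols, axis=1).to_numpy()[np.arange(len(df1)), idx]
print(df_main)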
Date Description People_prev_hour
0 2022-01-01 13:45:23 A 2.0
1 2022-01-01 14:22:00 C 4.0
2 2022-01-01 16:15:33 D 0.0
3 2022-01-01 16:21:22 E NaN
4 2022-01-02 13:21:56 B 1.0
5 2022-01-02 14:45:41 B 5.0
6 2022-01-02 15:11:34 C 2.0
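For anyone who wants to run the merge_asof approach end to end, here is a self-contained sketch. The sample frames are rebuilt from the tables in the question, and the names `starts`, `shifted_start` and `values` are just mine for illustration. It assumes both frames are already sorted by their datetime columns, which merge_asof requires.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's tables.
df_main = pd.DataFrame({
    "Date": pd.to_datetime([
        "2022-01-01 13:45:23", "2022-01-01 14:22:00", "2022-01-01 16:15:33",
        "2022-01-01 16:21:22", "2022-01-02 13:21:56", "2022-01-02 14:45:41",
        "2022-01-02 15:11:34",
    ]),
    "Description": ["A", "C", "D", "E", "B", "B", "C"],
})
starts = pd.to_datetime([
    "2022-01-01 12:00:00", "2022-01-01 13:00:00", "2022-01-01 14:00:00",
    "2022-01-01 15:00:00", "2022-01-02 12:00:00", "2022-01-02 13:00:00",
    "2022-01-02 14:00:00", "2022-01-02 15:00:00",
])
df_aux = pd.DataFrame({
    "Initial_Date": starts,
    "Final_Date": starts + pd.Timedelta(minutes=59, seconds=59),
    "A": [2, 3, 2, 5, 1, 5, 2, 3],
    "B": [0, 2, 2, 2, 1, 5, 3, 4],
    "C": [1, 4, 7, 2, 0, 0, 2, 1],
    "D": [2, 5, 0, 0, 3, 3, 1, 0],
})

# Shift each bucket start forward by one hour. The cast keeps both merge keys
# on the same datetime dtype (a no-op when they already match).
shifted_start = (df_aux["Initial_Date"] + pd.Timedelta(hours=1)).astype(df_main["Date"].dtype)

# Backward as-of merge: each event picks the latest shifted start <= its Date.
df1 = pd.merge_asof(df_main, df_aux.assign(Initial_Date=shifted_start),
                    left_on="Date", right_on="Initial_Date")

# Row-wise lookup: idx[i] is the position in cols of row i's Description.
idx, cols = pd.factorize(df1["Description"])
values = df1.reindex(cols, axis=1).to_numpy()[np.arange(len(df1)), idx]

df_main["People_prev_hour"] = values
print(df_main)
```

The reindex step is what produces the NaN for E: it creates an empty column for every Description value that df_aux does not have.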
Another idea uses an IntervalIndex:
# intervals shifted forward by one hour and closed on both ends
s = pd.IntervalIndex.from_arrays(df_aux.Initial_Date + pd.Timedelta(1, 'hour'),
                                 df_aux.Final_Date + pd.Timedelta(1, 'hour'), 'both')
df1 = df_aux.set_index(s).loc[df_main.Date]
print(df1)
Initial_Date \
[2022-01-01 13:00:00, 2022-01-01 13:59:59] 2022-01-01 12:00:00
[2022-01-01 14:00:00, 2022-01-01 14:59:59] 2022-01-01 13:00:00
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:00:00
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:00:00
[2022-01-02 13:00:00, 2022-01-02 13:59:59] 2022-01-02 12:00:00
[2022-01-02 14:00:00, 2022-01-02 14:59:59] 2022-01-02 13:00:00
[2022-01-02 15:00:00, 2022-01-02 15:59:59] 2022-01-02 14:00:00
Final_Date A B C D
[2022-01-01 13:00:00, 2022-01-01 13:59:59] 2022-01-01 12:59:59 2 0 1 2
[2022-01-01 14:00:00, 2022-01-01 14:59:59] 2022-01-01 13:59:59 3 2 4 5
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:59:59 5 2 2 0
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:59:59 5 2 2 0
[2022-01-02 13:00:00, 2022-01-02 13:59:59] 2022-01-02 12:59:59 1 1 0 3
[2022-01-02 14:00:00, 2022-01-02 14:59:59] 2022-01-02 13:59:59 5 5 0 3
[2022-01-02 15:00:00, 2022-01-02 15:59:59] 2022-01-02 14:59:59 2 3 2 1
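One caveat with the `.loc` lookup: as far as I know it raises a KeyError if any event time falls outside every interval. The question says the datetimes are always covered, so that should not happen here, but if it could, `IntervalIndex.get_indexer` is one way to get a NaN instead. A minimal sketch on toy data (the frame `aux`, the series `events` and the other names are hypothetical, not from the question):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, much smaller than the question's tables).
starts = pd.to_datetime(["2022-01-01 12:00:00", "2022-01-01 13:00:00"])
aux = pd.DataFrame({
    "Initial_Date": starts,
    "Final_Date": starts + pd.Timedelta(minutes=59, seconds=59),
    "A": [2, 3],
})
# The second event has no "previous hour" row in aux.
events = pd.Series(pd.to_datetime(["2022-01-01 13:45:23", "2022-01-01 18:00:00"]))

shift = pd.Timedelta(hours=1)
# Cast so the interval bounds share the events' datetime dtype
# (a no-op when they already match).
left = (aux["Initial_Date"] + shift).astype(events.dtype)
right = (aux["Final_Date"] + shift).astype(events.dtype)
intervals = pd.IntervalIndex.from_arrays(left, right, closed="both")

# get_indexer gives the position of the containing interval, or -1 if there is none.
pos = intervals.get_indexer(pd.Index(events))

# Mask the -1 entries explicitly; indexing with -1 would silently pick the last row.
counts = np.where(pos >= 0, aux["A"].to_numpy()[pos], np.nan)
print(counts)
```

The masking step matters because a plain positional lookup with -1 would return the last row of `aux` rather than a missing value.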
idx, cols = pd.factorize(df_main['Description'])
df_main['People_prev_hour'] = df1.reindex(cols, axis=1).to_numpy()[np.arange(len(df1)), idx]
print(df_main)
Date Description People_prev_hour
0 2022-01-01 13:45:23 A 2.0
1 2022-01-01 14:22:00 C 4.0
2 2022-01-01 16:15:33 D 0.0
3 2022-01-01 16:21:22 E NaN
4 2022-01-02 13:21:56 B 1.0
5 2022-01-02 14:45:41 B 5.0
6 2022-01-02 15:11:34 C 2.0
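Since the question mentions pd.melt, a third option is to reshape df_aux to long format and let merge_asof match on Description as well, through its `by` parameter. This is only a sketch under the same assumptions as above (sample data rebuilt from the question, df_main sorted by Date). The `tolerance` argument is my own addition: it keeps an event from matching a stale row if an hour were ever missing from df_aux.

```python
import pandas as pd

# Sample data copied from the question's tables.
df_main = pd.DataFrame({
    "Date": pd.to_datetime([
        "2022-01-01 13:45:23", "2022-01-01 14:22:00", "2022-01-01 16:15:33",
        "2022-01-01 16:21:22", "2022-01-02 13:21:56", "2022-01-02 14:45:41",
        "2022-01-02 15:11:34",
    ]),
    "Description": ["A", "C", "D", "E", "B", "B", "C"],
})
starts = pd.to_datetime([
    "2022-01-01 12:00:00", "2022-01-01 13:00:00", "2022-01-01 14:00:00",
    "2022-01-01 15:00:00", "2022-01-02 12:00:00", "2022-01-02 13:00:00",
    "2022-01-02 14:00:00", "2022-01-02 15:00:00",
])
df_aux = pd.DataFrame({
    "Initial_Date": starts,
    "Final_Date": starts + pd.Timedelta(minutes=59, seconds=59),
    "A": [2, 3, 2, 5, 1, 5, 2, 3],
    "B": [0, 2, 2, 2, 1, 5, 3, 4],
    "C": [1, 4, 7, 2, 0, 0, 2, 1],
    "D": [2, 5, 0, 0, 3, 3, 1, 0],
})

# Wide -> long: one row per (hour, location) pair.
long_aux = df_aux.drop(columns="Final_Date").melt(
    id_vars="Initial_Date", var_name="Description", value_name="People_prev_hour"
)
# Shift the bucket start by one hour; the cast keeps both merge keys on one dtype.
long_aux["Initial_Date"] = (
    long_aux["Initial_Date"] + pd.Timedelta(hours=1)
).astype(df_main["Date"].dtype)
long_aux = long_aux.sort_values("Initial_Date")  # merge_asof needs sorted keys

out = pd.merge_asof(
    df_main, long_aux,
    left_on="Date", right_on="Initial_Date",
    by="Description",                                # match within the same location only
    tolerance=pd.Timedelta(minutes=59, seconds=59),  # stay inside the shifted hour
).drop(columns="Initial_Date")
print(out)
```

Rows whose Description has no counterpart in df_aux, such as E, simply find no match and get NaN, so no separate lookup step is needed.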