Merge two dataframes in pandas with common info as columns or as cells

I have two pandas DataFrames in Python, df_main and df_aux.

df_main is a table of events, with the datetime at which each event happened and a "Description" column that gives a coded location. It has the following structure:

Date	Description
2022-01-01 13:45:23	A
2022-01-01 14:22:00	C
2022-01-01 16:15:33	D
2022-01-01 16:21:22	E
2022-01-02 13:21:56	B
2022-01-02 14:45:41	B
2022-01-02 15:11:34	C

df_aux is a table that gives the number of other events (for example, people walking by between Initial_Date and Final_Date) happening at each location (A, B, C, D), with 1-hour granularity. The structure of df_aux is as follows:

Initial_Date	Final_Date	A	B	C	D
2022-01-01 12:00:00	2022-01-01 12:59:59	2	0	1	2
2022-01-01 13:00:00	2022-01-01 13:59:59	3	2	4	5
2022-01-01 14:00:00	2022-01-01 14:59:59	2	2	7	0
2022-01-01 15:00:00	2022-01-01 15:59:59	5	2	2	0
2022-01-02 12:00:00	2022-01-02 12:59:59	1	1	0	3
2022-01-02 13:00:00	2022-01-02 13:59:59	5	5	0	3
2022-01-02 14:00:00	2022-01-02 14:59:59	2	3	2	1
2022-01-02 15:00:00	2022-01-02 15:59:59	3	4	1	0
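
For anyone who wants to reproduce this, here is a minimal sketch that builds the two frames shown above. It assumes the date columns should be parsed as datetimes and that Final_Date is always 59 minutes and 59 seconds after Initial_Date, as in the sample; the helper name starts is only for illustration.

```python
import pandas as pd

# Events table from the question: one row per event, with a coded location.
df_main = pd.DataFrame({
    "Date": pd.to_datetime([
        "2022-01-01 13:45:23", "2022-01-01 14:22:00", "2022-01-01 16:15:33",
        "2022-01-01 16:21:22", "2022-01-02 13:21:56", "2022-01-02 14:45:41",
        "2022-01-02 15:11:34",
    ]),
    "Description": ["A", "C", "D", "E", "B", "B", "C"],
})

# Hourly counts per location; `starts` is a hypothetical helper for the interval starts.
starts = pd.to_datetime([
    "2022-01-01 12:00:00", "2022-01-01 13:00:00", "2022-01-01 14:00:00",
    "2022-01-01 15:00:00", "2022-01-02 12:00:00", "2022-01-02 13:00:00",
    "2022-01-02 14:00:00", "2022-01-02 15:00:00",
])
df_aux = pd.DataFrame({
    "Initial_Date": starts,
    # Assumption: every interval ends 59 min 59 s after it starts.
    "Final_Date": starts + pd.Timedelta(minutes=59, seconds=59),
    "A": [2, 3, 2, 5, 1, 5, 2, 3],
    "B": [0, 2, 2, 2, 1, 5, 3, 4],
    "C": [1, 4, 7, 2, 0, 0, 2, 1],
    "D": [2, 5, 0, 0, 3, 3, 1, 0],
})

print(df_main)
print(df_aux)
```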

My problem is that I need to add a new column to df_main holding the number of people who walked by in the hour before the event. For example, the first event happens at 13:45:23, so we go to df_aux and look up the previous hour (12:45:23). That falls in the first row, since 12:45:23 is between 12:00:00 and 12:59:59. In that time range column A has the value 2, so the new column in df_main, "People_prev_hour", takes the value 2.

Following the same logic, the full df_main would be:

Date	Description	People_prev_hour
2022-01-01 13:45:23	A	2
2022-01-01 14:22:00	C	4
2022-01-01 16:15:33	D	0
2022-01-01 16:21:22	E	NaN
2022-01-02 13:21:56	B	1
2022-01-02 14:45:41	B	5
2022-01-02 15:11:34	F	NaN

Every datetime in df_main will always have a matching hour in df_aux, but the Description values may not all exist as columns there. As seen in the full df_main, two rows have the Description values E and F, which are not in df_aux. In those cases the result must be NaN.

I can't think of a way to merge these two dfs into the desired output, since pd.merge relies on common columns, and I haven't managed to get anywhere with pd.melt or pd.pivot. Any help is much appreciated!

CodePudding user response:

The first idea is to use merge_asof, because the hourly intervals do not overlap:

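import numpy as np
import pandas as pd
# Shift each interval start forward by one hour so that an event matches the row
# for the hour before it (merge_asof searches backward; both keys must be sorted).
df1 = pd.merge_asof(df_main,
                    df_aux.assign(Initial_Date=df_aux['Initial_Date'] + pd.Timedelta(1, 'hour')),
                    left_on='Date',
                    right_on='Initial_Date')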
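
To see why the one-hour shift is needed, here is a small toy sketch, not part of the original data flow. The frames events and hourly are hypothetical and only reuse the first event and the first two hours of column A from the question. By default merge_asof matches each left key with the last right key that is less than or equal to it, so without the shift the event lands in its own hour.

```python
import pandas as pd

# Hypothetical toy frames: one event and two hourly rows for location A.
events = pd.DataFrame({"Date": pd.to_datetime(["2022-01-01 13:45:23"])})
hourly = pd.DataFrame({
    "Initial_Date": pd.to_datetime(["2022-01-01 12:00:00", "2022-01-01 13:00:00"]),
    "A": [2, 3],
})

# Unshifted: the last Initial_Date <= 13:45:23 is 13:00, the event's own hour.
same_hour = pd.merge_asof(events, hourly, left_on="Date", right_on="Initial_Date")

# Shifted by +1 hour: the 12:00 row now carries the key 13:00,
# so the backward search returns the counts of the previous hour.
shifted = hourly.assign(Initial_Date=hourly["Initial_Date"] + pd.Timedelta(hours=1))
prev_hour = pd.merge_asof(events, shifted, left_on="Date", right_on="Initial_Date")

print(same_hour["A"].iloc[0])  # count for 13:00-13:59
print(prev_hour["A"].iloc[0])  # count for 12:00-12:59
```

If some hours could be missing from df_aux, the tolerance parameter of merge_asof would be one way to avoid matching a much older row; the question states that the datetimes are always complete, so it is left out here.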

Then use indexing lookup:

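# Encode each Description as an integer code (idx) and collect the distinct labels (cols).
idx, cols = pd.factorize(df1['Description'])
# Keep one column per label (labels missing from df_aux, such as E, become NaN), then pick column idx[i] for row i.
df_main['People_prev_hour'] = df1.reindex(cols, axis=1).to_numpy()[np.arange(len(df1)), idx]
print(df_main)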
                 Date Description  People_prev_hour
0 2022-01-01 13:45:23           A               2.0
1 2022-01-01 14:22:00           C               4.0
2 2022-01-01 16:15:33           D               0.0
3 2022-01-01 16:21:22           E               NaN
4 2022-01-02 13:21:56           B               1.0
5 2022-01-02 14:45:41           B               5.0
6 2022-01-02 15:11:34           C               2.0
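
The indexing lookup can be easier to follow on a tiny frame. The sketch below uses a hypothetical frame named wide with made-up rows; it only illustrates how factorize, reindex and NumPy integer indexing combine, and is not the answer's exact data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: each row names the column it wants a value from.
wide = pd.DataFrame({
    "Description": ["A", "C", "E"],
    "A": [2, 3, 5],
    "C": [1, 4, 2],
})

# idx[i] is the integer code of row i's label; cols holds the distinct labels.
idx, cols = pd.factorize(wide["Description"])

# One column per distinct label, in code order. "E" is not a column of `wide`,
# so reindex creates it filled with NaN.
values = wide.reindex(cols, axis=1).to_numpy()

# For each row i, take the cell in column idx[i].
picked = values[np.arange(len(wide)), idx]
print(picked)  # values 2.0 and 4.0, then NaN for the unknown label E
```

Because the missing label turns into an all-NaN column rather than raising an error, the result is a float array, which is why People_prev_hour prints as 2.0 rather than 2.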

Another idea uses an IntervalIndex:

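# Shift both interval bounds forward by one hour; closed='both' includes both endpoints.
s = pd.IntervalIndex.from_arrays(df_aux.Initial_Date + pd.Timedelta(1, 'hour'),
                                 df_aux.Final_Date + pd.Timedelta(1, 'hour'), closed='both')

# Look up, for every event time, the shifted interval that contains it.
df1 = df_aux.set_index(s).loc[df_main.Date]

print(df1)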
                                                  Initial_Date  \
[2022-01-01 13:00:00, 2022-01-01 13:59:59] 2022-01-01 12:00:00   
[2022-01-01 14:00:00, 2022-01-01 14:59:59] 2022-01-01 13:00:00   
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:00:00   
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:00:00   
[2022-01-02 13:00:00, 2022-01-02 13:59:59] 2022-01-02 12:00:00   
[2022-01-02 14:00:00, 2022-01-02 14:59:59] 2022-01-02 13:00:00   
[2022-01-02 15:00:00, 2022-01-02 15:59:59] 2022-01-02 14:00:00   

                                                    Final_Date  A  B  C  D  
[2022-01-01 13:00:00, 2022-01-01 13:59:59] 2022-01-01 12:59:59  2  0  1  2  
[2022-01-01 14:00:00, 2022-01-01 14:59:59] 2022-01-01 13:59:59  3  2  4  5  
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:59:59  5  2  2  0  
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:59:59  5  2  2  0  
[2022-01-02 13:00:00, 2022-01-02 13:59:59] 2022-01-02 12:59:59  1  1  0  3  
[2022-01-02 14:00:00, 2022-01-02 14:59:59] 2022-01-02 13:59:59  5  5  0  3  
[2022-01-02 15:00:00, 2022-01-02 15:59:59] 2022-01-02 14:59:59  2  3  2  1

# Same lookup as before, now on the interval-indexed frame.
idx, cols = pd.factorize(df_main['Description'])
df_main['People_prev_hour'] = df1.reindex(cols, axis=1).to_numpy()[np.arange(len(df1)), idx]
print(df_main)
                 Date Description  People_prev_hour
0 2022-01-01 13:45:23           A               2.0
1 2022-01-01 14:22:00           C               4.0
2 2022-01-01 16:15:33           D               0.0
3 2022-01-01 16:21:22           E               NaN
4 2022-01-02 13:21:56           B               1.0
5 2022-01-02 14:45:41           B               5.0
6 2022-01-02 15:11:34           C               2.0
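
Since the question mentions pd.melt, here is a hedged sketch of a third route that reshapes df_aux to long form and then uses an ordinary left merge. It is not from the answer above. It assumes every interval starts exactly on the hour, so the matching row can be found by flooring (event time minus one hour) to the hour. The frames are a small subset of the question's data, and long_aux, keyed and out are illustrative names.

```python
import pandas as pd

# Subset of the question's data: three events, three hourly rows, locations A and B.
df_main = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-01 13:45:23", "2022-01-01 16:21:22",
                            "2022-01-02 14:45:41"]),
    "Description": ["A", "E", "B"],
})
df_aux = pd.DataFrame({
    "Initial_Date": pd.to_datetime(["2022-01-01 12:00:00", "2022-01-01 15:00:00",
                                    "2022-01-02 13:00:00"]),
    "A": [2, 5, 5],
    "B": [0, 2, 5],
})

# Long form: one row per (hour, location) pair instead of one column per location.
long_aux = df_aux.melt(id_vars="Initial_Date", var_name="Description",
                       value_name="People_prev_hour")

# Assumption: intervals start on the hour, so the previous hour's row is
# identified by the start of the hour containing (event time - 1 hour).
keyed = df_main.assign(
    Initial_Date=(df_main["Date"] - pd.Timedelta(hours=1)).dt.floor("h")
)

# A left merge keeps every event; locations absent from df_aux (E) get NaN.
out = keyed.merge(long_aux, on=["Initial_Date", "Description"], how="left")
out = out.drop(columns="Initial_Date")
print(out)
```

The trade-off is that melt multiplies the number of rows in df_aux by the number of location columns, while the two approaches above keep df_aux wide and do the column selection with NumPy indexing.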