Merge two dataframes in pandas with common info as columns or as cells

Time:12-24

I have two pandas DataFrames in Python, df_main and df_aux.

df_main is a table of events, with the datetime at which each event happened and a "Description" column that gives a coded location. It has the following structure:

Date Description
2022-01-01 13:45:23 A
2022-01-01 14:22:00 C
2022-01-01 16:15:33 D
2022-01-01 16:21:22 E
2022-01-02 13:21:56 B
2022-01-02 14:45:41 B
2022-01-02 15:11:34 C

df_aux is a table that gives the number of other events (say, people walking by between Initial_Date and Final_Date) at each location (A, B, C, D), with 1-hour granularity. The structure of df_aux is as follows:

Initial_Date Final_Date A B C D
2022-01-01 12:00:00 2022-01-01 12:59:59 2 0 1 2
2022-01-01 13:00:00 2022-01-01 13:59:59 3 2 4 5
2022-01-01 14:00:00 2022-01-01 14:59:59 2 2 7 0
2022-01-01 15:00:00 2022-01-01 15:59:59 5 2 2 0
2022-01-02 12:00:00 2022-01-02 12:59:59 1 1 0 3
2022-01-02 13:00:00 2022-01-02 13:59:59 5 5 0 3
2022-01-02 14:00:00 2022-01-02 14:59:59 2 3 2 1
2022-01-02 15:00:00 2022-01-02 15:59:59 3 4 1 0
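
For anyone who wants to reproduce the problem, the two tables above can be built roughly as follows. This is a minimal sketch: the values are copied from the listings, and Final_Date is derived here as Initial_Date plus 59 minutes 59 seconds, which matches the table but is an assumption about how the real data is produced.

```python
import pandas as pd

# Events, copied from the df_main listing above.
df_main = pd.DataFrame({
    "Date": pd.to_datetime([
        "2022-01-01 13:45:23", "2022-01-01 14:22:00", "2022-01-01 16:15:33",
        "2022-01-01 16:21:22", "2022-01-02 13:21:56", "2022-01-02 14:45:41",
        "2022-01-02 15:11:34",
    ]),
    "Description": ["A", "C", "D", "E", "B", "B", "C"],
})

# Hourly interval starts, copied from the df_aux listing above.
starts = pd.to_datetime([
    "2022-01-01 12:00:00", "2022-01-01 13:00:00", "2022-01-01 14:00:00",
    "2022-01-01 15:00:00", "2022-01-02 12:00:00", "2022-01-02 13:00:00",
    "2022-01-02 14:00:00", "2022-01-02 15:00:00",
])
df_aux = pd.DataFrame({
    "Initial_Date": starts,
    # Assumption: every interval ends 59 min 59 s after it starts.
    "Final_Date": starts + pd.Timedelta(minutes=59, seconds=59),
    "A": [2, 3, 2, 5, 1, 5, 2, 3],
    "B": [0, 2, 2, 2, 1, 5, 3, 4],
    "C": [1, 4, 7, 2, 0, 0, 2, 1],
    "D": [2, 5, 0, 0, 3, 3, 1, 0],
})

print(df_main)
print(df_aux)
```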

My problem is that I need to add a new column to df_main holding the number of people who walked by in the hour before each event. For example, the first event happens at 13:45:23, so we go to df_aux and look up the time one hour earlier (12:45:23). That falls in the first row, since 12:45:23 is between 12:00:00 and 12:59:59. In that time range column A has a value of 2, so the new df_main column, "People_prev_hour", takes the value 2.

Following the same logic, the full df_main would be:
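
The lookup for that first event can be spelled out step by step. The snippet below is only an illustration of the rule for a single row, using the first two rows of df_aux; the names event_time, location and in_range are made up for this sketch.

```python
import pandas as pd

# First two rows of df_aux, enough to resolve the first event.
df_aux = pd.DataFrame({
    "Initial_Date": pd.to_datetime(["2022-01-01 12:00:00", "2022-01-01 13:00:00"]),
    "Final_Date": pd.to_datetime(["2022-01-01 12:59:59", "2022-01-01 13:59:59"]),
    "A": [2, 3], "B": [0, 2], "C": [1, 4], "D": [2, 5],
})

event_time = pd.Timestamp("2022-01-01 13:45:23")  # first row of df_main
location = "A"                                    # its Description

# Step back one hour, then find the interval containing that instant.
prev = event_time - pd.Timedelta(hours=1)         # 2022-01-01 12:45:23
in_range = (df_aux["Initial_Date"] <= prev) & (prev <= df_aux["Final_Date"])

# Read the count from the column named after the location.
people_prev_hour = df_aux.loc[in_range, location].iloc[0]
print(people_prev_hour)  # 2
```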

Date Description People_prev_hour
2022-01-01 13:45:23 A 2
2022-01-01 14:22:00 C 4
2022-01-01 16:15:33 D 0
2022-01-01 16:21:22 E NaN
2022-01-02 13:21:56 B 1
2022-01-02 14:45:41 B 5
2022-01-02 15:11:34 F NaN

Datetimes will always be fully covered in both DataFrames, but the Description values may not be. As seen in the full df_main, two rows have the Description values E and F, which are not columns in df_aux. In those cases the result must be NaN.

I can't think of a way to merge these two DataFrames into the desired output, since pd.merge relies on common columns, and I haven't managed to get anywhere with pd.melt or pd.pivot. Any help is much appreciated!

Answer:

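The first idea is to use merge_asof, which works here because the hourly intervals do not overlap. Shifting Initial_Date forward by one hour makes each event match the row for the previous hour: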

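import pandas as pd

# Shift each interval start forward one hour so that an event matches
# the df_aux row covering the hour before it.
df1 = pd.merge_asof(df_main,
                    df_aux.assign(Initial_Date=df_aux['Initial_Date'] + pd.Timedelta(1, 'hour')),
                    left_on='Date',
                    right_on='Initial_Date')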

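Then use an indexing lookup to pick, for each row, the column named in Description:

import numpy as np
# Integer code per row, plus the unique Description labels in order of appearance
idx, cols = pd.factorize(df1['Description'])
# reindex adds an all-NaN column for labels missing from df_aux (here 'E')
df_main['People_prev_hour'] = df1.reindex(cols, axis=1).to_numpy()[np.arange(len(df1)), idx]
print(df_main)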
                 Date Description  People_prev_hour
0 2022-01-01 13:45:23           A               2.0
1 2022-01-01 14:22:00           C               4.0
2 2022-01-01 16:15:33           D               0.0
3 2022-01-01 16:21:22           E               NaN
4 2022-01-02 13:21:56           B               1.0
5 2022-01-02 14:45:41           B               5.0
6 2022-01-02 15:11:34           C               2.0
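
To see what the factorize-based lookup is doing, here is a small standalone sketch on a made-up frame (the merged DataFrame and its values are hypothetical, not the data above). pd.factorize gives each row an integer code for its label, reindex lines the columns up in the same order as those codes, and NumPy integer indexing then picks one cell per row.

```python
import numpy as np
import pandas as pd

# Hypothetical merged frame: one row per event, counts for locations A and B only.
merged = pd.DataFrame({
    "Description": ["A", "B", "E", "B"],
    "A": [2, 3, 5, 1],
    "B": [0, 2, 2, 4],
})

# Codes per row and the unique labels in order of first appearance.
idx, cols = pd.factorize(merged["Description"])
# idx  -> [0, 1, 2, 1]
# cols -> ['A', 'B', 'E']

# Column j of `values` now corresponds to label cols[j];
# 'E' has no counts, so reindex fills that column with NaN.
values = merged.reindex(cols, axis=1).to_numpy()

# For row i, take the cell in column idx[i].
picked = values[np.arange(len(merged)), idx]
print(picked)  # values: 2.0, 2.0, nan, 4.0
```

Because the missing label becomes an all-NaN column, rows whose Description has no matching column end up as NaN without any special casing.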

Another idea uses an IntervalIndex built from the intervals shifted forward by one hour:

s = pd.IntervalIndex.from_arrays(df_aux.Initial_Date + pd.Timedelta(1, 'hour'),
                                 df_aux.Final_Date + pd.Timedelta(1, 'hour'), closed='both')

df1 = df_aux.set_index(s).loc[df_main.Date]

print (df1)
                                                  Initial_Date  \
[2022-01-01 13:00:00, 2022-01-01 13:59:59] 2022-01-01 12:00:00   
[2022-01-01 14:00:00, 2022-01-01 14:59:59] 2022-01-01 13:00:00   
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:00:00   
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:00:00   
[2022-01-02 13:00:00, 2022-01-02 13:59:59] 2022-01-02 12:00:00   
[2022-01-02 14:00:00, 2022-01-02 14:59:59] 2022-01-02 13:00:00   
[2022-01-02 15:00:00, 2022-01-02 15:59:59] 2022-01-02 14:00:00   

                                                    Final_Date  A  B  C  D  
[2022-01-01 13:00:00, 2022-01-01 13:59:59] 2022-01-01 12:59:59  2  0  1  2  
[2022-01-01 14:00:00, 2022-01-01 14:59:59] 2022-01-01 13:59:59  3  2  4  5  
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:59:59  5  2  2  0  
[2022-01-01 16:00:00, 2022-01-01 16:59:59] 2022-01-01 15:59:59  5  2  2  0  
[2022-01-02 13:00:00, 2022-01-02 13:59:59] 2022-01-02 12:59:59  1  1  0  3  
[2022-01-02 14:00:00, 2022-01-02 14:59:59] 2022-01-02 13:59:59  5  5  0  3  
[2022-01-02 15:00:00, 2022-01-02 15:59:59] 2022-01-02 14:59:59  2  3  2  1

# Same per-row column lookup as before, now against the interval-indexed rows
idx, cols = pd.factorize(df_main['Description'])
df_main['People_prev_hour'] = df1.reindex(cols, axis=1).to_numpy()[np.arange(len(df1)), idx]
print(df_main)
                 Date Description  People_prev_hour
0 2022-01-01 13:45:23           A               2.0
1 2022-01-01 14:22:00           C               4.0
2 2022-01-01 16:15:33           D               0.0
3 2022-01-01 16:21:22           E               NaN
4 2022-01-02 13:21:56           B               1.0
5 2022-01-02 14:45:41           B               5.0
6 2022-01-02 15:11:34           C               2.0
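
Since the question mentions pd.melt, one more possibility, not part of the answer above, is to reshape df_aux to long format and do an ordinary left merge on the hour and the location. This is a rough sketch that assumes every interval in df_aux starts exactly on the hour; the names keys, long_aux and result are introduced only for this example.

```python
import pandas as pd

df_main = pd.DataFrame({
    "Date": pd.to_datetime([
        "2022-01-01 13:45:23", "2022-01-01 14:22:00", "2022-01-01 16:15:33",
        "2022-01-01 16:21:22", "2022-01-02 13:21:56", "2022-01-02 14:45:41",
        "2022-01-02 15:11:34",
    ]),
    "Description": ["A", "C", "D", "E", "B", "B", "C"],
})
starts = pd.to_datetime([
    "2022-01-01 12:00:00", "2022-01-01 13:00:00", "2022-01-01 14:00:00",
    "2022-01-01 15:00:00", "2022-01-02 12:00:00", "2022-01-02 13:00:00",
    "2022-01-02 14:00:00", "2022-01-02 15:00:00",
])
df_aux = pd.DataFrame({
    "Initial_Date": starts,
    "Final_Date": starts + pd.Timedelta(minutes=59, seconds=59),
    "A": [2, 3, 2, 5, 1, 5, 2, 3],
    "B": [0, 2, 2, 2, 1, 5, 3, 4],
    "C": [1, 4, 7, 2, 0, 0, 2, 1],
    "D": [2, 5, 0, 0, 3, 3, 1, 0],
})

# Start of the hourly bucket containing (event time - 1 hour).
# Assumption: df_aux intervals begin exactly on the hour.
keys = df_main.assign(
    Initial_Date=(df_main["Date"] - pd.Timedelta(hours=1)).dt.floor("h")
)

# Long format: one row per (hour, location) pair.
long_aux = df_aux.melt(
    id_vars=["Initial_Date", "Final_Date"],
    var_name="Description",
    value_name="People_prev_hour",
)

# Left merge keeps every event; unmatched (hour, location) pairs become NaN.
result = (
    keys.merge(long_aux.drop(columns="Final_Date"),
               on=["Initial_Date", "Description"], how="left")
        .drop(columns="Initial_Date")
)
print(result)
```

One difference from merge_asof without a tolerance: an exact-key merge gives NaN when the previous hour is missing from df_aux, rather than falling back to the closest earlier row.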