How to apply Self Join on Pandas Data Frame-CodePudding

I am stuck basically joining a Pandas series with a DataFrame.

Let's generate some dummy data using code below

test = pd.DataFrame({'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05',
                              '2021-01-06', '2021-01-07', '2021-01-08', '2021-01-09', '2021-01-10',
                              '2021-01-11', '2021-01-12', '2021-01-13', '2021-01-14'],
                     'New_Date': [np.nan, '2021-01-01', '2021-01-01', '2021-01-04', '2021-01-03',
                                    '2021-01-06', '2021-01-08', '2021-01-08', '2021-01-09', '2021-01-11',
                                    '2021-01-11', '2021-01-13', '2021-01-13', np.nan],
                     'Price': [1, 1, 5, 3, 4, 3, 2, 5, 6, 4, 3, 2, 1, 7]})
test['Date'] = pd.to_datetime(test['Date'])
test['New_Date'] = pd.to_datetime(test['New_Date'])
test.set_index('Date', inplace=True)

Actual Df

 ------------ ------------ ------- 
|    Date    |  New_Date  | Price |
 ------------ ------------ ------- 
| 01/01/2021 | NaT        |     1 |
| 02/01/2021 | 01/01/2021 |     1 |
| 03/01/2021 | 01/01/2021 |     5 |
| 04/01/2021 | 04/01/2021 |     3 |
| 05/01/2021 | 03/01/2021 |     4 |
| 06/01/2021 | 06/01/2021 |     3 |
| 07/01/2021 | 08/01/2021 |     2 |
| 08/01/2021 | 08/01/2021 |     5 |
| 09/01/2021 | 09/01/2021 |     6 |
| 10/01/2021 | 11/01/2021 |     4 |
| 11/01/2021 | 11/01/2021 |     3 |
| 12/01/2021 | 13/01/2021 |     2 |
| 13/01/2021 | 13/01/2021 |     1 |
| 14/01/2021 | NaT        |     7 |
 ------------ ------------ -------

Desired Output

 ------------ ------------ ------- ----------- 
|    Date    |  New_Date  | Price | New_Price |
 ------------ ------------ ------- ----------- 
| 01/01/2021 | NaT        |     1 | NaN       |
| 02/01/2021 | 01/01/2021 |     1 | 1         |
| 03/01/2021 | 01/01/2021 |     5 | 1         |
| 04/01/2021 | 04/01/2021 |     3 | 3         |
| 05/01/2021 | 03/01/2021 |     4 | 5         |
| 06/01/2021 | 06/01/2021 |     3 | 3         |
| 07/01/2021 | 08/01/2021 |     2 | 5         |
| 08/01/2021 | 08/01/2021 |     5 | 5         |
| 09/01/2021 | 09/01/2021 |     6 | 6         |
| 10/01/2021 | 11/01/2021 |     4 | 3         |
| 11/01/2021 | 11/01/2021 |     3 | 3         |
| 12/01/2021 | 13/01/2021 |     2 | 1         |
| 13/01/2021 | 13/01/2021 |     1 | 1         |
| 14/01/2021 | NaT        |     7 | NaN       |
 ------------ ------------ ------- -----------

I want to create a column New_Price by using New_Date as an index and Joining with Date to get Price which will be named as New_Price.

As per this link

I've tried the solution below:

test['New_Price'] = test['Price'][test['New_Date']].values

The above solution fails due to NaT, this is not basically an actual join so I tried an another way

test.join(test.drop('New_Date', 1), on='New_Date', rsuffix='_y')

This solves the problem as I just need to rename Price_y to New_Price but if there are 20 columns in test df how can I keep columns from left df and only price column named as New_Price which would come from right df. Is there any elegant way to achieve this?

CodePudding user response：

As the dataframe test has already set Date as an index, you can readily use it as a mapping table for looking up New Date from index Date. Then, based on the matching New_Date to Date, we can get the corresponding Price as New_Price (for New_Date).

This can be achieved by using Series.map(), as follows:

test['New_Price'] = test['New_Date'].map(test['Price'])

Result:

print(test)

             New_Date  Price  New_Price
Date                                   
2021-01-01        NaT      1        NaN
2021-01-02 2021-01-01      1        1.0
2021-01-03 2021-01-01      5        1.0
2021-01-04 2021-01-04      3        3.0
2021-01-05 2021-01-03      4        5.0
2021-01-06 2021-01-06      3        3.0
2021-01-07 2021-01-08      2        5.0
2021-01-08 2021-01-08      5        5.0
2021-01-09 2021-01-09      6        6.0
2021-01-10 2021-01-11      4        3.0
2021-01-11 2021-01-11      3        3.0
2021-01-12 2021-01-13      2        1.0
2021-01-13 2021-01-13      1        1.0
2021-01-14        NaT      7        NaN

CodePudding user response：

Let's try join only on the Price column aligned on New_Date:

new_df = test.join(test['Price'].rename('New_Price'), on='New_Date')

new_df

             New_Date  Price  New_Price
Date                                   
2021-01-01        NaT      1        NaN
2021-01-02 2021-01-01      1        1.0
2021-01-03 2021-01-01      5        1.0
2021-01-04 2021-01-04      3        3.0
2021-01-05 2021-01-03      4        5.0
2021-01-06 2021-01-06      3        3.0
2021-01-07 2021-01-08      2        5.0
2021-01-08 2021-01-08      5        5.0
2021-01-09 2021-01-09      6        6.0
2021-01-10 2021-01-11      4        3.0
2021-01-11 2021-01-11      3        3.0
2021-01-12 2021-01-13      2        1.0
2021-01-13 2021-01-13      1        1.0
2021-01-14        NaT      7        NaN