For the sample data frame that can be derived using the code below, I want to update the column Offset_Date so that any date in Offset_Date that is not present in column Date is replaced with the last available value in column Date.
data = {"date": ['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05',
'2021-01-07', '2021-01-09', '2021-01-10', '2021-01-11'],
"offset_date": ['2021-01-02', '2021-01-04', '2021-01-05',
'2021-01-06', '2021-01-08', '2021-01-10',
'2021-01-11', '2021-01-12']}
test_df = pd.DataFrame(data)
test_df['date'] = pd.to_datetime(test_df['date'])
test_df['offset_date'] = pd.to_datetime(test_df['offset_date'])
To explain further: in the 1st row of the above data frame, the date 2021-01-02 is not within column Date, so I want to replace that value with the last available value in column Date, i.e. 2021-01-01.
I want a vectorized approach, so I tried the following, which led to incorrect results.
test_df['offset_date_upd'] = np.where(test_df['offset_date'] in test_df['date'].values,
test_df['offset_date'],
test_df[test_df['date'] <= test_df['offset_date']].values.max())
How can I get the below desired output using a vectorized approach?
Desired Output
------------ -------------
| Date | Offset_Date |
------------ -------------
| 2021-01-01 | 2021-01-01 |
| 2021-01-03 | 2021-01-04 |
| 2021-01-04 | 2021-01-05 |
| 2021-01-05 | 2021-01-05 |
| 2021-01-07 | 2021-01-07 |
| 2021-01-09 | 2021-01-10 |
| 2021-01-10 | 2021-01-11 |
| 2021-01-11 | 2021-01-11 |
------------ -------------
CodePudding user response:
IIUC, you can use isin with where (or, equivalently, mask / fillna):
test_df['offset_date'] = (test_df['offset_date']
.where(test_df['offset_date'].isin(test_df['date']),
test_df['date'])
)
output:
date offset_date
0 2021-01-01 2021-01-01
1 2021-01-03 2021-01-04
2 2021-01-04 2021-01-05
3 2021-01-05 2021-01-05
4 2021-01-07 2021-01-07
5 2021-01-09 2021-01-10
6 2021-01-10 2021-01-11
7 2021-01-11 2021-01-11
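The same replacement can also be written with mask or with where + fillna; a minimal sketch of both equivalent forms:
# mask: overwrite the offset dates that are NOT present in `date`
test_df['offset_date'] = test_df['offset_date'].mask(
    ~test_df['offset_date'].isin(test_df['date']), test_df['date'])

# or where + fillna: blank out the non-matching dates, then fill from `date`
test_df['offset_date'] = (test_df['offset_date']
                          .where(test_df['offset_date'].isin(test_df['date']))
                          .fillna(test_df['date']))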
CodePudding user response:
The below approach should work for your case:
test_df["offset_date"] = np.where(
test_df.offset_date.isin(test_df.date),
test_df.offset_date,
test_df.date
)
CodePudding user response:
This is the purpose of Pandas' merge_asof function.
We have to be specific about which columns are going where. This works as a left join, and in this case we want 'offset_date' to represent the "left". Then, for each value in 'offset_date', we look for the greatest value in 'date' that does not exceed that value in 'offset_date'.
The one gotcha in this approach is that both columns need to be sorted. If that's an issue with the real data, see the sketch at the bottom of this answer.
new_offset_date = pd.merge_asof(
test_df[['offset_date']], # limit `test_df` to just the column I need for the merge
test_df[['date']], # limit `test_df` to just the other column I need
left_on=['offset_date'], # name the columns since they have different names
right_on=['date'] # name the other column as well
)['date']
Now the new dates are in a pandas.Series. We can see the result with the assign method, which does NOT overwrite your dataframe; you'll need to use test_df = test_df.assign(offset_date=new_offset_date) to actually persist the new dates in the dataframe.
test_df.assign(offset_date=new_offset_date)
date offset_date
0 2021-01-01 2021-01-01
1 2021-01-03 2021-01-04
2 2021-01-04 2021-01-05
3 2021-01-05 2021-01-05
4 2021-01-07 2021-01-07
5 2021-01-09 2021-01-10
6 2021-01-10 2021-01-11
7 2021-01-11 2021-01-11
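As mentioned above, merge_asof requires both key columns to be sorted. If the real data isn't already sorted, a minimal sketch (the left / right / merged names here are just for illustration) that sorts each side on its key and then restores the original row order:
# sort each side on its merge key, keeping the original row labels
# so the result can be put back in the original order afterwards
left = test_df[['offset_date']].sort_values('offset_date').reset_index()
right = test_df[['date']].sort_values('date')

merged = pd.merge_asof(left, right, left_on='offset_date', right_on='date')

# restore the original row order using the saved index
new_offset_date = merged.set_index('index')['date'].reindex(test_df.index)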