I have a dataframe that looks like this:
Category | Date | Value | PrevDate1 | PrevDate2 |
---|---|---|---|---|
A | 2022-07-15 | 4 | 2022-07-01 | 2022-07-02 |
B | 2022-07-16 | 2 | 2022-07-08 | 2022-07-09 |
B | 2022-07-17 | 6 | 2022-07-05 | 2022-07-16 |
A | 2022-07-18 | 7 | 2022-07-10 | 2022-07-15 |
A | 2022-07-19 | 9 | 2022-07-18 | 2022-07-15 |
I want to add a column that selects Value from another part of the dataframe. If PrevDate1 is more recent than PrevDate2, it should return Value where PrevDate1 is equal to Date. Or equal to PrevDate2 if that is more recent. Each date is unique, so I should just be able to match the previous date with the date column and get the associated value without referring to the category.
The output would look like this:
Category | Date | Value | PrevDate1 | PrevDate2 | PrevVal |
---|---|---|---|---|---|
A | 2022-07-15 | 4 | 2022-07-01 | 2022-07-02 | X |
B | 2022-07-16 | 2 | 2022-07-08 | 2022-07-09 | Y |
B | 2022-07-17 | 6 | 2022-07-05 | 2022-07-16 | 2 |
A | 2022-07-18 | 7 | 2022-07-10 | 2022-07-15 | 4 |
A | 2022-07-19 | 9 | 2022-07-18 | 2022-07-15 | 7 |
I can do this with a loop, but this is slow and I keep seeing that there should be a vectorized solution for everything in pandas. I've tried variations of:
df['PrevVal'] = np.where(df['PrevDate1'] > df['PrevDate2'], df['Value'][df['Date'] == df['PrevDate1'], df['Value'][df['Date'] == df['PrevDate2'])
But this doesn't work, and I can see why it wouldn't. I've been looking for the python equivalent of a correlated subquery in SQL, but haven't been able to find anything.
The SQL code looks something like this:
update table set PrevVal = (select case when PrevDate1 > PrevDate2
then (select Value from table where Date = tb.PrevDate1))
from table tb
This tells it to search Date in the whole table for PrevDate1 in a given row.
CodePudding user response:
Example code
data = {'Category': {0: 'A', 1: 'B', 2: 'B', 3: 'A', 4: 'A'},
'Date': {0: '2022-07-15', 1: '2022-07-16', 2: '2022-07-17', 3: '2022-07-18', 4: '2022-07-19'},
'Value': {0: 4, 1: 2, 2: 6, 3: 7, 4: 9},
'PrevDate1': {0: '2022-07-01', 1: '2022-07-08', 2: '2022-07-05', 3: '2022-07-10', 4: '2022-07-18'},
'PrevDate2': {0: '2022-07-02', 1: '2022-07-09', 2: '2022-07-16', 3: '2022-07-15', 4: '2022-07-15'}}
df = pd.DataFrame(data)
Process
make mapper(it is series) for mapping
mapper = df['Value'].set_axis(df['Date'])
mapper
Date
2022-07-15 4
2022-07-16 2
2022-07-17 6
2022-07-18 7
2022-07-19 9
Name: Value, dtype: int64
mapping by mapper to maximum(most recently) prev date
df[['PrevDate1', 'PrevDate2']].max(axis=1).map(mapper)
result:
0 NaN
1 NaN
2 2.0
3 4.0
4 7.0
Name: PrevDate2, dtype: float64
make result to PrevVal column
Full Code and Output
mapper = df['Value'].set_axis(df['Date'])
df.assign(PrevVal=df[['PrevDate1', 'PrevDate2']].max(axis=1).map(mapper))
output:
Category Date Value PrevDate1 PrevDate2 PrevVal
0 A 2022-07-15 4 2022-07-01 2022-07-02 NaN
1 B 2022-07-16 2 2022-07-08 2022-07-09 NaN
2 B 2022-07-17 6 2022-07-05 2022-07-16 2.0
3 A 2022-07-18 7 2022-07-10 2022-07-15 4.0
4 A 2022-07-19 9 2022-07-18 2022-07-15 7.0
If you need to be different for each category, use merge
instead map