Home > Mobile >  Vectorized solution for getting a value from another row in the same dataframe
Vectorized solution for getting a value from another row in the same dataframe

Time:12-11

I have a dataframe that looks like this:

Category Date Value PrevDate1 PrevDate2
A 2022-07-15 4 2022-07-01 2022-07-02
B 2022-07-16 2 2022-07-08 2022-07-09
B 2022-07-17 6 2022-07-05 2022-07-16
A 2022-07-18 7 2022-07-10 2022-07-15
A 2022-07-19 9 2022-07-18 2022-07-15

I want to add a column that selects Value from another part of the dataframe. If PrevDate1 is more recent than PrevDate2, it should return Value where PrevDate1 is equal to Date. Or equal to PrevDate2 if that is more recent. Each date is unique, so I should just be able to match the previous date with the date column and get the associated value without referring to the category.

The output would look like this:

Category Date Value PrevDate1 PrevDate2 PrevVal
A 2022-07-15 4 2022-07-01 2022-07-02 X
B 2022-07-16 2 2022-07-08 2022-07-09 Y
B 2022-07-17 6 2022-07-05 2022-07-16 2
A 2022-07-18 7 2022-07-10 2022-07-15 4
A 2022-07-19 9 2022-07-18 2022-07-15 7

I can do this with a loop, but this is slow and I keep seeing that there should be a vectorized solution for everything in pandas. I've tried variations of:

df['PrevVal'] = np.where(df['PrevDate1'] > df['PrevDate2'], df['Value'][df['Date'] == df['PrevDate1'], df['Value'][df['Date'] == df['PrevDate2'])

But this doesn't work, and I can see why it wouldn't. I've been looking for the python equivalent of a correlated subquery in SQL, but haven't been able to find anything.

The SQL code looks something like this:

update table set PrevVal = (select case when PrevDate1 > PrevDate2
then (select Value from table where Date = tb.PrevDate1)) 
from table tb

This tells it to search Date in the whole table for PrevDate1 in a given row.

CodePudding user response:

Example code

data = {'Category': {0: 'A', 1: 'B', 2: 'B', 3: 'A', 4: 'A'},
        'Date': {0: '2022-07-15', 1: '2022-07-16', 2: '2022-07-17', 3: '2022-07-18', 4: '2022-07-19'},
        'Value': {0: 4, 1: 2, 2: 6, 3: 7, 4: 9},
        'PrevDate1': {0: '2022-07-01', 1: '2022-07-08', 2: '2022-07-05', 3: '2022-07-10', 4: '2022-07-18'},
        'PrevDate2': {0: '2022-07-02', 1: '2022-07-09', 2: '2022-07-16', 3: '2022-07-15', 4: '2022-07-15'}}
df = pd.DataFrame(data)

Process

make mapper(it is series) for mapping

mapper = df['Value'].set_axis(df['Date'])

mapper

Date
2022-07-15    4
2022-07-16    2
2022-07-17    6
2022-07-18    7
2022-07-19    9
Name: Value, dtype: int64

mapping by mapper to maximum(most recently) prev date

df[['PrevDate1', 'PrevDate2']].max(axis=1).map(mapper)

result:

0   NaN
1   NaN
2   2.0
3   4.0
4   7.0
Name: PrevDate2, dtype: float64

make result to PrevVal column



Full Code and Output

mapper = df['Value'].set_axis(df['Date'])
df.assign(PrevVal=df[['PrevDate1', 'PrevDate2']].max(axis=1).map(mapper))

output:

    Category    Date        Value   PrevDate1   PrevDate2   PrevVal
0   A           2022-07-15  4       2022-07-01  2022-07-02  NaN
1   B           2022-07-16  2       2022-07-08  2022-07-09  NaN
2   B           2022-07-17  6       2022-07-05  2022-07-16  2.0
3   A           2022-07-18  7       2022-07-10  2022-07-15  4.0
4   A           2022-07-19  9       2022-07-18  2022-07-15  7.0

If you need to be different for each category, use merge instead map

  • Related