How to get the timestamp of the next row meeting a condition after a specific row in pandas-CodePudding

I have got a df with columns 't' for time, 'first' and 'second'. (Here I used numbers in the t column for simplicity but they will be datetime objects or date strings)

t   first   second
1   grey    red
2   green   red
3   red     red
4   grey    green
5   green   red
6   grey    green
7   green   red
8   red     red

It can be created like this:

import pandas as pd

dfx = pd.DataFrame(
    {
        'time': [1,2,3,4,5,6,7,8],
        'first': ['grey', 'green', 'red', 'grey', 'green', 'grey', 'green', 'red'], 
        'second': ['red', 'red', 'red', 'green', 'red', 'green', 'red', 'red']
    }
)

I need to select rows where first equals to green and then add the next time where second equals to green also. The resulting df would look like this:

t   first   t_second
2   green   4
5   green   6
7   green   NaN

How could I achieve this?

I found a similar question here but it deals with Boolean values. I didn't perfectly understand the answer but to my understanding it does not work in a similar manner with multiple categorical values. Also I cannot convert to boolean since I will be having multiple options in the color categories.

CodePudding user response：

Let's start by one-hot encoding the second column:

>>> pd.get_dummies(df.set_index("t")["second"])

   green  red
t            
1      0    1
2      0    1
3      0    1
4      1    0
5      0    1
6      1    0
7      0    1
8      0    1

And then multiply green and red with t so that we essentially break t into green and red:

>>> _ * df["t"].to_numpy()[:, None]

   green  red
t            
1      0    1
2      0    2
3      0    3
4      4    0
5      0    5
6      6    0
7      0    7
8      0    8

Now if we fill the zeros with the next non-zero value, we will get the desired result.

>>> # fill 0 with the next non-zero value #

   green  red
t            
1      4    1
2      4    2
3      4    3
4      4    5
5      6    5
6      6    7
7    nan    7
8    nan    8

This means that as of t = 1, the next appearance of green is at t = 4. We now only need to join this to the original dataframe to get what we want.

Code:

tmp = (
    pd.get_dummies(df.set_index("t")["second"])
    .mul(df["t"].to_numpy()[:, None])
    .replace(0, np.nan)
    .bfill()
    .rename_axis(columns="second")
    .stack()
    .rename("t_second")
)

df.merge(tmp, how="left", left_on=["t", "first"], right_on=["t", "second"])

This assumes that t != 0, which is likely the case for your real data.