I have got a df with columns 't' for time, 'first' and 'second'. (Here I used numbers in the t column for simplicity but they will be datetime objects or date strings)
t first second
1 grey red
2 green red
3 red red
4 grey green
5 green red
6 grey green
7 green red
8 red red
It can be created like this:
import pandas as pd
dfx = pd.DataFrame(
{
'time': [1,2,3,4,5,6,7,8],
'first': ['grey', 'green', 'red', 'grey', 'green', 'grey', 'green', 'red'],
'second': ['red', 'red', 'red', 'green', 'red', 'green', 'red', 'red']
}
)
I need to select rows where first equals to green and then add the next time where second equals to green also. The resulting df would look like this:
t first t_second
2 green 4
5 green 6
7 green NaN
How could I achieve this?
I found a similar question here but it deals with Boolean values. I didn't perfectly understand the answer but to my understanding it does not work in a similar manner with multiple categorical values. Also I cannot convert to boolean since I will be having multiple options in the color categories.
CodePudding user response:
Let's start by one-hot encoding the second
column:
>>> pd.get_dummies(df.set_index("t")["second"])
green red
t
1 0 1
2 0 1
3 0 1
4 1 0
5 0 1
6 1 0
7 0 1
8 0 1
And then multiply green
and red
with t
so that we essentially break t
into green
and red
:
>>> _ * df["t"].to_numpy()[:, None]
green red
t
1 0 1
2 0 2
3 0 3
4 4 0
5 0 5
6 6 0
7 0 7
8 0 8
Now if we fill the zeros with the next non-zero value, we will get the desired result.
>>> # fill 0 with the next non-zero value #
green red
t
1 4 1
2 4 2
3 4 3
4 4 5
5 6 5
6 6 7
7 nan 7
8 nan 8
This means that as of t = 1
, the next appearance of green
is at t = 4
. We now only need to join this to the original dataframe to get what we want.
Code:
tmp = (
pd.get_dummies(df.set_index("t")["second"])
.mul(df["t"].to_numpy()[:, None])
.replace(0, np.nan)
.bfill()
.rename_axis(columns="second")
.stack()
.rename("t_second")
)
df.merge(tmp, how="left", left_on=["t", "first"], right_on=["t", "second"])
This assumes that t != 0
, which is likely the case for your real data.