Pandas: Creating multiple indicator columns after condition with dates

Time:04-29

So I have a data set with about 70,000 data points, and I'm trying to test out some code on a sample data set to make sure it will work on the large one. The sample data set follows this format:

import numpy as np
import pandas as pd
df = pd.DataFrame({
   'cond': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B','B', 'B', 'B', 'B', 'B','B','B'],
   'time':  ['2009-07-09 15:00:00', 
'2009-07-09 18:33:00',
'2009-07-09 20:55:00',
'2009-07-10 00:01:00',
'2009-07-10 09:00:00',
'2009-07-10 15:00:00',
'2009-07-10 18:00:00',
'2009-07-11 00:01:00',
'2009-07-12 03:10:00',
'2009-07-09 06:00:00',
'2009-07-10 15:00:00',
'2009-07-11 18:00:00',
'2009-07-11 21:00:00',
'2009-07-12 00:30:00',
'2009-07-12 12:05:00',
'2009-07-12 15:00:00',
'2009-07-13 21:00:00',
'2009-07-14 00:01:00'],
   'Score': [0.0, 1.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, 0.0, -1.0, 0.0, 0.0],
})
print(df)

I'm essentially trying to create 2 indicator columns. The first indicator column follows the rule that for each condition (A and B), once I have a score of -1, I should indicate that row as "1" for the rest of that condition. The second indicator column should indicate for each row whether at least 24 hours has passed since the last score of -1. Thus the final result should look something like:

   cond                 time  Score  Indicator 1    Indicator 2
0     A  2009-07-09 15:00:00    0.0        0             0
1     A  2009-07-09 18:33:00    1.0        0             0
2     A  2009-07-09 20:55:00    0.0        0             0
3     A  2009-07-10 00:01:00    0.0        0             0
4     A  2009-07-10 09:00:00    0.0        0             0
5     A  2009-07-10 15:00:00   -1.0        1             0
6     A  2009-07-10 18:00:00    0.0        1             0
7     A  2009-07-11 00:01:00    0.0        1             0
8     A  2009-07-12 03:10:00    1.0        1             1
9     B  2009-07-09 06:00:00    0.0        0             0
10    B  2009-07-10 15:00:00   -1.0        1             0
11    B  2009-07-11 18:00:00    0.0        1             1
12    B  2009-07-11 21:00:00    1.0        1             1
13    B  2009-07-12 00:30:00    0.0        1             1
14    B  2009-07-12 12:05:00    0.0        1             1
15    B  2009-07-12 15:00:00   -1.0        1             0
16    B  2009-07-13 21:00:00    0.0        1             1
17    B  2009-07-14 00:01:00    0.0        1             1

This is in a similar realm to the question I asked yesterday about Indicator 1. However, my large data set has so many conditions (700) that I need help applying the Indicator 1 solution when it's not feasible to write out every cond value individually. For Indicator 2, I was trying a rolling window function, but all the rolling window examples I saw compute rolling sums or rolling means, which is not what I'm trying to compute here, so I'm unsure whether what I want can be done with a rolling window.

CodePudding user response:

Try:

# convert the time strings to datetimes so they can be compared and subtracted
df["time"] = pd.to_datetime(df["time"])

# get the first time the score is -1 for each cond
first = df["cond"].map(df[df["Score"].eq(-1)].groupby("cond")["time"].min())

# get the most recent time that the score is -1 (forward-filled down the whole frame, not per cond)
recent = df.loc[df["Score"].eq(-1), "time"].reindex(df.index, method="ffill")

# check that the time is at or after the first -1
df["Indicator 1"] = df["time"].ge(first).astype(int)

# check that at least 1 day has passed since the most recent -1
df["Indicator 2"] = df["time"].sub(recent).dt.days.ge(1).astype(int)

>>> df
   cond                time  Score  Indicator 1  Indicator 2
0     A 2009-07-09 15:00:00    0.0            0            0
1     A 2009-07-09 18:33:00    1.0            0            0
2     A 2009-07-09 20:55:00    0.0            0            0
3     A 2009-07-10 00:01:00    0.0            0            0
4     A 2009-07-10 09:00:00    0.0            0            0
5     A 2009-07-10 15:00:00   -1.0            1            0
6     A 2009-07-10 18:00:00    0.0            1            0
7     A 2009-07-11 00:01:00    0.0            1            0
8     A 2009-07-12 03:10:00    1.0            1            1
9     B 2009-07-09 06:00:00    0.0            0            0
10    B 2009-07-10 15:00:00   -1.0            1            0
11    B 2009-07-11 18:00:00    0.0            1            1
12    B 2009-07-11 21:00:00    1.0            1            1
13    B 2009-07-12 00:30:00    0.0            1            1
14    B 2009-07-12 12:05:00    0.0            1            1
15    B 2009-07-12 15:00:00   -1.0            1            0
16    B 2009-07-13 21:00:00    0.0            1            1
17    B 2009-07-14 00:01:00    0.0            1            1
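
One thing to watch in the snippet above: the reindex(..., method="ffill") step fills down the whole frame, so the last -1 of one cond can carry over into the first rows of the next cond. On the sample data this happens not to matter, because B's first row is earlier than A's -1. Below is a minimal sketch of a per-cond forward fill, under the assumption that rows before the first -1 of a cond should always get 0. The frame `toy` and its values are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical toy data: cond B starts more than a day after A's last -1
toy = pd.DataFrame({
    "cond": ["A", "A", "B", "B", "B"],
    "time": pd.to_datetime([
        "2009-07-09 15:00:00",
        "2009-07-10 15:00:00",
        "2009-07-12 06:00:00",
        "2009-07-12 09:00:00",
        "2009-07-13 12:00:00",
    ]),
    "Score": [0.0, -1.0, 0.0, -1.0, 0.0],
})

is_neg = toy["Score"].eq(-1)

# keep the timestamp only on -1 rows, then forward fill within each cond
recent = toy["time"].where(is_neg).groupby(toy["cond"]).ffill()

# NaT (no earlier -1 in this cond) compares as False, so those rows get 0
toy["Indicator 2"] = toy["time"].sub(recent).ge(pd.Timedelta(days=1)).astype(int)
print(toy)
```

With the frame-wide fill, the first B row here would pick up A's -1 from 2009-07-10 15:00 and be flagged; with the grouped fill it stays 0.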

CodePudding user response:

A simple approach IMO, using cummax for the first indicator, and a diff from the first value per group combined with a mask for the second:

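# convert to datetime first so time differences can be computed
df['time'] = pd.to_datetime(df['time'])

# indicator 1: 1 from the first -1 onwards within each cond
df['Indicator 1'] = df['Score'].eq(-1).astype(int).groupby(df['cond']).cummax()

# indicator 2
# number the sub-groups within each cond; a new one starts at each -1 (0 = before any -1)
m1 = df['Score'].eq(-1).groupby(df['cond']).cumsum()
# time of the first row of each sub-group (the -1 row once m1 > 0)
start = df.groupby(['cond', m1])['time'].transform('first')
# have at least 24h passed since the sub-group start?
m2 = df['time'].sub(start).ge(pd.Timedelta('24h'))

# rows before the first -1 (m1 == 0) never count
df['Indicator 2'] = (m1.gt(0) & m2).astype(int)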

Output:

   cond                time  Score  Indicator 1  Indicator 2
0     A 2009-07-09 15:00:00    0.0            0            0
1     A 2009-07-09 18:33:00    1.0            0            0
2     A 2009-07-09 20:55:00    0.0            0            0
3     A 2009-07-10 00:01:00    0.0            0            0
4     A 2009-07-10 09:00:00    0.0            0            0
5     A 2009-07-10 15:00:00   -1.0            1            0
6     A 2009-07-10 18:00:00    0.0            1            0
7     A 2009-07-11 00:01:00    0.0            1            0
8     A 2009-07-12 03:10:00    1.0            1            1
9     B 2009-07-09 06:00:00    0.0            0            0
10    B 2009-07-10 15:00:00   -1.0            1            0
11    B 2009-07-11 18:00:00    0.0            1            1
12    B 2009-07-11 21:00:00    1.0            1            1
13    B 2009-07-12 00:30:00    0.0            1            1
14    B 2009-07-12 12:05:00    0.0            1            1
15    B 2009-07-12 15:00:00   -1.0            1            0
16    B 2009-07-13 21:00:00    0.0            1            1
17    B 2009-07-14 00:01:00    0.0            1            1