Given a dataframe, I'd like to remove all rows for each ID that occur before the LAST occurrence of a condition, but only for that specific ID. Alternatively/additionally, is there a way to cleanly partition the table around the occurrence of the condition?
Dates in example tables are unique and sorted, but that might not be the case for what I'm working with.
Example input table:
| Condition | ID | Data | Misc | Date |
---|---|---|---|---|---|
0 | 4 | b | 2 | 3 | 2013-01-01 00:00:00 |
1 | 4 | b | 6 | 7 | 2013-01-02 00:00:00 |
2 | 4 | a | 10 | 11 | 2013-01-03 00:00:00 |
3 | 12 | a | 14 | 15 | 2013-01-04 00:00:00 |
4 | 5 | b | 18 | 19 | 2013-01-05 00:00:00 |
5 | 5 | b | 22 | 23 | 2013-01-06 00:00:00 |
6 | 12 | b | 26 | 27 | 2013-01-07 00:00:00 |
7 | 6 | b | 30 | 31 | 2013-01-08 00:00:00 |
8 | 12 | b | 34 | 35 | 2013-01-09 00:00:00 |
9 | 4 | a | 38 | 39 | 2013-01-10 00:00:00 |
10 | 4 | a | 42 | 43 | 2013-01-11 00:00:00 |
11 | 4 | b | 46 | 47 | 2013-01-12 00:00:00 |
12 | 4 | a | 50 | 51 | 2013-01-13 00:00:00 |
13 | 0 | c | 54 | 55 | 2013-01-14 00:00:00 |
14 | 0 | c | 58 | 59 | 2013-01-15 00:00:00 |
For each of the IDs "a", "b", and "c": how can I remove all rows before the last time that Condition was greater than or equal to 12? Since "c" never gets that high, none of its rows should be removed.
I've found a few hackish ways to get the rows with the last occurrence, mostly involving groupby(), but I can't quite figure out how to do that cleanly, or how to do the final removal without iterating.
Desired output:
| Condition | ID | Data | Misc | Date |
---|---|---|---|---|---|
3 | 12 | a | 14 | 15 | 2013-01-04 00:00:00 |
8 | 12 | b | 34 | 35 | 2013-01-09 00:00:00 |
9 | 4 | a | 38 | 39 | 2013-01-10 00:00:00 |
10 | 4 | a | 42 | 43 | 2013-01-11 00:00:00 |
11 | 4 | b | 46 | 47 | 2013-01-12 00:00:00 |
12 | 4 | a | 50 | 51 | 2013-01-13 00:00:00 |
13 | 0 | c | 54 | 55 | 2013-01-14 00:00:00 |
14 | 0 | c | 58 | 59 | 2013-01-15 00:00:00 |
Code to generate the example table, plus one way I've done the removal via iteration:
import numpy as np
import pandas as pd
# Generate example table
nrows = 15
ncol = 4
dates = pd.date_range("20130101", periods=nrows)
df = pd.DataFrame(np.arange(nrows*ncol).reshape((nrows,4)), columns=["Condition","ID","Data", "Misc"])
df["Date"] = dates
np.random.seed(1)
df["ID"] = np.random.choice(["a","b"], (nrows,))
df["Condition"] = [4,4,4,12,5,5,12,6,12,4,4,4,4,0,0]
df.loc[df["Condition"] == 0, "ID"] = "c"
# Current approach requiring iteration
# Last Date at which each ID meets the condition
last_rows = (df[df['Condition'] >= 12]
             .reset_index()
             .set_index('Date', drop=False)
             .groupby('ID')
             .max()['Date'])
last_rows
for k, date in zip(last_rows.index, last_rows.values):
    df.drop(df[(df['ID'] == k) & (df.Date < date)].index, inplace=True)
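A loop-free variant I've sketched (assuming the default monotonic RangeIndex, and reproducing only the Condition and ID columns, since the others don't affect the filtering) — though I'm not sure it's the cleanest idiom:

```python
import numpy as np
import pandas as pd

# Same Condition/ID layout as the example table above
df = pd.DataFrame({
    "Condition": [4, 4, 4, 12, 5, 5, 12, 6, 12, 4, 4, 4, 4, 0, 0],
    "ID": list("bbaabbbbbaabacc"),
})

flag = df["Condition"].ge(12)
# Index of the last qualifying row, broadcast to every row of the same ID
# (NaN for IDs such as "c" that never qualify)
last = df.index.to_series().where(flag).groupby(df["ID"]).transform("max")
# Keep rows at or after the last occurrence, or all rows if there is none
keep = last.isna() | (df.index.to_series() >= last)
result = df[keep]
```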
CodePudding user response:
Get the indices of the last occurrence, per ID, of a Condition of 12 or greater:
indices = (df
           .assign(boolean = df.Condition.ge(12))
           .groupby(['ID', 'boolean'], sort = False)
           .tail(1)
           .loc[lambda df: df.boolean, 'ID']
           )
indices
3 a
8 b
Name: ID, dtype: object
Some more reshaping on the indices generates a MultiIndex of all (position, ID) pairs before the last occurrence:
indices = pd.Series(indices.index, index = indices)
indices = indices.repeat(indices)
indices = indices.groupby(indices).cumcount()
indices = pd.MultiIndex.from_arrays([indices, indices.index])
indices
MultiIndex([(0, 'a'),
(1, 'a'),
(2, 'a'),
(0, 'b'),
(1, 'b'),
(2, 'b'),
(3, 'b'),
(4, 'b'),
(5, 'b'),
(6, 'b'),
(7, 'b')],
names=[None, 'ID'])
Create a matching MultiIndex on the main df and filter out the occurrences:
temp = df.set_index('ID', append = True)
# alternative - temp.loc[~temp.index.isin(indices)]
(temp.loc[temp.index.difference(indices)]
.reset_index('ID')
.loc[:, [*df]] # restore to its original form
)
Condition ID Data Misc Date
3 12 a 14 15 2013-01-04 00:00:00
8 12 b 34 35 2013-01-09 00:00:00
9 4 a 38 39 2013-01-10 00:00:00
10 4 a 42 43 2013-01-11 00:00:00
11 4 b 46 47 2013-01-12 00:00:00
12 4 a 50 51 2013-01-13 00:00:00
13 0 c 54 55 2013-01-14 00:00:00
14 0 c 58 59 2013-01-15 00:00:00
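For convenience, the steps above can be run end-to-end as one self-contained snippet (only the Condition and ID columns are reproduced here, since the other columns just tag along; the isin alternative noted in the comment above is used because it keeps the original row order without sorting):

```python
import pandas as pd

# Same Condition/ID layout as the question's example table
df = pd.DataFrame({
    "Condition": [4, 4, 4, 12, 5, 5, 12, 6, 12, 4, 4, 4, 4, 0, 0],
    "ID": list("bbaabbbbbaabacc"),
})

# Last row per ID where Condition >= 12
indices = (df
           .assign(boolean=df.Condition.ge(12))
           .groupby(['ID', 'boolean'], sort=False)
           .tail(1)
           .loc[lambda d: d.boolean, 'ID'])

# Expand into every (position, ID) pair before the last occurrence
indices = pd.Series(indices.index, index=indices)
indices = indices.repeat(indices)
indices = indices.groupby(indices).cumcount()
indices = pd.MultiIndex.from_arrays([indices, indices.index])

# Drop those pairs from the frame
temp = df.set_index('ID', append=True)
result = (temp.loc[~temp.index.isin(indices)]
              .reset_index('ID')
              .loc[:, [*df]])
```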
CodePudding user response:
It's a little ugly, but you could filter it like this (note the + joining the two list comprehensions):
idx = df[df['Condition'] >= 12].reset_index().groupby('ID')['index'].max()
mask = np.array(
    [(df['ID'] == Id) & (df.index >= idx[Id]) for Id in idx.index]
    + [df['ID'] == Id for Id in set(df['ID']).difference(idx.index)]
).any(axis=0)
new_df = df[mask]
What this is doing is:
(i) find the index of the last time each ID value has a Condition >= 12, and save it as idx;
(ii)(a) for each value in ID that also appears in idx (in this example, a and b), build a boolean mask keeping only rows at or after that index, (b) build a mask for each ID not in idx (in this example, c) keeping all of its rows, and (c) OR all of these masks together with any(axis=0) to select new_df.
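The same idx also addresses the partitioning half of the question: each ID's rows can be split into a "before" and an "at/after" piece around the last occurrence. This is a sketch (the parts name and the dict-comprehension shape are mine; idx.get's default of -1 makes the "before" piece empty for IDs like c that never meet the condition):

```python
import pandas as pd

# Same Condition/ID layout as the question's example table
df = pd.DataFrame({
    "Condition": [4, 4, 4, 12, 5, 5, 12, 6, 12, 4, 4, 4, 4, 0, 0],
    "ID": list("bbaabbbbbaabacc"),
})

# Index of the last row per ID where Condition >= 12
idx = df[df['Condition'] >= 12].reset_index().groupby('ID')['index'].max()

# Partition each ID's rows around its last qualifying row
parts = {
    Id: (g[g.index < idx.get(Id, -1)],    # before the last occurrence
         g[g.index >= idx.get(Id, -1)])   # the occurrence and everything after
    for Id, g in df.groupby('ID')
}
```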