Partition or remove all rows in dataframe before last occurrence of a condition, for each subgroup?


Given a dataframe, I'd like to remove all rows for each ID that occur before the LAST occurrence of a condition, but only for that specific ID. Alternatively/additionally, is there a way to cleanly partition the table around the occurrence of the condition?

Dates in the example table are unique and sorted, but that might not be the case for the data I'm actually working with.

Example input table:

Condition ID Data Misc Date
0 4 b 2 3 2013-01-01 00:00:00
1 4 b 6 7 2013-01-02 00:00:00
2 4 a 10 11 2013-01-03 00:00:00
3 12 a 14 15 2013-01-04 00:00:00
4 5 b 18 19 2013-01-05 00:00:00
5 5 b 22 23 2013-01-06 00:00:00
6 12 b 26 27 2013-01-07 00:00:00
7 6 b 30 31 2013-01-08 00:00:00
8 12 b 34 35 2013-01-09 00:00:00
9 4 a 38 39 2013-01-10 00:00:00
10 4 a 42 43 2013-01-11 00:00:00
11 4 b 46 47 2013-01-12 00:00:00
12 4 a 50 51 2013-01-13 00:00:00
13 0 c 54 55 2013-01-14 00:00:00
14 0 c 58 59 2013-01-15 00:00:00

For each of the IDs "a", "b", and "c", how can I remove all rows before the last time Condition was greater than or equal to 12? Since "c" never reaches that threshold, none of its rows should be removed.

I've found how to get the rows with the last occurrence via a few hackish ways, mostly involving groupby(), but can't quite figure out how to do that cleanly or do the final removal without iterating.

Desired output:

Condition ID Data Misc Date
3 12 a 14 15 2013-01-04 00:00:00
8 12 b 34 35 2013-01-09 00:00:00
9 4 a 38 39 2013-01-10 00:00:00
10 4 a 42 43 2013-01-11 00:00:00
11 4 b 46 47 2013-01-12 00:00:00
12 4 a 50 51 2013-01-13 00:00:00
13 0 c 54 55 2013-01-14 00:00:00
14 0 c 58 59 2013-01-15 00:00:00

Code for generating the example table, and one way I've done the removal via iteration:

import numpy as np
import pandas as pd

# Generate example table
nrows = 15
ncol = 4
dates = pd.date_range("20130101", periods=nrows)
df = pd.DataFrame(np.arange(nrows*ncol).reshape((nrows,4)), columns=["Condition","ID","Data", "Misc"])
df["Date"] = dates
np.random.seed(1)
df["ID"] = np.random.choice(["a","b"], (nrows,))
df["Condition"] = [4,4,4,12,5,5,12,6,12,4,4,4,4,0,0]
df.loc[df["Condition"] == 0, "ID"] = "c"

# Current approach, requiring iteration:
# find the last qualifying date per ID, then drop earlier rows one ID at a time
last_rows = (df[df['Condition'] >= 12]
             .reset_index()
             .set_index('Date', drop=False)
             .groupby('ID')['Date']
             .max())
for k, date in zip(last_rows.index, last_rows.values):
    df.drop(df[(df['ID'] == k) & (df.Date < date)].index, inplace=True)
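For reference, the same date-based cutoff can be applied without the loop by mapping each row's ID to its ID's last qualifying date; IDs that never reach the condition map to NaT and are kept. This is only a sketch of that idea, not from the question itself, and it hardcodes the ID column to match the table above instead of relying on the random seed:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame; IDs hardcoded to match the question's table
df = pd.DataFrame({
    "Condition": [4, 4, 4, 12, 5, 5, 12, 6, 12, 4, 4, 4, 4, 0, 0],
    "ID": list("bbaabbbbbaabacc"),
    "Data": np.arange(2, 61, 4),
    "Misc": np.arange(3, 62, 4),
    "Date": pd.date_range("20130101", periods=15),
})

# Last date per ID at which Condition >= 12
last_dates = df.loc[df["Condition"].ge(12)].groupby("ID")["Date"].max()

# Map each row's ID to its cutoff date; IDs that never reach 12 map to NaT
cutoff = df["ID"].map(last_dates)

# Keep rows at or after the cutoff, plus all rows of IDs with no cutoff
out = df[cutoff.isna() | (df["Date"] >= cutoff)]
```

Note this compares by Date, like the question's loop, so it assumes dates are comparable within each ID; if dates repeat, rows tied with the cutoff date are kept.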

CodePudding user response:

Get the indices where the last occurrence of Condition >= 12 happens for each ID:

indices = (df
            .assign(boolean = df.Condition.ge(12))
            .groupby(['ID', 'boolean'], sort = False)
            .tail(1)
            .loc[lambda df: df.boolean, 'ID']
           )

indices

3    a
8    b
Name: ID, dtype: object

Some more reshaping on the indices, to generate a MultiIndex of all values before each last occurrence:

indices = pd.Series(indices.index, index = indices)
indices = indices.repeat(indices)
indices = indices.groupby(indices).cumcount()
indices = pd.MultiIndex.from_arrays([indices, indices.index])

indices

MultiIndex([(0, 'a'),
            (1, 'a'),
            (2, 'a'),
            (0, 'b'),
            (1, 'b'),
            (2, 'b'),
            (3, 'b'),
            (4, 'b'),
            (5, 'b'),
            (6, 'b'),
            (7, 'b')],
           names=[None, 'ID'])

Create a matching MultiIndex on the main df and filter out those occurrences:

temp = df.set_index('ID', append = True)

# alternative - temp.loc[~temp.index.isin(indices)] 
(temp.loc[temp.index.difference(indices)]
     .reset_index('ID')
     .loc[:, [*df]] # restore to its original form
)

    Condition ID  Data  Misc                 Date
3          12  a    14    15  2013-01-04 00:00:00
8          12  b    34    35  2013-01-09 00:00:00
9           4  a    38    39  2013-01-10 00:00:00
10          4  a    42    43  2013-01-11 00:00:00
11          4  b    46    47  2013-01-12 00:00:00
12          4  a    50    51  2013-01-13 00:00:00
13          0  c    54    55  2013-01-14 00:00:00
14          0  c    58    59  2013-01-15 00:00:00

CodePudding user response:

It's a little ugly but you could filter it like:

idx = df[df['Condition'] >= 12].reset_index().groupby('ID')['index'].max()
mask = np.array([(df['ID'] == Id) & (df.index >= idx[Id]) for Id in idx.index]
                + [df['ID'] == Id for Id in set(df['ID']).difference(idx.index)]).any(axis=0)
new_df = df[mask]

What this does:

(i) find the indices of the last times an ID value has a Condition >=12 and save it as idx.

(ii)(a) For each value in ID that also appears in idx (in this example, a and b), filter by index, (b) filter the IDs not in idx (in this example, c), and concatenate all of these dataframes as new_df.
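The keep-from-the-last-hit logic above can also be expressed as a single groupby/transform, since within each group the cumulative sum of the hit mask first reaches its total exactly at the last hit. This is a sketch of an alternative, not either answer's code, and it hardcodes the IDs to match the question's table:

```python
import numpy as np
import pandas as pd

# Example frame matching the question's table (IDs hardcoded for reproducibility)
df = pd.DataFrame({
    "Condition": [4, 4, 4, 12, 5, 5, 12, 6, 12, 4, 4, 4, 4, 0, 0],
    "ID": list("bbaabbbbbaabacc"),
    "Data": np.arange(2, 61, 4),
    "Misc": np.arange(3, 62, 4),
    "Date": pd.date_range("20130101", periods=15),
})

def from_last_hit(s):
    # True from the last position where Condition >= 12 onward;
    # all True when the group never reaches 12 (cumsum == sum == 0)
    hit = s.ge(12)
    return hit.cumsum().eq(hit.sum())

mask = df.groupby("ID")["Condition"].transform(from_last_hit).astype(bool)
new_df = df[mask]
```

This keeps the row order and index intact and handles the "c" group (no hit) without a separate case.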
