I have a dataframe where, for each id, I want to keep the rows up to and including the first time status is 1. The solution presented in the linked thread works perfectly but is very slow; my dataset has 70,000 rows.
My dataset:
import pandas as pd

d = {'id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
     'status': [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1]}
df = pd.DataFrame(data=d)
    id  status
0    1       0
1    1       0
2    1       0
3    1       0
4    1       1
5    1       1
6    1       1
7    2       0
8    2       0
9    2       0
10   2       0
11   2       1
12   2       0
13   2       1
The desired subset I want is this:
   id  status
0   1       0
1   1       0
2   1       0
3   1       0
4   1       1
5   2       0
6   2       0
7   2       0
8   2       0
9   2       1
I tried the solution from the related thread:
lambda x: x.cumsum().cumsum().le(1)
It works, but it is very slow.
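For reference, the fragment above is presumably applied per group as a row mask, along these lines (a sketch using the sample data from the question; transform keeps the mask aligned with the original index):

```python
import pandas as pd

d = {'id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
     'status': [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1]}
df = pd.DataFrame(data=d)

# Double cumsum: the running sum stays <= 1 up to and including
# the first status == 1 in each group, then exceeds it.
mask = df.groupby('id')['status'].transform(lambda x: x.cumsum().cumsum().le(1))
out = df[mask]
```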
CodePudding user response:
You can parallelize the solution from that thread using parallel-pandas. Here is a usage example:
import pandas as pd

# pip install parallel-pandas
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
ParallelPandas.initialize(n_cpu=8)

# p_apply is the parallel analogue of the apply method
df = (df.groupby('id', group_keys=False)
        .p_apply(lambda x: x[x.status.cumsum().cumsum().le(1)])
        .reset_index(drop=True))
CodePudding user response:
You can try:
df[~df.groupby('id', group_keys=False)['status'].apply(lambda s: s.shift().eq(1).cummax())]
Getting the cummax is probably faster than computing a cumsum.
Or, assuming only 0/1 values:
df.groupby('id', group_keys=False).apply(lambda g: g.loc[:g['status'].idxmax()])
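For completeness, the first snippet can be run end-to-end on the sample data to check that it reproduces the desired subset (the reset_index call is only cosmetic):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
                   'status': [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1]})

# The mask is True from the row *after* the first 1 onward in each group,
# so ~mask keeps everything up to and including that first 1.
mask = df.groupby('id', group_keys=False)['status'].apply(
    lambda s: s.shift().eq(1).cummax())
out = df[~mask].reset_index(drop=True)
```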
CodePudding user response:
Here is a possible solution. It can definitely be improved, but it works.
df.groupby("id")['status'].apply(lambda x: x[:x.values.tolist().index(1) + 1]).swaplevel()
Explanation:
.apply(lambda x: x[:x.values.tolist().index(1) + 1])
This line is composed of two main components.
We look for the first occurrence of the value 1 and get its position with x.values.tolist().index(1).
Then we add 1 so that the slice includes the value we found.
We then slice each group from the beginning up to and including that position.
This is done for each id group.
VERSION 2
This version handles the case where no 1 is found in the group's values:
def magic(x):
    values = x.values.tolist()
    if 1 in values:
        return x[:values.index(1) + 1]
    else:
        return x

df.groupby("id")['status'].apply(magic)
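A quick sketch of the no-1 case (the three-id sample below is made up for illustration): a group whose status never reaches 1 is returned untouched by magic, while the other groups are cut at their first 1:

```python
import pandas as pd

def magic(x):
    values = x.values.tolist()
    if 1 in values:
        return x[:values.index(1) + 1]
    else:
        return x

# id 3 never reaches status 1, so all three of its rows are kept;
# ids 1 and 2 are truncated at their first 1.
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3, 3],
                   'status': [0, 1, 1, 0, 1, 0, 0, 0]})
out = df.groupby('id')['status'].apply(magic)
```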