Home > Software design >  quick way to subset pandas dataframe up to when column condition is met the first time
quick way to subset pandas dataframe up to when column condition is met the first time

Time:12-13

Related to this question

I have a dataframe where I want rows for each id till status is 1 for first time. The solution presented in the above thread works perfectly but is very slow. I have 70,000 rows in my dataset.

My dataset:

d = {'id': [1,1,1,1,1,1,1,2,2,2,2,2,2,2], 'status': [0,0,0,0,1,1,1,0,0,0,0,1,0,1]}
df = pd.DataFrame(data=d)

    id  status
0    1       0
1    1       0
2    1       0
3    1       0
4    1       1
5    1       1
6    1       1
7    2       0
8    2       0
9    2       0
10   2       0
11   2       1
12   2       0
13   2       1

The desired subset I want is this:

    id  status
0    1       0
1    1       0
2    1       0
3    1       0
4    1       1
5    2       0
6    2       0
7    2       0
8    2       0
9    2       1

Tried the solution in the related thread:

lambda x: x.cumsum().cumsum().le(1)

works but that is very slow.

CodePudding user response:

You can parallelize the solution in this thread using parallel-pandas. Here is a usage example

import pandas as pd
#pip install parallel-pandas
from parallel_pandas import ParallelPandas

#initialize parallel-pandas
ParallelPandas.initialize(n_cpu=8)

#p_apply is parallel analogue of apply method
df = df.groupby('id', group_keys=False)\
       .p_apply(lambda x: x[x.status.cumsum().cumsum().le(1)])\
       .reset_index(drop=1)

CodePudding user response:

You can try:

df[~df.groupby('id', group_keys=False)['status'].apply(lambda s: s.shift().eq(1).cummax())]

Getting the cummax is probably faster than a cumsum

Or, assuming only 0/1 values:

df.groupby('id', group_keys=False).apply(lambda g: g.loc[:g['status'].idxmax()])

CodePudding user response:

Here a possible solution. This can be definitely improved, but it works.

df.groupby("id")['status'].apply(lambda x: x[:x.values.tolist().index(1)   1]).swaplevel()

Explanation:

.apply(lambda x: x[:x.values.tolist().index(1)   1])

This line is composed of two main components.

We look for first occurence of value 1 and get its index x.values.tolist().index(1) Then we add 1 so we can include number that we found. We slice everything from beginning to the end.

This is done for each id group.

VERSION 2

This version handles cases where there is no 1 found in group values:

def magic(x):
    values = x.values.tolist()
    if 1 in values:
        return x[:values.index(1)   1]
    else:
        return x

df.groupby("id")['status'].apply(magic)
  • Related