I have a dataframe where, for each id, I want to keep the rows up to and including the first time status is 1. The solution presented in the linked thread works perfectly but is very slow; my dataset has 70,000 rows.
My dataset:
import pandas as pd

d = {'id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
     'status': [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1]}
df = pd.DataFrame(data=d)
    id  status
0    1       0
1    1       0
2    1       0
3    1       0
4    1       1
5    1       1
6    1       1
7    2       0
8    2       0
9    2       0
10   2       0
11   2       1
12   2       0
13   2       1
The desired subset I want is this:
   id  status
0   1       0
1   1       0
2   1       0
3   1       0
4   1       1
5   2       0
6   2       0
7   2       0
8   2       0
9   2       1
I tried the solution from the related thread:
lambda x: x.cumsum().cumsum().le(1)
It works, but it is very slow.
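For reference, the fragment above is presumably applied per group as a row mask, along these lines (a sketch using the sample data from the question; transform keeps the mask aligned with the original index):

```python
import pandas as pd

d = {'id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
     'status': [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1]}
df = pd.DataFrame(data=d)

# Double cumsum: the running sum stays <= 1 up to and including
# the first status == 1 in each group, then exceeds it.
mask = df.groupby('id')['status'].transform(lambda x: x.cumsum().cumsum().le(1))
out = df[mask]
```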
CodePudding user response:
You can parallelize the solution from that thread using parallel-pandas. Here is a usage example:
import pandas as pd

# pip install parallel-pandas
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
ParallelPandas.initialize(n_cpu=8)

# p_apply is the parallel analogue of the apply method
df = (df.groupby('id', group_keys=False)
        .p_apply(lambda x: x[x.status.cumsum().cumsum().le(1)])
        .reset_index(drop=True))
CodePudding user response:
You can try:
df[~df.groupby('id', group_keys=False)['status'].apply(lambda s: s.shift().eq(1).cummax())]
Getting the cummax is probably faster than computing a cumsum.
Or, assuming only 0/1 values:
df.groupby('id', group_keys=False).apply(lambda g: g.loc[:g['status'].idxmax()])
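For completeness, the first snippet can be run end-to-end on the sample data to check that it reproduces the desired subset (the reset_index call is only cosmetic):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
                   'status': [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1]})

# The mask is True from the row *after* the first 1 onward in each group,
# so ~mask keeps everything up to and including that first 1.
mask = df.groupby('id', group_keys=False)['status'].apply(
    lambda s: s.shift().eq(1).cummax())
out = df[~mask].reset_index(drop=True)
```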
CodePudding user response:
Here is a possible solution. It can definitely be improved, but it works.
df.groupby("id")['status'].apply(lambda x: x[:x.values.tolist().index(1) + 1]).swaplevel()
Explanation:
.apply(lambda x: x[:x.values.tolist().index(1) + 1])
This line is composed of two main components.
We look for the first occurrence of the value 1 and get its position with x.values.tolist().index(1).
Then we add 1 so that the slice includes the value we found.
We then slice each group from the beginning up to and including that position.
This is done for each id group.
VERSION 2
This version handles the case where no 1 is found in the group's values:
def magic(x):
    values = x.values.tolist()
    if 1 in values:
        return x[:values.index(1) + 1]
    else:
        return x

df.groupby("id")['status'].apply(magic)
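A quick sketch of the no-1 case (the three-id sample below is made up for illustration): a group whose status never reaches 1 is returned untouched by magic, while the other groups are cut at their first 1:

```python
import pandas as pd

def magic(x):
    values = x.values.tolist()
    if 1 in values:
        return x[:values.index(1) + 1]
    else:
        return x

# id 3 never reaches status 1, so all three of its rows are kept;
# ids 1 and 2 are truncated at their first 1.
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3, 3],
                   'status': [0, 1, 1, 0, 1, 0, 0, 0]})
out = df.groupby('id')['status'].apply(magic)
```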