Home > Blockchain >  how to select group of rows from a dataframe if all rows follow a sequence
how to select group of rows from a dataframe if all rows follow a sequence

Time:08-18

I'm currently working on a dataframe that has processes (based on ID) that may or not reach the end of the process. The end of the process is defined as the activity which has index=6. What i need to do is to filter those processes (ID) based on the fact they are completed, which means all 6 the activities are done (so in the process we'll have activities which have index equal to 1,2,3,4,5 and 6 in this specific order).

the dataframe is structured as follows:

ID          A  index           
1   activity1      1 
1   activity2      2    
1   activity3      3    
1   activity4      4    
1   activity5      5    
1   activity6      6    
2   activity7      1    
2   activity8      2    
2   activity9      3    
3   activity10     1    
3   activity11     2    
3   activity12     3  
3   activity13     4    
3   activity14     5    
3   activity15     6    

And the resulting dataframe should be:

ID          A   index           
1   activity1      1 
1   activity2      2    
1   activity3      3    
1   activity4      4    
1   activity5      5    
1   activity6      6    
3   activity10     1    
3   activity11     2    
3   activity12     3  
3   activity13     4    
3   activity14     5    
3   activity15     6    

I've tried to do so working with sum(), creating a new column 'a' and checking if the sum of every group was greater than 20 (which means taking groups in which the sum() is at least 21, which is the sum of 1,2,3,4,5,6) with the function gt().

df['a'] = df['index'].groupby(df['index']).sum()
df2 = df[df['a'].gt(20)] 

Probably this isn't the best approach, so also other approaches are more than welcome. Any idea on how to select rows based on this condition?

CodePudding user response:

this may not be the fastest method, especially on a large dataframe, but it does the job

df = df.loc[df.groupby(['ID'])['index'].transform(lambda x: list(x)==list(range(1,7)))]

Or this other variation:

df = df.loc[df.groupby('ID')['index'].filter(lambda x: list(x)==list(range(1,7))).index]

Output:


ID  A   index
0   1   activity1   1
1   1   activity2   2
2   1   activity3   3
3   1   activity4   4
4   1   activity5   5
5   1   activity6   6
9   3   activity10  1
10  3   activity11  2
11  3   activity12  3
12  3   activity13  4
13  3   activity14  5
14  3   activity15  6

CodePudding user response:

Another possible solution:

out = (df.groupby('ID')
       .filter(lambda g: (len(g['index']) == 6) and 
       (g['index'].eq([*range(1,7)]).all())))

print(out)

   ID           A  index
0    1   activity1      1
1    1   activity2      2
2    1   activity3      3
3    1   activity4      4
4    1   activity5      5
5    1   activity6      6
9    3  activity10      1
10   3  activity11      2
11   3  activity12      3
12   3  activity13      4
13   3  activity14      5
14   3  activity15      6
  • Related