Home > Software design >  How to filter rows by group in a pandas dataframe
How to filter rows by group in a pandas dataframe

Time:03-02

Suppose now I have some group data like

GroupID ID Rank target
A 1 1 0
A 2 3 0
A 3 2 1
B 1 1 0
B 2 4 0
B 3 3 1
B 4 2 0
C 1 1 1
C 2 4 0
C 3 3 1
C 4 2 0
D 1 1 0
D 2 4 0
D 3 3 0
D 4 2 0

For each group,

  1. I want to filter the group which has no rows which target=1.

  2. Then I want to keep the row which target==1 and the rows which rank is higher than it. Some group may have many rows which target==1, and we choose the one which rank is lower as our target. For example for group C, the ID=1 and ID=3 all have target==1, we will keep the rows which the rank<=3. So we will get

GroupID ID Rank target
A 1 1 0
A 3 2 1
B 1 1 0
B 3 3 1
B 4 2 0
C 1 1 1
C 3 3 1
C 4 2 0

CodePudding user response:

Replace Rank in Series.where if target is not 1 and then use GroupBy.transform for maximal Rank per group, so possible compare Rank column in boolean indexing by Series.le for less or equal:

s = df['Rank'].where(df['target'].eq(1)).groupby(df['GroupID']).transform('max')
df = df[df['Rank'].le(s)]
print (df)
   GroupID  ID  Rank  target
0        A   1     1       0
2        A   3     2       1
3        B   1     1       0
5        B   3     3       1
6        B   4     2       0
7        C   1     1       1
9        C   3     3       1
10       C   4     2       0

Details:

print (df['Rank'].where(df['target'].eq(1)))
0     NaN
1     NaN
2     2.0
3     NaN
4     NaN
5     3.0
6     NaN
7     1.0
8     NaN
9     3.0
10    NaN
Name: Rank, dtype: float64

print (s)
0     2.0
1     2.0
2     2.0
3     3.0
4     3.0
5     3.0
6     3.0
7     3.0
8     3.0
9     3.0
10    3.0
Name: Rank, dtype: float64

CodePudding user response:

IIUC, make a first pass to slice the rows with target == 1 (using eq), then get the max rank per group using GroupBy.max and select the rows with this maximum rank per group with classical boolean indexing using le:

thresh = df[df['target'].eq(1)].groupby('GroupID')['Rank'].max()

out = df[df['Rank'].le(df['GroupID'].map(thresh))]

output:

   GroupID  ID  Rank  target
0        A   1     1       0
2        A   3     2       1
3        B   1     1       0
5        B   3     3       1
6        B   4     2       0
7        C   1     1       1
9        C   3     3       1
10       C   4     2       0

thresholds:

>>> thresh
GroupID
A    2
B    3
C    3
  • Related