Home > Back-end >  Select a subset of a dataframe based on conditions : category and note
Select a subset of a dataframe based on conditions : category and note

Time:10-14

I am trying to Select a subset of a dataframe where following conditions are satisfied:

  • for same category keep only the row with highest note,
  • if category=na keep the row

Here's my dataframe example :

enter image description here

The expected result:

enter image description here

What is the efficient way to do it? Thank you

CodePudding user response:

Use DataFrame.sort_values with mask chained DataFrame.duplicated with bitwise OR for missing rows by same column:

df1 = df.sort_values(['category_id','note'])

df1 = df1[~df1.duplicated(['category_id'], keep='last') | 
           df1['category_id'].isna()].sort_index()
print (df1)
   book_id category_id  note
1      id2          c1     5
4      id5         NaN     1
5      id6         NaN     7
6      id7          c2     6
7      id8          c3     2
8      id9         NaN     8
9     id10         NaN     4
10    id11         NaN     9

Or use Series.fillna with range (necessary non integers in category_id) and then use DataFrameGroupBy.idxmax:

s = df['category_id'].fillna(pd.Series(range(len(df)), index=df.index))

df1 = df.loc[df.groupby(s)['note'].idxmax()].sort_index()
print (df1)
   book_id category_id  note
1      id2          c1     5
4      id5         NaN     1
5      id6         NaN     7
6      id7          c2     6
7      id8          c3     2
8      id9         NaN     8
9     id10         NaN     4
10    id11         NaN     9       

CodePudding user response:

Sorting has a O(n*logn) complexity, so better use a linear time approach when you can.

You can use boolean indexing with two masks:

# is the row a NA?
m1 = df['category_id'].isna()
# is the row the max value for a non NA?
m2 = df.index.isin(df.groupby('category_id')['note'].idxmax())
# or if you want to keep all max rows if several
# m2 = df['note'].eq(df.groupby('category_id')['note'].transform('max'))

# keep if any condition is met
out = df.loc[m1|m2]

output:

   book_id category_id  note
1      id2          c1     5
4      id5         NaN     1
5      id6         NaN     7
6      id7          c2     6
7      id8          c3     2
8      id9         NaN     8
9     id10         NaN     4
10    id11         NaN     9

how does it work?

Here are the intermediates for the boolean masks, ultimately only the rows with True in the last mask are kept:

   book_id category_id  note  is na?  is max?  is either?
0      id1          c1     2   False    False       False
1      id2          c1     5   False     True        True
2      id3          c1     3   False    False       False
3      id4          c2     4   False    False       False
4      id5         NaN     1    True    False        True
5      id6         NaN     7    True    False        True
6      id7          c2     6   False     True        True
7      id8          c3     2   False     True        True
8      id9         NaN     8    True    False        True
9     id10         NaN     4    True    False        True
10    id11         NaN     9    True    False        True
  • Related