Pandas - Drop rows based on multiple columns including max/min

Time:06-25

I have a pandas DataFrame where I used groupby.ngroup() to identify groups of related data (basically duplicated data, but not exactly because that would have been too easy...).

DisID  BunchData  GroupID
 1000        xyz        1
 2012        abc        2
 2014        abc        2
 3000        def        3

I am trying to figure out how to remove all rows EXCEPT the one with the max "DisID" within each GroupID, but only when a GroupID contains more than one row. In this case, the output would look like:

DisID  BunchData  GroupID
 1000        xyz        1
 2014        abc        2
 3000        def        3

Thanks!

Updated with a snippet of the actual code that is throwing a TypeError on ~s:

docsDF['GroupID'] = docsDF.groupby(['ExecutionTimestamp', 'Notional', 'Trade_type']).ngroup()

s = docsDF.sort_values("DisseminationID").drop_duplicates(['GroupID'])
docsDF.drop(s.index[~s],inplace=True)

Error on last line is:

TypeError: bad operand type for unary ~: 'str'
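
A quick sketch of why the ~ fails here (column values reconstructed from the post, so treat this as an assumption): drop_duplicates returns a DataFrame, and unary ~ on its string column raises the TypeError, whereas duplicated returns a boolean Series that ~ can invert.

```python
import pandas as pd

df = pd.DataFrame({"DisseminationID": [1000, 2012, 2014],
                   "BunchData": ["xyz", "abc", "abc"],
                   "GroupID": [1, 2, 2]})

# drop_duplicates returns a DataFrame, so ~s tries to invert every
# column, and bitwise NOT on a string column raises the TypeError
s = df.sort_values("DisseminationID").drop_duplicates(["GroupID"])
try:
    ~s
except TypeError as e:
    print(e)  # bad operand type for unary ~: 'str'

# duplicated returns a boolean Series, which ~ inverts cleanly
mask = df.sort_values("DisseminationID").duplicated(["GroupID"])
print(mask.tolist())
```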

CodePudding user response:

You can get the index of the min DisID per group, identify the groups of size greater than 1, and drop the found indices:

g = df.groupby('BunchData')['DisID']
# index of min
s1 = g.idxmin()
# size of each group
s2 = g.size()

# indices of groups larger than 1
idx = s1.loc[s2[s2.gt(1)].index]

out = df.drop(idx)
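
Run end-to-end on the question's sample data (a self-contained sketch; the DataFrame is reconstructed from the post):

```python
import pandas as pd

df = pd.DataFrame({"DisID": [1000, 2012, 2014, 3000],
                   "BunchData": ["xyz", "abc", "abc", "def"],
                   "GroupID": [1, 2, 2, 3]})

g = df.groupby('BunchData')['DisID']
s1 = g.idxmin()   # index label of the min DisID per group
s2 = g.size()     # number of rows per group

# only drop the min-DisID row where the group has more than one row
idx = s1.loc[s2[s2.gt(1)].index]
out = df.drop(idx)
```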

Output:

   DisID BunchData  GroupID
0   1000       xyz        1
2   2014       abc        2
3   3000       def        3

CodePudding user response:

Let us do sort_values then drop_duplicates

df = df.sort_values('DisID').drop_duplicates(['GroupID'],keep='last')
Out[170]: 
   DisID BunchData  GroupID
0   1000       xyz        1
2   2014       abc        2
3   3000       def        3
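
As a self-contained sketch on the question's sample data (DataFrame reconstructed from the post):

```python
import pandas as pd

df = pd.DataFrame({"DisID": [1000, 2012, 2014, 3000],
                   "BunchData": ["xyz", "abc", "abc", "def"],
                   "GroupID": [1, 2, 2, 3]})

# sort so the max DisID comes last within each GroupID,
# then keep only that last row per GroupID
out = df.sort_values('DisID').drop_duplicates(['GroupID'], keep='last')
```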

Update for the comments

s = df.sort_values('DisID').duplicated(['GroupID'])
df.drop(s.index[~s],inplace=True)
Out[183]: 
    DisID BunchData  GroupID
2    2014       abc        2
4    2000       xyz        1
5    4024       abc        2
6    4028       abc        2
7    6000       def        3
8    4000       xyz        1
9    8048       abc        2
10   8056       abc        2
11  12000       def        3
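
Expanding that update into a runnable sketch (the 12-row DataFrame is reconstructed from the "more complicated dataset" shown in another answer, so its values are an assumption): duplicated marks the first-seen, i.e. smallest, DisID of each GroupID as False, and ~s flips the mask so exactly those rows are dropped.

```python
import pandas as pd

df = pd.DataFrame({
    "DisID": [1000, 2012, 2014, 3000, 2000, 4024,
              4028, 6000, 4000, 8048, 8056, 12000],
    "BunchData": ["xyz", "abc", "abc", "def", "xyz", "abc",
                  "abc", "def", "xyz", "abc", "abc", "def"],
    "GroupID": [1, 2, 2, 3, 1, 2, 2, 3, 1, 2, 2, 3],
})

# duplicated gives a boolean Series over the sorted rows;
# ~s selects the non-duplicated (min DisID) row of each group to drop
s = df.sort_values('DisID').duplicated(['GroupID'])
df.drop(s.index[~s], inplace=True)
```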

CodePudding user response:

Update:

Removing all rows except the max is the same as just keeping the max of each group.

df.groupby('GroupID', as_index=False).max()
# OR - If the data is more complicated:
df.loc[df.groupby('GroupID')['DisID'].idxmax()]

  • Groups by GroupID
  • Gets aggregate information about DisID
    • idxmin
    • size
  • Filters out 1 size groups
  • Drops the remaining idxmin indices.
df.drop(df.groupby('GroupID')['DisID'].agg(['idxmin', 'size'])[lambda x: x['size'].gt(1)]['idxmin'])

Output:

   DisID BunchData  GroupID
0   1000       xyz        1
2   2014       abc        2
3   3000       def        3
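
Both routes can be run end-to-end on the sample data (a sketch; the DataFrame is reconstructed from the question). On this 4-row sample the two coincide; they differ once a group has three or more rows, as the comparison below shows.

```python
import pandas as pd

df = pd.DataFrame({"DisID": [1000, 2012, 2014, 3000],
                   "BunchData": ["xyz", "abc", "abc", "def"],
                   "GroupID": [1, 2, 2, 3]})

# route 1: keep only the max-DisID row of each group
keep_max = df.loc[df.groupby('GroupID')['DisID'].idxmax()]

# route 2: drop only the min-DisID row of groups larger than one
stats = df.groupby('GroupID')['DisID'].agg(['idxmin', 'size'])
drop_min = df.drop(stats[stats['size'].gt(1)]['idxmin'])
```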

A more complicated dataset:

    DisID BunchData  GroupID
0    1000       xyz        1
1    2012       abc        2
2    2014       abc        2
3    3000       def        3
4    2000       xyz        1
5    4024       abc        2
6    4028       abc        2
7    6000       def        3
8    4000       xyz        1
9    8048       abc        2
10   8056       abc        2
11  12000       def        3

Expected (and received) Output:

    DisID BunchData  GroupID
2    2014       abc        2
4    2000       xyz        1
5    4024       abc        2
6    4028       abc        2
7    6000       def        3
8    4000       xyz        1
9    8048       abc        2
10   8056       abc        2
11  12000       def        3

Output from accepted answer:

    DisID BunchData  GroupID
8    4000       xyz        1
10   8056       abc        2
11  12000       def        3