I have a pandas DataFrame where I used groupby.ngroup() to identify groups of related data (basically duplicated data, but not exactly because that would have been too easy...).
DisID | BunchData | GroupID |
---|---|---|
1000 | xyz | 1 |
2012 | abc | 2 |
2014 | abc | 2 |
3000 | def | 3 |
I am trying to figure out how to remove all rows EXCEPT the max "DisID" within a GroupID, only if there exists more than one row in a GroupID. In this case, the output would look like:
DisID | BunchData | GroupID |
---|---|---|
1000 | xyz | 1 |
2014 | abc | 2 |
3000 | def | 3 |
Thanks!
Updated with a snippet of the actual code, which throws a TypeError on ~s:
docsDF['GroupID'] = docsDF.groupby(['ExecutionTimestamp','Notional','Trade_type']).ngroup()
s = docsDF.sort_values("DisseminationID").drop_duplicates(['GroupID'])
docsDF.drop(s.index[~s],inplace=True)
Error on last line is:
TypeError: bad operand type for unary ~: 'str'
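(For context: the unary ~ fails because drop_duplicates returns a DataFrame, so ~s attempts an elementwise invert of every column, including the string column. A minimal sketch reproducing the error, with the sample frame rebuilt from the question's table:)

```python
import pandas as pd

df = pd.DataFrame({"DisID": [1000, 2012, 2014, 3000],
                   "BunchData": ["xyz", "abc", "abc", "def"],
                   "GroupID": [1, 2, 2, 3]})

# drop_duplicates on a DataFrame returns a DataFrame, not a boolean mask
s = df.sort_values("DisID").drop_duplicates(["GroupID"])

err = None
try:
    ~s  # elementwise invert reaches the str values in BunchData
except TypeError as e:
    err = e
```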
CodePudding user response:
You can get the index of the min DisID per group, identify the groups of size greater than 1, and drop the found indices:
g = df.groupby('BunchData')['DisID']
# index of min
s1 = g.idxmin()
# size of each group
s2 = g.size()
# indices of groups larger than 1
idx = s1.loc[s2[s2.gt(1)].index]
out = df.drop(idx)
Output:
DisID BunchData GroupID
0 1000 xyz 1
2 2014 abc 2
3 3000 def 3
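A self-contained run of the steps above, with the sample frame reconstructed from the question's table:

```python
import pandas as pd

df = pd.DataFrame({"DisID": [1000, 2012, 2014, 3000],
                   "BunchData": ["xyz", "abc", "abc", "def"],
                   "GroupID": [1, 2, 2, 3]})

g = df.groupby("BunchData")["DisID"]
s1 = g.idxmin()                    # index label of the min DisID per group
s2 = g.size()                      # number of rows per group
idx = s1.loc[s2[s2.gt(1)].index]   # min-row labels for groups larger than 1
out = df.drop(idx)
```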
CodePudding user response:
Let us do sort_values, then drop_duplicates:
df = df.sort_values('DisID').drop_duplicates(['GroupID'], keep='last')
Out[170]:
DisID BunchData GroupID
0 1000 xyz 1
2 2014 abc 2
3 3000 def 3
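For completeness, a runnable version with the sample frame rebuilt from the question:

```python
import pandas as pd

df = pd.DataFrame({"DisID": [1000, 2012, 2014, 3000],
                   "BunchData": ["xyz", "abc", "abc", "def"],
                   "GroupID": [1, 2, 2, 3]})

# after sorting by DisID, keep='last' retains the largest DisID per GroupID
out = df.sort_values("DisID").drop_duplicates(["GroupID"], keep="last")
```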
Update for the comments:
s = df.sort_values('DisID').duplicated(['GroupID'])
df.drop(s.index[~s],inplace=True)
Out[183]:
DisID BunchData GroupID
2 2014 abc 2
4 2000 xyz 1
5 4024 abc 2
6 4028 abc 2
7 6000 def 3
8 4000 xyz 1
9 8048 abc 2
10 8056 abc 2
11 12000 def 3
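The same steps, runnable on the larger frame (values reconstructed from the output table above):

```python
import pandas as pd

df = pd.DataFrame({
    "DisID": [1000, 2012, 2014, 3000, 2000, 4024,
              4028, 6000, 4000, 8048, 8056, 12000],
    "BunchData": ["xyz", "abc", "abc", "def", "xyz", "abc",
                  "abc", "def", "xyz", "abc", "abc", "def"],
    "GroupID": [1, 2, 2, 3, 1, 2, 2, 3, 1, 2, 2, 3],
})

# rows NOT marked as duplicates are the first (smallest DisID) per group
s = df.sort_values("DisID").duplicated(["GroupID"])
df.drop(s.index[~s], inplace=True)
```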
CodePudding user response:
Update:
Removing all rows except the max is the same as keeping only the max of each group.
df.groupby('GroupID', as_index=False).max()
# OR - If the data is more complicated:
df.loc[df.groupby('GroupID')['DisID'].idxmax()]
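A quick check of the idxmax variant against the question's sample frame:

```python
import pandas as pd

df = pd.DataFrame({"DisID": [1000, 2012, 2014, 3000],
                   "BunchData": ["xyz", "abc", "abc", "def"],
                   "GroupID": [1, 2, 2, 3]})

# idxmax returns the index label of the max DisID in each group,
# so .loc keeps exactly one whole row per GroupID
out = df.loc[df.groupby("GroupID")["DisID"].idxmax()]
```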
- Groups by GroupID
- Gets aggregate information about DisID: idxmin and size
- Filters out the size-1 groups
- Drops the remaining idxmin indices.
df.drop(df.groupby('GroupID')['DisID'].agg(['idxmin', 'size'])[lambda x: x['size'].gt(1)]['idxmin'])
Output:
DisID BunchData GroupID
0 1000 xyz 1
2 2014 abc 2
3 3000 def 3
A more complicated dataset:
DisID BunchData GroupID
0 1000 xyz 1
1 2012 abc 2
2 2014 abc 2
3 3000 def 3
4 2000 xyz 1
5 4024 abc 2
6 4028 abc 2
7 6000 def 3
8 4000 xyz 1
9 8048 abc 2
10 8056 abc 2
11 12000 def 3
Expected (and received) Output:
DisID BunchData GroupID
2 2014 abc 2
4 2000 xyz 1
5 4024 abc 2
6 4028 abc 2
7 6000 def 3
8 4000 xyz 1
9 8048 abc 2
10 8056 abc 2
11 12000 def 3
Output from accepted answer:
DisID BunchData GroupID
8 4000 xyz 1
10 8056 abc 2
11 12000 def 3