Pandas drop duplicates only for main index

Time:11-21

I have a MultiIndex and I want to perform drop_duplicates on a per-level basis: I don't want to compare rows across the entire DataFrame, only to drop a row if it duplicates another row with the same main index.

Example:

entry subentry    A    B
1     0         1.0  1.0
      1         1.0  1.0
      2         2.0  2.0
2     0         1.0  1.0
      1         2.0  2.0
      2         2.0  2.0

should return:

entry subentry    A    B
1     0         1.0  1.0
      1         2.0  2.0
2     0         1.0  1.0
      1         2.0  2.0
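
For anyone who wants to run the answers below, the input frame can be rebuilt roughly like this. It is only a sketch: the level names entry and subentry and the column names A and B are taken from the example header, and the integer index labels are an assumption.

```python
import pandas as pd

# Two entries, each with subentries 0, 1, 2 (integer labels are assumed).
index = pd.MultiIndex.from_product(
    [[1, 2], [0, 1, 2]], names=["entry", "subentry"]
)

# Values copied from the example table in the question.
df = pd.DataFrame(
    {
        "A": [1.0, 1.0, 2.0, 1.0, 2.0, 2.0],
        "B": [1.0, 1.0, 2.0, 1.0, 2.0, 2.0],
    },
    index=index,
)
print(df)
```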

CodePudding user response:

Use MultiIndex.get_level_values with Index.duplicated to filter out the last row per entry using boolean indexing:

df1 = df[df.index.get_level_values('entry').duplicated(keep='last')]
print (df1)

                  A    B
entry subentry          
1     0         1.0  1.0
      1         1.0  1.0
2     0         1.0  1.0
      1         2.0  2.0
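
To see what that mask contains, here is a small self-contained sketch. The frame is rebuilt from the question's example (integer labels assumed). Note that this mask looks only at the entry labels, not at the values in A and B.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [[1, 2], [0, 1, 2]], names=["entry", "subentry"]
)
df = pd.DataFrame(
    {"A": [1.0, 1.0, 2.0, 1.0, 2.0, 2.0], "B": [1.0, 1.0, 2.0, 1.0, 2.0, 2.0]},
    index=index,
)

# One label per row, taken from the first index level.
entries = df.index.get_level_values("entry")

# With keep='last', every occurrence except the last one is flagged True,
# so the mask is True for all rows but the last row of each entry.
mask = entries.duplicated(keep="last")
print(mask)

# Boolean indexing keeps the flagged rows, i.e. drops the last row per entry.
df1 = df[mask]
print(df1)
```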

Or, if you need to remove duplicates per first level and columns, convert the first level to a column with DataFrame.reset_index. To filter, invert the boolean mask with ~ and convert the Series to a NumPy array, because the indices of the mask and the original DataFrame do not match:

df2 = df[~df.reset_index(level=0).duplicated(keep='last').to_numpy()]
print (df2)

                  A    B
entry subentry          
1     1         1.0  1.0
      2         2.0  2.0
2     0         1.0  1.0
      2         2.0  2.0

Or create a helper column from the first level of the MultiIndex:

df2 = df[~df.assign(new=df.index.get_level_values('entry')).duplicated(keep='last')]
print (df2)

                  A    B
entry subentry          
1     1         1.0  1.0
      2         2.0  2.0
2     0         1.0  1.0
      2         2.0  2.0

Details:

print (df.reset_index(level=0))
          entry    A    B
subentry                 
0             1  1.0  1.0
1             1  1.0  1.0
2             1  2.0  2.0
0             2  1.0  1.0
1             2  2.0  2.0
2             2  2.0  2.0

print (~df.reset_index(level=0).duplicated(keep='last'))
0    False
1     True
2     True
0     True
1    False
2     True
dtype: bool

print (df.assign(new=df.index.get_level_values('entry')))
                  A    B  new
entry subentry               
1     0         1.0  1.0    1
      1         1.0  1.0    1
      2         2.0  2.0    1
2     0         1.0  1.0    2
      1         2.0  2.0    2
      2         2.0  2.0    2
      
print (~df.assign(new=df.index.get_level_values('entry')).duplicated(keep='last'))
entry  subentry
1      0           False
       1            True
       2            True
2      0            True
       1           False
       2            True
dtype: bool
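
The difference between the two masks can also be checked directly. The sketch below (frame rebuilt from the question, integer labels assumed) compares each mask's index with the original one, which is why only the reset_index variant needs to_numpy:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [[1, 2], [0, 1, 2]], names=["entry", "subentry"]
)
df = pd.DataFrame(
    {"A": [1.0, 1.0, 2.0, 1.0, 2.0, 2.0], "B": [1.0, 1.0, 2.0, 1.0, 2.0, 2.0]},
    index=index,
)

# Mask built after reset_index: its index is only the 'subentry' level.
reset_mask = df.reset_index(level=0).duplicated(keep="last")
reset_aligned = reset_mask.index.equals(df.index)

# Mask built with a helper column: the MultiIndex is left untouched.
helper_mask = df.assign(new=df.index.get_level_values("entry")).duplicated(keep="last")
helper_aligned = helper_mask.index.equals(df.index)

print(reset_aligned, helper_aligned)

# Both masks flag the same rows, position by position.
same_rows = bool((reset_mask.to_numpy() == helper_mask.to_numpy()).all())
print(same_rows)
```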

CodePudding user response:

It looks like you want to drop_duplicates per group:

out = df.groupby(level=0, group_keys=False).apply(lambda d: d.drop_duplicates())

Or, a possibly more efficient variant using a temporary reset_index with duplicated and boolean indexing:

out = df[~df.reset_index('entry').duplicated().values]

Output:

                  A    B
entry subentry          
1     0         1.0  1.0
      2         2.0  2.0
2     0         1.0  1.0
      1         2.0  2.0
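
If this is needed in more than one place, the reset_index variant could be wrapped in a small helper. The function name drop_duplicates_within_level and its parameters are hypothetical, not part of pandas. The sketch assumes the frame has no column with the same name as the index level.

```python
import pandas as pd


def drop_duplicates_within_level(df, level="entry", keep="first"):
    """Drop rows that repeat another row's column values within one index level.

    Hypothetical helper: moves `level` into the columns so that duplicated()
    compares (level, A, B, ...) tuples, then filters the original frame.
    """
    mask = df.reset_index(level=level).duplicated(keep=keep).to_numpy()
    return df[~mask]


index = pd.MultiIndex.from_product(
    [[1, 2], [0, 1, 2]], names=["entry", "subentry"]
)
df = pd.DataFrame(
    {"A": [1.0, 1.0, 2.0, 1.0, 2.0, 2.0], "B": [1.0, 1.0, 2.0, 1.0, 2.0, 2.0]},
    index=index,
)

out_first = drop_duplicates_within_level(df)              # keep first occurrence
out_last = drop_duplicates_within_level(df, keep="last")  # keep last occurrence
print(out_first)
print(out_last)
```

With keep="first" the surviving rows match the output shown above. With keep="last" they match the keep='last' results from the first answer.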