Home > Back-end >  Find all rows in pandas-dataframe with duplicate row index
Find all rows in pandas-dataframe with duplicate row index

Time:08-10

Assume I have a dataframe similar to:

          A  B
0  [ab, cd]  1
1  [cd, de]  2
2  [de, ef]  1
3      [gh]  1

I am now interested in finding all "duplicate" entries when splitting it up by using .explode("A"). Using explode() I get the following output:

    A  B
0  ab  1
0  cd  1
1  cd  2
1  de  2
2  de  1
2  ef  1
3  gh  1

I would like to get the position of all rows where the index is more than once (i.e. the rows with indices [0, 1, 2]). One option would be of course to iterate over all rows and check if I get multiple rows when calling df.loc[i], but I could imagine this being very slow. Are there faster in-built options, ideally mapping the returned rows to the "real" row indices?

I.e. calling df.loc[0] would return

    A  B
0  ab  1
0  cd  1

and in addition to that, I would be interested in the "real" indices, here being [0, 1]. How could I achieve that?

CodePudding user response:

Adding to Jezrael's good answer, you can always use reset_index() and get what you call "real" index.

df = df.explode("A").reset_index().rename(columns={'index':'Old Index'})

Outputting:

   Old Index   A  B
0          0  ab  1
1          0  cd  1
2          1  cd  2
3          1  de  2
4          2  de  1
5          2  ef  1
6          3  gh  1

And you can then filter easily by using:

uniques = df.drop_duplicates('Old Index',keep=False)
duplicated = df.drop(uniques.index)

CodePudding user response:

For duplicated indices check them by Index.duplicated in boolean indexing:

df1 = df.explode("A")
#first duplicated index
idx = df1.index[df1.index.duplicated()]
print (idx)
Int64Index([0, 1, 2], dtype='int64')

#all duplicated index
idx = df1.index[df1.index.duplicated(keep=False)]
print (idx)
Int64Index([0, 0, 1, 1, 2, 2], dtype='int64')

If need default index values:

m = df1.index.duplicated(keep=False)
df2 = df1.reset_index(drop=True)
print (df2)
    A  B
0  ab  1
1  cd  1
2  cd  2
3  de  2
4  de  1
5  ef  1
6  gh  1

With filtering by duplicated index of df1.index:

print(df2[m])
    A  B
0  ab  1
1  cd  1
2  cd  2
3  de  2
4  de  1
5  ef  1

print(df2[~m])
    A  B
6  gh  1
  • Related