Assume I have a dataframe similar to:
          A  B
0  [ab, cd]  1
1  [cd, de]  2
2  [de, ef]  1
3      [gh]  1
I am now interested in finding all "duplicate" entries after splitting it up with .explode("A"). Using explode() I get the following output:
    A  B
0  ab  1
0  cd  1
1  cd  2
1  de  2
2  de  1
2  ef  1
3  gh  1
I would like to get the positions of all rows whose index appears more than once (i.e. the rows with indices [0, 1, 2]). One option would of course be to iterate over all rows and check whether df.loc[i] returns multiple rows, but I imagine this would be very slow. Are there faster built-in options, ideally mapping the returned rows to the "real" row indices?
I.e. calling df.loc[0] would return
    A  B
0  ab  1
0  cd  1
and in addition to that, I would be interested in the "real" indices, here being [0, 1]. How could I achieve that?
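For reference, the example frame can be reconstructed with a minimal sketch like this (assuming pandas is imported as pd):
import pandas as pd

# the example frame from the question above
df = pd.DataFrame({"A": [["ab", "cd"], ["cd", "de"], ["de", "ef"], ["gh"]],
                   "B": [1, 2, 1, 1]})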
CodePudding user response:
Adding to Jezrael's good answer, you can always use reset_index() and get what you call the "real" index.
df = df.explode("A").reset_index().rename(columns={'index': 'Old Index'})
Outputting:
   Old Index   A  B
0          0  ab  1
1          0  cd  1
2          1  cd  2
3          1  de  2
4          2  de  1
5          2  ef  1
6          3  gh  1
And you can then filter easily with:
uniques = df.drop_duplicates('Old Index', keep=False)  # rows whose old index occurs exactly once
duplicated = df.drop(uniques.index)                    # the remaining, duplicated rows
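If you also want the mapping from each old index to its positional rows (the "real" indices from the question), one possible sketch building on the frame above:
# groupby on 'Old Index'; .groups maps each old label to the positional
# row numbers of the exploded frame, roughly {0: [0, 1], 1: [2, 3], 2: [4, 5]}
positions = duplicated.groupby('Old Index').groups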
CodePudding user response:
For duplicated indices, check them with Index.duplicated in boolean indexing:
df1 = df.explode("A")
# duplicated index values, excluding each first occurrence
idx = df1.index[df1.index.duplicated()]
print (idx)
Int64Index([0, 1, 2], dtype='int64')
# all duplicated index values, including the first occurrences
idx = df1.index[df1.index.duplicated(keep=False)]
print (idx)
Int64Index([0, 0, 1, 1, 2, 2], dtype='int64')
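If you need the positional row numbers rather than the index labels, a small numpy-based sketch (an addition, not part of the original answer):
import numpy as np
# positional ("real") row numbers of all rows with a duplicated index label
pos = np.where(df1.index.duplicated(keep=False))[0]
print (pos)
[0 1 2 3 4 5]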
If you need default index values:
m = df1.index.duplicated(keep=False)  # boolean mask of duplicated index labels
df2 = df1.reset_index(drop=True)      # default RangeIndex = positional rows
print (df2)
    A  B
0  ab  1
1  cd  1
2  cd  2
3  de  2
4  de  1
5  ef  1
6  gh  1
Then filter with the mask m built from the duplicated df1.index:
print(df2[m])
    A  B
0  ab  1
1  cd  1
2  cd  2
3  de  2
4  de  1
5  ef  1
print(df2[~m])
    A  B
6  gh  1
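As a closing sketch (an addition using df1, df2 and m from above), the original index labels can be mapped straight to their positional rows:
# group the positional rows of df2 by their original index label in df1
groups = df2[m].groupby(df1.index[m]).groups
# roughly {0: [0, 1], 1: [2, 3], 2: [4, 5]}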