Home > front end >  Find indexes of duplicates in each column Pandas dataframe
Find indexes of duplicates in each column Pandas dataframe

Time:05-25

I would group the duplicate values in each column of a Pandas Dataframe as follows:

import pandas as pd

ls = [[0,'A',2],
      [2,'B',1],
      [1,'A',3],
      [1,'C',2],
      [1,'D',3]]

df = pd.DataFrame(ls)
print(df)

results = []
for i in range(len(ls[0])):
    duplicates = {}
    for e in range(len(df[i])):
        key = df[i][e]
        if key in duplicates:
            duplicates[key].append(e)
        else:
            tmp = [e]
            duplicates[key] = tmp
    results.append(duplicates)

print(results)
#Output:
# [
# {0: [0], 2: [1], 1: [2, 3, 4]}, 
# {'A': [0, 2], 'B': [1], 'C': [3], 'D': [4]}, 
# {2: [0, 3], 1: [1], 3: [2, 4]}
# ]

The problem is that the number of rows in my dataframe can be large, i.e. 1M-2M, and the method becomes inefficient. Can you suggest an efficient approach to improve the performance of this solution? Does Panda or Numpy have methods that allow you to efficiently get a similar result?

CodePudding user response:

If ony few columns you can use list comprehension for aggregate values of index converted to Series by Index.to_series:

idx = df.index.to_series()
d = [idx.groupby(df[c], sort=False).agg(list).to_dict() for c in df.columns]
print (d)
[{0: [0], 2: [1], 1: [2, 3, 4]}, 
 {'A': [0, 2], 'B': [1], 'C': [3], 'D': [4]},
 {2: [0, 3], 1: [1], 3: [2, 4]}]

Or if many columns reshape by DataFrame.melt, aggregate lsit and last flatten MultiIndex Series:

s = (df.melt(ignore_index=False)
       .reset_index()
       .groupby(['variable','value'], sort=False)['index']
       .agg(list))

d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}

Simpliest is solution inspired another answer:

d = [df.groupby(c, sort=False).groups for c in df.columns] 

CodePudding user response:

You can go with groupby in pandas which is quite easy and simple

result = []
for column in range(len(df.columns)):
    gp = df.groupby(column)
    result.append(gp.groups)
print(result)

the corresponding output is

[{0: [0], 1: [2, 3, 4], 2: [1]}, {'A': [0, 2], 'B': [1], 'C': [3], 'D': [4]}, {1: [1], 2: [0, 3], 3: [2, 4]}]
  • Related