I would like to group the duplicate values in each column of a Pandas DataFrame as follows:
import pandas as pd

ls = [[0, 'A', 2],
      [2, 'B', 1],
      [1, 'A', 3],
      [1, 'C', 2],
      [1, 'D', 3]]
df = pd.DataFrame(ls)
print(df)

results = []
for i in range(len(ls[0])):          # one dict per column
    duplicates = {}
    for e in range(len(df[i])):      # map each value to the row positions where it occurs
        key = df[i][e]
        if key in duplicates:
            duplicates[key].append(e)
        else:
            tmp = [e]
            duplicates[key] = tmp
    results.append(duplicates)
print(results)
#Output:
# [
# {0: [0], 2: [1], 1: [2, 3, 4]},
# {'A': [0, 2], 'B': [1], 'C': [3], 'D': [4]},
# {2: [0, 3], 1: [1], 3: [2, 4]}
# ]
The problem is that the number of rows in my dataframe can be large, e.g. 1M-2M, and this method becomes inefficient. Can you suggest an efficient approach to improve the performance of this solution? Do Pandas or NumPy have methods that allow you to efficiently get a similar result?
CodePudding user response:
If there are only a few columns, you can use a list comprehension that aggregates the values of the index (converted to a Series by Index.to_series) into lists per column:
idx = df.index.to_series()
d = [idx.groupby(df[c], sort=False).agg(list).to_dict() for c in df.columns]
print (d)
[{0: [0], 2: [1], 1: [2, 3, 4]},
{'A': [0, 2], 'B': [1], 'C': [3], 'D': [4]},
{2: [0, 3], 1: [1], 3: [2, 4]}]
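As a quick sanity check, this list of dicts should match the results produced by the loop in the question, because the default RangeIndex makes the index labels identical to the row positions. A minimal sketch, assuming results from the question is still in scope:

# Both structures are lists of {value: [row positions]} dicts, so plain
# equality works; dict comparison ignores key order.
assert d == results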
Or, if there are many columns, reshape by DataFrame.melt, aggregate lists, and last flatten the MultiIndex Series into a dictionary per column:
s = (df.melt(ignore_index=False)
       .reset_index()
       .groupby(['variable','value'], sort=False)['index']
       .agg(list))
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
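Note that d here is a dictionary keyed by column name rather than a list. If the list-of-dicts shape from the question is wanted, a small conversion is enough (a sketch, assuming d from the snippet above):

# Reorder the per-column dictionaries into a list, in column order.
d_as_list = [d[c] for c in df.columns]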
Simplest is a solution inspired by the other answer:
d = [df.groupby(c, sort=False).groups for c in df.columns]
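One caveat, depending on the pandas version in use: .groups maps each key to an Index of row labels rather than a plain Python list, so an extra conversion may be needed if the exact output format from the question matters. A small sketch:

# Turn the Index objects into plain lists to mirror the question's output.
d_lists = [{k: list(v) for k, v in df.groupby(c, sort=False).groups.items()}
           for c in df.columns]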
CodePudding user response:
You can go with groupby in pandas, which is quite easy and simple:
result = []
for column in range(len(df.columns)):
    gp = df.groupby(column)
    result.append(gp.groups)
print(result)
The corresponding output is:
[{0: [0], 1: [2, 3, 4], 2: [1]}, {'A': [0, 2], 'B': [1], 'C': [3], 'D': [4]}, {1: [1], 2: [0, 3], 3: [2, 4]}]
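For the 1M-2M row scale mentioned in the question, the groupby-based versions should be substantially faster than the pure-Python loop. A rough, self-contained benchmark sketch (sizes and value ranges here are illustrative assumptions, not taken from the answers):

import numpy as np
import pandas as pd
from timeit import timeit

# Build a frame closer to the scale in the question.
big = pd.DataFrame({0: np.random.randint(0, 1000, 1_000_000),
                    1: np.random.randint(0, 1000, 1_000_000)})

def loop_version(frame):
    # The question's approach: a pure-Python pass over each column.
    out = []
    for c in frame.columns:
        dup = {}
        for pos, key in enumerate(frame[c]):
            dup.setdefault(key, []).append(pos)
        out.append(dup)
    return out

def groupby_version(frame):
    # The groupby approach from the answers above.
    return [frame.groupby(c, sort=False).groups for c in frame.columns]

print(timeit(lambda: loop_version(big), number=1))
print(timeit(lambda: groupby_version(big), number=1))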