Get a column of list of indices showing which rows are equal in a Pandas Dataframe-CodePudding

I have a Pandas dataframe:

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 5, size=(10, 1)), columns=['col0'])

I would like to get a column indicating in each row the indeces of all the rows with the same value as the given row. I do:

df = df.assign(sameas = df.col0.apply(lambda val: [i for i, e in enumerate(df.col0) if e==val]))

I get:

   col0        sameas
0     3           [0]
1     4  [1, 3, 4, 9]
2     2  [2, 6, 7, 8]
3     4  [1, 3, 4, 9]
4     4  [1, 3, 4, 9]
5     1           [5]
6     2  [2, 6, 7, 8]
7     2  [2, 6, 7, 8]
8     2  [2, 6, 7, 8]
9     4  [1, 3, 4, 9]

Which is the expected result. In my real world application, the df is much bigger, and this method does not complete in required time.

I think the runtime scales with the square of the number of rows, which is bad. How can I do the above computation faster?

CodePudding user response：

You can just do groupby with transform

df['new'] = df.reset_index().groupby('col0')['index'].transform(lambda x : [x.tolist()]*len(x)).values
Out[146]: 
0             [0]
1    [1, 3, 4, 9]
2    [2, 6, 7, 8]
3    [1, 3, 4, 9]
4    [1, 3, 4, 9]
5             [5]
6    [2, 6, 7, 8]
7    [2, 6, 7, 8]
8    [2, 6, 7, 8]
9    [1, 3, 4, 9]
Name: index, dtype: object

CodePudding user response：

You can try to groupby col0 and convert the grouped index to list

df['sameas'] = df['col0'].map(df.reset_index().groupby('col0')['index'].apply(list))

print(df)

   col0        sameas
0     3           [0]
1     4  [1, 3, 4, 9]
2     2  [2, 6, 7, 8]
3     4  [1, 3, 4, 9]
4     4  [1, 3, 4, 9]
5     1           [5]
6     2  [2, 6, 7, 8]
7     2  [2, 6, 7, 8]
8     2  [2, 6, 7, 8]
9     4  [1, 3, 4, 9]

CodePudding user response：

try:

import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 5, size=(10, 1)), columns=['col0'])
df
'''
   col0
0     3
1     4
2     2
3     4
4     4
5     1
6     2
7     2
8     2
9     4
'''

get a Series as mapping:

ser = df.groupby('col0').apply(lambda x: x.index.to_list())
ser
col0
1             [5]
2    [2, 6, 7, 8]
3             [0]
4    [1, 3, 4, 9]
dtype: object

use mapping:

df.assign(col1=df.col0.map(ser))
'''
   col0          col1
0     3           [0]
1     4  [1, 3, 4, 9]
2     2  [2, 6, 7, 8]
3     4  [1, 3, 4, 9]
4     4  [1, 3, 4, 9]
5     1           [5]
6     2  [2, 6, 7, 8]
7     2  [2, 6, 7, 8]
8     2  [2, 6, 7, 8]
9     4  [1, 3, 4, 9]
'''

CodePudding user response：

On-liner method:

df['col1'] = [df[df.col0.values == i].index.tolist()for i in df.col0.values]
df

Output:

index	col0	col1
0	3	0
1	4	1,3,4,9
2	2	2,6,7,8
3	4	1,3,4,9
4	4	1,3,4,9
5	1	5
6	2	2,6,7,8
7	2	2,6,7,8
8	2	2,6,7,8
9	4	1,3,4,9