I have a Pandas dataframe:
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 5, size=(10, 1)), columns=['col0'])
ie
col0
0 3
1 4
2 2
3 4
4 4
5 1
6 2
7 2
8 2
9 4
I would like to get a column indicating in each row the indeces of all the rows with the same value as the given row. I do:
df = df.assign(sameas = df.col0.apply(lambda val: [i for i, e in enumerate(df.col0) if e==val]))
I get:
col0 sameas
0 3 [0]
1 4 [1, 3, 4, 9]
2 2 [2, 6, 7, 8]
3 4 [1, 3, 4, 9]
4 4 [1, 3, 4, 9]
5 1 [5]
6 2 [2, 6, 7, 8]
7 2 [2, 6, 7, 8]
8 2 [2, 6, 7, 8]
9 4 [1, 3, 4, 9]
Which is the expected result. In my real world application, the df
is much bigger, and this method does not complete in required time.
I think the runtime scales with the square of the number of rows, which is bad. How can I do the above computation faster?
CodePudding user response:
You can just do groupby
with transform
df['new'] = df.reset_index().groupby('col0')['index'].transform(lambda x : [x.tolist()]*len(x)).values
Out[146]:
0 [0]
1 [1, 3, 4, 9]
2 [2, 6, 7, 8]
3 [1, 3, 4, 9]
4 [1, 3, 4, 9]
5 [5]
6 [2, 6, 7, 8]
7 [2, 6, 7, 8]
8 [2, 6, 7, 8]
9 [1, 3, 4, 9]
Name: index, dtype: object
CodePudding user response:
You can try to groupby col0
and convert the grouped index to list
df['sameas'] = df['col0'].map(df.reset_index().groupby('col0')['index'].apply(list))
print(df)
col0 sameas
0 3 [0]
1 4 [1, 3, 4, 9]
2 2 [2, 6, 7, 8]
3 4 [1, 3, 4, 9]
4 4 [1, 3, 4, 9]
5 1 [5]
6 2 [2, 6, 7, 8]
7 2 [2, 6, 7, 8]
8 2 [2, 6, 7, 8]
9 4 [1, 3, 4, 9]
CodePudding user response:
try:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 5, size=(10, 1)), columns=['col0'])
df
'''
col0
0 3
1 4
2 2
3 4
4 4
5 1
6 2
7 2
8 2
9 4
'''
get a Series as mapping:
ser = df.groupby('col0').apply(lambda x: x.index.to_list())
ser
col0
1 [5]
2 [2, 6, 7, 8]
3 [0]
4 [1, 3, 4, 9]
dtype: object
use mapping:
df.assign(col1=df.col0.map(ser))
'''
col0 col1
0 3 [0]
1 4 [1, 3, 4, 9]
2 2 [2, 6, 7, 8]
3 4 [1, 3, 4, 9]
4 4 [1, 3, 4, 9]
5 1 [5]
6 2 [2, 6, 7, 8]
7 2 [2, 6, 7, 8]
8 2 [2, 6, 7, 8]
9 4 [1, 3, 4, 9]
'''
CodePudding user response:
On-liner method:
df['col1'] = [df[df.col0.values == i].index.tolist()for i in df.col0.values]
df
Output:
index | col0 | col1 |
---|---|---|
0 | 3 | 0 |
1 | 4 | 1,3,4,9 |
2 | 2 | 2,6,7,8 |
3 | 4 | 1,3,4,9 |
4 | 4 | 1,3,4,9 |
5 | 1 | 5 |
6 | 2 | 2,6,7,8 |
7 | 2 | 2,6,7,8 |
8 | 2 | 2,6,7,8 |
9 | 4 | 1,3,4,9 |