Consider the following pandas DataFrame:
| ID | time | vector | other |
| ---- | ------- | ------------ | ----- |
| 0 | 81219 | [N1, N2, N3] | blue |
| 0 | 32323 | [N1, N4] | green |
| 1 | 423 | [N2, N3] | red |
| 1 | 4232 | [N1] | blue |
| 1 | 42424 | [N2, N2] | red |
| 2 | 42422 | [N3, N5] | blue |
I want to create a grouped DataFrame that, for each ID, has a vector column containing the concatenation of all vectors belonging to that ID.
Example of the result:
| ID | vector |
| ---- | -------------------- |
| 0 | [N1, N2, N3, N1, N4] |
| 1 | [N2, N3, N1, N2, N2] |
| 2 | [N3, N5] |
CodePudding user response:
One option could be to use groupby.sum:
out = df.groupby('ID', as_index=False)['vector'].sum()
However, if you have many lists, a repeated sum will be quadratic. In this case, prefer itertools.chain:
from itertools import chain
out = (df.groupby('ID', as_index=False)['vector']
         .agg(lambda x: list(chain.from_iterable(x)))
      )
output:
ID vector
0 0 [N1, N2, N3, N1, N4]
1 1 [N2, N3, N1, N2, N2]
2 2 [N3, N5]
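Another possibility, if you prefer to stay within pandas operations, is to flatten the lists with explode and rebuild them with agg(list). This is a sketch built on the same example data; note that explode drops rows whose list is empty, so an ID with only empty vectors would disappear from the result:

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 0, 1, 1, 1, 2],
                   'vector': [['N1', 'N2', 'N3'], ['N1', 'N4'], ['N2', 'N3'],
                              ['N1'], ['N2', 'N2'], ['N3', 'N5']]})

# explode turns each list element into its own row,
# then agg(list) collects the elements back into one list per ID
out = (df.explode('vector')
         .groupby('ID', as_index=False)['vector']
         .agg(list))
```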
timing
Let's use 10k repeats of the example and compare sum and itertools.chain:
df = pd.DataFrame({'ID': [0, 0, 1, 1, 1, 2],
'vector': [['N1', 'N2', 'N3'], ['N1', 'N4'], ['N2', 'N3'],
['N1'], ['N2', 'N2'], ['N3', 'N5']]})
df = pd.concat([df]*10000, ignore_index=True)
%%timeit
df.groupby('ID', as_index=False)['vector'].sum()
# this is very slow!
# 1.86 s ± 38.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
from itertools import chain
df.groupby('ID', as_index=False)['vector'].agg(lambda x: list(chain.from_iterable(x)))
# much faster!
# 8.02 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)