I came across a requirement where i need to concatenate list of items in column with other column values like below.
input:
ColumnA ColumnB
0 [x, y, z] Failed
1 [x, y] Passed
output:
ColumnA ColumnB ColumnC
0 [x, y, z] Failed [x:Failed, y:Failed, z:Failed]
1 [x, y] Passed [x:Passed, y:Passed]
Can some please let me know how can i achieve this using python and pandas.
CodePudding user response:
To use pandas only, you can explode
, concatenate, and transform to list:
df['ColumnC'] = (df.explode('ColumnA')
.assign(ColumnC=lambda d: d['ColumnA'] ':' d['ColumnB'])
.groupby(level=0)['ColumnC'].apply(list)
)
A faster solution would be to use itertools.product
:
from itertools import product
df['ColumnC'] = df.apply(lambda r: list(map(':'.join, product(r['ColumnA'], [r['ColumnB']]))), axis=1)
output:
ColumnA ColumnB ColumnC
0 [x, y, z] Failed [x:Failed, y:Failed, z:Failed]
1 [x, y] Passed [x:Passed, y:Passed]
CodePudding user response:
Use nested list comprehension with f-string
s in performance is important:
df = pd.DataFrame({"ColumnA": [list('xyz'), list('xy')],
"ColumnB": ['Failed', 'Passed']})
df['ColumnC'] = [[f'{z}:{y}' for z in x] for x, y in df[['ColumnA', 'ColumnB']].to_numpy()]
print (df)
ColumnA ColumnB ColumnC
0 [x, y, z] Failed [x:Failed, y:Failed, z:Failed]
1 [x, y] Passed [x:Passed, y:Passed]
Another idea, slowier is use apply
:
df['ColumnC'] = df.apply(lambda x: [f'{z}:{x["ColumnB"]}' for z in x['ColumnA']], axis=1)
Slowiest is use explode
with groupby
.
#test for 20k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [29]: %timeit (df.explode('ColumnA').assign(ColumnC=lambda d: d['ColumnA'] ':' d['ColumnB']).groupby(level=0)['ColumnC'].apply(list))
600 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [34]: %timeit df.apply(lambda r: list(map(':'.join, product(r['ColumnA'], [r['ColumnB']]))), axis=1)
268 ms ± 3.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [30]: %timeit [[f'{z}:{y}' for z in x] for x, y in df[['ColumnA', 'ColumnB']].to_numpy()]
36.4 ms ± 894 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [31]: %timeit df.apply(lambda x: [f'{z}:{x["ColumnB"]}' for z in x['ColumnA']], axis=1)
363 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)