Concatenate list items with other column values-CodePudding

I came across a requirement where i need to concatenate list of items in column with other column values like below.

input:

     ColumnA ColumnB
0  [x, y, z]  Failed
1     [x, y]  Passed

output:

     ColumnA ColumnB                         ColumnC
0  [x, y, z]  Failed  [x:Failed, y:Failed, z:Failed]
1     [x, y]  Passed            [x:Passed, y:Passed]

Can some please let me know how can i achieve this using python and pandas.

CodePudding user response：

To use pandas only, you can explode, concatenate, and transform to list:

df['ColumnC'] = (df.explode('ColumnA')
                   .assign(ColumnC=lambda d: d['ColumnA'] ':' d['ColumnB'])
                   .groupby(level=0)['ColumnC'].apply(list)
                )

A faster solution would be to use itertools.product:

from itertools import product
df['ColumnC'] = df.apply(lambda r: list(map(':'.join, product(r['ColumnA'], [r['ColumnB']]))), axis=1)

output:

     ColumnA ColumnB                         ColumnC
0  [x, y, z]  Failed  [x:Failed, y:Failed, z:Failed]
1     [x, y]  Passed            [x:Passed, y:Passed]

CodePudding user response：

Use nested list comprehension with f-strings in performance is important:

df = pd.DataFrame({"ColumnA": [list('xyz'), list('xy')],
                   "ColumnB": ['Failed', 'Passed']})
    
df['ColumnC'] = [[f'{z}:{y}' for z in x] for x, y in df[['ColumnA', 'ColumnB']].to_numpy()]
print (df)
     ColumnA ColumnB                         ColumnC
0  [x, y, z]  Failed  [x:Failed, y:Failed, z:Failed]
1     [x, y]  Passed            [x:Passed, y:Passed]

Another idea, slowier is use apply:

df['ColumnC'] = df.apply(lambda x: [f'{z}:{x["ColumnB"]}' for z in x['ColumnA']], axis=1)

Slowiest is use explode with groupby.

#test for 20k rows
df = pd.concat([df] * 10000, ignore_index=True)
    

In [29]: %timeit (df.explode('ColumnA').assign(ColumnC=lambda d: d['ColumnA'] ':' d['ColumnB']).groupby(level=0)['ColumnC'].apply(list))
600 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [34]: %timeit df.apply(lambda r: list(map(':'.join, product(r['ColumnA'], [r['ColumnB']]))), axis=1)
268 ms ± 3.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [30]: %timeit [[f'{z}:{y}' for z in x] for x, y in df[['ColumnA', 'ColumnB']].to_numpy()]
36.4 ms ± 894 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [31]: %timeit df.apply(lambda x: [f'{z}:{x["ColumnB"]}' for z in x['ColumnA']], axis=1)
363 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)