I have a DataFrame that contains an array column and a string column:
| string_col | array_col |
|-------------|----------------------|
| fruits | ['apple', 'banaana'] |
| flowers | ['rose', 'sunflower']|
| animals | ['lion', 'tiger'] |
I want to repeat the string_col value once for each element in array_col, so that the output DataFrame looks like this:
| string_col | array_col | new_col |
|-------------|----------------------|----------------------|
| fruits | ['apple', 'banaana'] |['fruits', 'fruits'] |
| flowers | ['rose', 'sunflower']|['flowers', 'flowers']|
| animals | ['lion', 'tiger'] |['animals', 'animals']|
CodePudding user response:
Use a list comprehension to repeat each string by the length of the list in the same row:
df['new_col'] = [[a] * len(b) for a, b in zip(df['string_col'], df['array_col'])]
print (df)
string_col array_col new_col
0 fruits [apple, banaana] [fruits, fruits]
1 flowers [rose, sunflower] [flowers, flowers]
2 animals [lion, tiger] [animals, animals]
If the data is small and performance is not important, use DataFrame.apply:
df['new_col'] = df.apply(lambda x: [x['string_col']] * len(x['array_col']) , axis=1)
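For anyone who wants to try this from scratch, here is a minimal self-contained sketch of the two approaches above. The sample frame is rebuilt from the question's table; the names `by_comprehension`, `by_apply` and `same` are only illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data rebuilt from the question's table
df = pd.DataFrame({
    "string_col": ["fruits", "flowers", "animals"],
    "array_col": [["apple", "banaana"], ["rose", "sunflower"], ["lion", "tiger"]],
})

# Approach 1: list comprehension over the two columns in parallel
by_comprehension = [
    [s] * len(arr) for s, arr in zip(df["string_col"], df["array_col"])
]

# Approach 2: row-wise apply (slower, but arguably easier to read)
by_apply = df.apply(lambda row: [row["string_col"]] * len(row["array_col"]), axis=1)

df["new_col"] = by_comprehension

# Both approaches should give the same lists
same = by_apply.tolist() == by_comprehension
print(df)
print(same)
```

Since the repeated values are strings (immutable), it should not matter that `[s] * n` repeats references to the same object.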
#3k rows
df = pd.concat([df] * 1000, ignore_index=True)
In [311]: %timeit df['new_col'] = [[a] * len(b) for a, b in zip(df['string_col'], df['array_col'])]
1.94 ms ± 97.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [312]: %timeit df['new_col'] = df.apply(lambda x: [x['string_col']] * len(x['array_col']) , axis=1)
40.4 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [313]: %timeit df['new_col']=df[['string_col']].agg(list, axis=1)*df['array_col'].str.len()
132 ms ± 6.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
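The timings above come from IPython's %timeit. If you are not in IPython, a rough way to repeat the comparison for the first two approaches is the standard `timeit` module. This is only a sketch: the helper names `with_comprehension` and `with_apply` are made up for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Sample data from the question, repeated to 3k rows as in the timings above
base = pd.DataFrame({
    "string_col": ["fruits", "flowers", "animals"],
    "array_col": [["apple", "banaana"], ["rose", "sunflower"], ["lion", "tiger"]],
})
big = pd.concat([base] * 1000, ignore_index=True)


def with_comprehension():
    # Plain Python loop over the two columns in parallel
    return [[a] * len(b) for a, b in zip(big["string_col"], big["array_col"])]


def with_apply():
    # Row-wise apply; builds a Series object for every row
    return big.apply(lambda x: [x["string_col"]] * len(x["array_col"]), axis=1)


runs = 10
t_comp = timeit.timeit(with_comprehension, number=runs) / runs
t_apply = timeit.timeit(with_apply, number=runs) / runs

print(f"list comprehension: {t_comp * 1000:.2f} ms per run")
print(f"DataFrame.apply:    {t_apply * 1000:.2f} ms per run")
```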