Suppose you have a column like the one below:
Column1 |
---|
adfghb, gad |
234rwfa |
ballbalba |
9adfad9, 5432 |
99a |
Expected output:
list1 = ["adfghb", "gad", "234rwfa", "ballbalba", "9adfad9", "5432", "99a"]
The column contains only strings. I need efficient code because the actual column is very large. I tried a for loop, but it takes far too long.
CodePudding user response:
You can use plain Python str methods outside of Pandas: join all rows into one string, then split it once:
>>> ', '.join(df['Column1']).split(', ')
['adfghb', 'gad', '234rwfa', 'ballbalba', '9adfad9', '5432', '99a']
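Note that the one-liner above assumes every row is a string and every separator is exactly a comma followed by one space. For messier data, a more defensive version could look like the sketch below. The name flatten_column is hypothetical (it is not part of pandas), and skipping non-string entries is an assumption about how missing values should be handled.

```python
def flatten_column(values, sep=","):
    """Split each string on `sep`, strip whitespace, and flatten into one list.

    Hypothetical helper: non-string entries (None, NaN) and empty pieces
    are skipped, which is an assumption about the desired behaviour.
    """
    return [
        part.strip()
        for value in values
        if isinstance(value, str)      # skip None / float NaN entries
        for part in value.split(sep)
        if part.strip()                # drop empty pieces such as "a,,b"
    ]


if __name__ == "__main__":
    column = ["adfghb, gad", "234rwfa", None, "9adfad9,5432 ", "99a"]
    print(flatten_column(column))
```

Since a pandas Series of strings is iterable, df['Column1'] could be passed in place of the plain list. This version does more work per row, so expect it to be slower than the join/split one-liner.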
Performance
For 25,000 rows:
# @MayankPorwal
%timeit df['Column1'].str.split(', ').explode().tolist()
9.99 ms ± 85.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# @jezrael
%timeit [y for x in df['Column1'] for y in x.split(', ')]
4.25 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# @Corralien
%timeit ', '.join(df['Column1']).split(', ')
2.24 ms ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
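The timings above come from IPython's %timeit. To rerun the comparison outside IPython, a rough sketch with the standard-library timeit module might look like this. The function names and the synthetic 25,000-row sample are assumptions for illustration, and a plain list stands in for df['Column1']. The explode variant is left out because it needs pandas.

```python
import timeit


def split_by_comprehension(rows):
    # Split each row separately and flatten the pieces.
    return [part for row in rows for part in row.split(', ')]


def split_by_join(rows):
    # Join all rows into one big string, then split once.
    return ', '.join(rows).split(', ')


def compare(rows, number=20):
    """Return total seconds taken by `number` runs of each approach."""
    return {
        'comprehension': timeit.timeit(lambda: split_by_comprehension(rows), number=number),
        'join_split': timeit.timeit(lambda: split_by_join(rows), number=number),
    }


if __name__ == '__main__':
    # Synthetic data: the five sample rows repeated to 25,000 rows.
    sample = ['adfghb, gad', '234rwfa', 'ballbalba', '9adfad9, 5432', '99a'] * 5000
    assert split_by_comprehension(sample) == split_by_join(sample)
    for name, seconds in compare(sample).items():
        print(f'{name}: {seconds:.4f} s')
```

Absolute numbers will vary by machine and by how long the strings are, so only the relative ordering is worth comparing.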
CodePudding user response:
Use Series.str.split with Series.explode:
In [1044]: l = df['Column1'].str.split(', ').explode().tolist()
In [1045]: l
Out[1045]: ['adfghb', 'gad', '234rwfa', 'ballbalba', '9adfad9', '5432', '99a']