Home > Enterprise >  Convert comma separated string type series into a list of integers pandas
Convert comma separated string type series into a list of integers pandas

Time:10-06

input csv file screenshot

My pandas dataframe has a column(series) with comma separated string values looking like this:

col_1,col_2,col_3
abc,123,49
bcd,234,"39,48"

I want to convert this col_3 into a list of integers. I tried

df["col_3"]=[[i] for i in df["col_3"]]

But I end up getting this result

col_1,col_2,col_3
abc,123,['49']
bcd,234,"['39,48']"

The desired output is

col_1,col_2,col_3
abc,123,[49]
bcd,234,[39,48]

I also want it to be fast because I would be using this for batches of 100k rows. Can some one suggest a solution . TIA

    col_1       col_2    col_3
0   F18  9994      495
1   C35  9995  390,483
3   F60  9998  497,468

CodePudding user response:

You can use pandas.Series.str.split with a list comprehension :

df['col_3'] = [[int(e) for e in x.split(",")] for x in df['col_3']]

# Output :

print(df)

  col_1  col_2     col_3
0   abc    123      [49]
1   bcd    234  [39, 48]

print(type(df.loc[0, 'col_3'][0]))
int

# Edit :

If you need to save your df to a .csv, there will be always double quotes between all the lists of integers (with a length > 1) since the elements of a list is separated by a comma and the default separator of pandas.DataFrame.to_csv is the comma as well. Double quotes are there to escape the commas inside these lists. So, to get rid of the double quotes, you have to choose another separator like the tab for example :

df.to_csv(r'path_to_your_new_csv', sep='\t')

CodePudding user response:

I am piggybacking on abokey's answer because they are already on the right track. Please let me know my dummy dataframe is not in your required format.

import pandas as pd
df = pd.DataFrame({"col1":["F18", "C35"], "col2":[9994, 9995], "col3":[43, '"39,48"']})

Using str.split would seperate your string elements into a list of substrings, based on your delimiter. he benefir of the appraoch is the whole string is put in a list, when the delimiter cannot be found as in df.loc[0, "col3"]. Afterwards, we apply list comprehension to change dtype of each element from str to int since it is your preferred format.

df["col3"] = df["col3"].apply(lambda x: [int(i) for i in str(x).strip('"').split(",")])

Edit: I updated the values based on your screenshot. use tqdm to see the progress when you are running on 100k rows.

!pip install tqdm
from tqdm import tqdm
tqdm.pandas()
df["col3"] = df["col3"].progress_apply(lambda x: [int(i) for i in str(x).strip('"').split(",")])

I can recommend pandarallel to parallel process through rows but it does not always work in my experience.

!pip install pandarallel
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True, nb_workers=4)
df["col3"] = df["col3"].parallel_apply(lambda x: [int(i) for i in str(x).strip('"').split(",")])
  • Related