Convert a Pandas series with strings to a list for creating encoded variables

I have a Pandas Series whose values are lists stored as strings. I'd like to convert each string to a list and also handle any bad data. Here's some sample data:

df['vals']

"['A']"
"['B', 'C, ', 'D', 'E']"
"['G', 'H', 'L', 'P', 'A, T']"

type(df['vals'][1]) 
str

Expected output:

df['vals']

['A']
['B', 'C', 'D', 'E']
['G', 'H', 'L', 'P', 'A, T']

type(df['vals'][1]) 
list

In case of any errors due to bad data, drop the element from the list or skip the string entirely. My objective is to have the data stored as lists, so that I can use .explode() to extract elements from the list and create new encoded variables.

CodePudding user response：

There are some misplaced commas in your sample input, but assuming the strings in your actual data are valid Python list literals, use this:

import ast
df["vals"] = df["vals"].apply(ast.literal_eval)
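The question also asks for bad data to be dropped or skipped, which a bare ast.literal_eval will not do: it raises on a malformed string. A minimal sketch of one way to handle that is below, assuming "bad data" means either a string that is not a valid list literal or an element with stray commas or whitespace. The helper name parse_list and the extra malformed row are made up for illustration.

```python
import ast

import pandas as pd


def parse_list(text):
    """Parse a stringified list; return None when the input is not a valid list literal."""
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        return None  # bad data: skip the whole string
    if not isinstance(parsed, list):
        return None
    # Strip stray whitespace and commas at the ends of each element,
    # and drop anything that is not a string or ends up empty.
    cleaned = [item.strip(" ,") for item in parsed if isinstance(item, str)]
    return [item for item in cleaned if item]


df = pd.DataFrame({"vals": ["['A']",
                            "['B', 'C, ', 'D', 'E']",
                            "['G', 'H', 'L', 'P', 'A, T']",
                            "['X', "]})  # last row is deliberately malformed
df["vals"] = df["vals"].apply(parse_list)
df = df.dropna(subset=["vals"])  # remove rows that could not be parsed
```

With this approach 'C, ' becomes 'C', while 'A, T' is kept as a single element because only leading and trailing characters are stripped.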

CodePudding user response：

Assuming you don't have complex input, one option would be to use a regex.

Just keep the word characters for example:

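import pandas as pd

df = pd.DataFrame({'vals': ["['A']",
                            "['B', 'C, ', 'D', 'E']",
                            "['G', 'H', 'L', 'P', 'A, T']"]})
# keep every run of word characters
df['vals'].str.findall(r'\w+')

# or: keep every run of characters that are not brackets, commas, quotes, or spaces
# df['vals'].str.findall(r"[^\[\],' ]+")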

output:

0                   [A]
1          [B, C, D, E]
2    [G, H, L, P, A, T]
Name: vals, dtype: object

output (last row only): ['G', 'H', 'L', 'P', 'A', 'T']
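Once the column holds real lists, the encoding step mentioned in the question could look roughly like the sketch below. It assumes one-hot (indicator) columns are what is wanted and uses .explode() followed by pd.get_dummies; the variable names exploded and encoded are illustrative only.

```python
import pandas as pd

# Assumes the strings have already been converted to lists.
df = pd.DataFrame({"vals": [["A"],
                            ["B", "C", "D", "E"],
                            ["G", "H", "L", "P", "A, T"]]})

# One row per list element; the original row index is repeated.
exploded = df["vals"].explode()

# One indicator column per distinct element, then collapse back to
# one row per original index (1 if the element appeared in that row).
encoded = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()
```

Here encoded has one row per original row and one column per distinct element, so it can be joined back to df on the index.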