Use ast.literal_eval on all columns of a Pandas DataFrame

I have a data frame that looks like the following:

Category            Class
==========================
['org1', 'org2']    A
['org2', 'org3']    B
org1                C
['org3', 'org4']    A
org2                A
['org2', 'org4']    B
...

When I read in this data using Pandas, the lists are read in as strings (e.g., dat['Category'][0][0] returns [ rather than org1). I have several columns like this. I want every categorical column that already contains at least one list to hold a list in every record. For example, the above data frame should look like the following:

Category            Class
==========================
['org1', 'org2']    A
['org2', 'org3']    B
['org1']            C
['org3', 'org4']    A
['org2']            A
['org2', 'org4']    B
...

Notice how the singular values in the Category column are now contained in lists. When I reference dat['Category'][0][0], I'd like org1 to be returned.

What is the best way to accomplish this? I was thinking of using ast.literal_eval with an apply and a lambda function, but I'd like to follow best practices if possible. Thanks in advance!

CodePudding user response：

You can do it like this:

from ast import literal_eval

df['Category'] = df['Category'].apply(lambda x: literal_eval(x) if x.startswith('[') else [x])
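
The one-liner assumes every value is a string and that every string starting with [ is a valid Python literal. A slightly more defensive variant could look like the sketch below. The helper name to_list is hypothetical, and the choices to wrap non-string values (such as NaN) unchanged and to fall back to wrapping when parsing fails are assumptions, not something the question specifies.

```python
import ast

import pandas as pd


def to_list(value):
    """Return value as a list, parsing list-like strings with ast.literal_eval."""
    # Values that are already lists need no work.
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        text = value.strip()
        if text.startswith('[') and text.endswith(']'):
            try:
                parsed = ast.literal_eval(text)
            except (ValueError, SyntaxError):
                # Assumption: keep malformed strings as a single-element list.
                return [value]
            return parsed if isinstance(parsed, list) else [parsed]
    # Plain strings and non-string values (e.g. NaN) are wrapped unchanged.
    return [value]


df = pd.DataFrame({"Category": ["['org1', 'org2']", "org1"], "Class": ["A", "B"]})
df['Category'] = df['Category'].apply(to_list)
```

With this version, a value such as "[org1, org2]" (unquoted names, which literal_eval rejects) stays in the column as a one-element list instead of raising an exception.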

CodePudding user response：

You could create a boolean mask of the values that need to be changed. If there are no lists, no change is needed. If there are lists, you can apply literal_eval or a list creation lambda to subsets of the data.

import ast
import pandas as pd

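def normalize_category(df):
    # Flag rows whose string looks like a list literal; missing values count as False.
    is_list = df['Category'].str.startswith('[', na=False)
    # If no row holds a list, the column is left unchanged.
    if is_list.any():
        # Parse the list-like strings into real lists.
        df.loc[is_list, 'Category'] = df.loc[is_list, 'Category'].apply(ast.literal_eval)
        # Wrap the remaining scalar values in single-element lists.
        df.loc[~is_list, 'Category'] = df.loc[~is_list, 'Category'].apply(lambda val: [val])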

df = pd.DataFrame({"Category":["['org1', 'org2']", "org1"], "Class":["A", "B"]})
normalize_category(df)
print(df)

df = pd.DataFrame({"Category":["org2", "org1"], "Class":["A", "B"]})
normalize_category(df)
print(df)
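
Since the question mentions several such columns, the same idea could be extended to loop over every column and convert only those that contain at least one list-like string. This is a minimal sketch under assumptions: the function name normalize_list_columns is hypothetical, it returns a copy rather than modifying the frame in place, and it treats any string starting with [ as a list literal.

```python
import ast

import pandas as pd


def looks_like_list(value):
    """True for strings that start with '[' (assumed to be list literals)."""
    return isinstance(value, str) and value.startswith('[')


def normalize_list_columns(df):
    """Return a copy where columns holding any list-like string contain only lists."""
    out = df.copy()
    for col in out.columns:
        # Skip columns that contain no list-like strings at all.
        if not out[col].map(looks_like_list).any():
            continue
        # Parse list-like strings; wrap every other value in a one-element list.
        out[col] = out[col].map(
            lambda v: ast.literal_eval(v) if looks_like_list(v) else [v]
        )
    return out


df = pd.DataFrame({
    "Category": ["['org1', 'org2']", "org1"],
    "Class": ["A", "B"],
})
result = normalize_list_columns(df)
```

Here the Class column is left untouched because none of its values looks like a list, while every record of Category becomes a list.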