Make a column of json list to separate columns of values-CodePudding

I have the following dataframe where one of the columns is a json list and has nan values.

     ID | random_column
    ---------------
     1 | "[{'key1': 'abc', 'key2': 'xyz', 'key3':'pqr'}, {'key1': 'abc2', 'key2': 'xyz2', 'key3':'pqr2'}, {'key1': 'abc3', 'key2': 'xyz3', 'key3':'pqr3'}]"
     2 | "[{'key1': 'abc', 'key2': 'xyz', 'key3':'pqr'}]"
     3 | "[{'key1': 'abc', 'key2': 'xyz', 'key3':'pqr'}, {'key1': 'abc', 'key2': 'xyz', 'key3':'pqr'}]"
     4 | nan
     5 | nan
     6 | nan

I want to extract only few fields from the json and make them a separate column in alist format

     ID | key1_list               | Key3_list
    -------------------------------
     1 | "['abc', 'abc2','abc3']" | "['pqr', 'pqr2','pqr3']"
     2 | "['abc']"                | "['pqr']"
     3 | "['abc','abc']"          | "['pqr', 'pqr']"
     4 | "[]"                     | "[]"
     5 | "[]"                     | "[]"
     6 | "[]"                     | "[]"

How do i proceed with this using pandas and python. Thanks

CodePudding user response：

Use custom function only for non missing values for convert strings repr to list of dicts by ast.literal_eval, then create dictionary of lists with Series constructor, last add empty lists by DataFrame.reindex and join to original ID column by DataFrame.join:

import ast

#if nan are missing values
m = df['random_column'].notna()

#if nan are strings 'nan'
m = df['random_column'].eq('nan')


need = ['key1','key3']
def f(x):
    D = ast.literal_eval(x)
    return pd.Series({k: [dic[k] for dic in D] for k in D[0] if k in need}).add_suffix('_list')

df = df[['ID']].join(df.loc[m, 'random_column'].apply(f).reindex(df.index, fill_value=[]))
print (df)
   ID          key1_list          key3_list
0   1  [abc, abc2, abc3]  [pqr, pqr2, pqr3]
1   2              [abc]              [pqr]
2   3         [abc, abc]         [pqr, pqr]
3   4                 []                 []
4   5                 []                 []
5   6                 []                 []

Another solution:

need = ['key1','key3']
suffix = '_list'

def f(x):
    try:
        D = ast.literal_eval(x)
        return pd.Series({k: [dic[k] for dic in D] for k in D[0] if k in need})
    except:
        return pd.Series([[]] * len(need), index=need)

df = df[['ID']].join(df['random_column'].apply(f).add_suffix(suffix))
print (df)
   ID          key1_list          key3_list
0   1  [abc, abc2, abc3]  [pqr, pqr2, pqr3]
1   2              [abc]              [pqr]
2   3         [abc, abc]         [pqr, pqr]
3   4                 []                 []
4   5                 []                 []
5   6                 []                 []

CodePudding user response：

You can use ast.literal_eval to parse the strings, then use dict.get method to get the "key1" and "key3" values for each dict, unpack and zip, map to lists and finally cast to a DataFrame constructor:

import ast    
s = df['random_column'].apply(lambda x: list(map(list, zip(*[(d.get('key1', []), d.get('key3', [])) 
                                                             for d in ast.literal_eval(x)]))) 
                              if pd.notna(x) else [[], []])
df[['key1_list', 'key3_list']] = pd.DataFrame(s.tolist()).astype(str)
df = df.drop(columns='random_column')

Output:

   ID                key1_list                key3_list
0   1  ['abc', 'abc2', 'abc3']  ['pqr', 'pqr2', 'pqr3']
1   2                  ['abc']                  ['pqr']
2   3           ['abc', 'abc']           ['pqr', 'pqr']
3   4                       []                       []
4   5                       []                       []
5   6                       []                       []

Note that what you have is not valid json since json strings begin and end with quotation mark ".