I have the following dataframe where one of the columns is a json list and has nan values.
ID | random_column
---------------
1 | "[{'key1': 'abc', 'key2': 'xyz', 'key3':'pqr'}, {'key1': 'abc2', 'key2': 'xyz2', 'key3':'pqr2'}, {'key1': 'abc3', 'key2': 'xyz3', 'key3':'pqr3'}]"
2 | "[{'key1': 'abc', 'key2': 'xyz', 'key3':'pqr'}]"
3 | "[{'key1': 'abc', 'key2': 'xyz', 'key3':'pqr'}, {'key1': 'abc', 'key2': 'xyz', 'key3':'pqr'}]"
4 | nan
5 | nan
6 | nan
I want to extract only few fields from the json and make them a separate column in alist format
ID | key1_list | Key3_list
-------------------------------
1 | "['abc', 'abc2','abc3']" | "['pqr', 'pqr2','pqr3']"
2 | "['abc']" | "['pqr']"
3 | "['abc','abc']" | "['pqr', 'pqr']"
4 | "[]" | "[]"
5 | "[]" | "[]"
6 | "[]" | "[]"
How do i proceed with this using pandas and python. Thanks
CodePudding user response:
Use custom function only for non missing values for convert strings repr to list of dicts by ast.literal_eval
, then create dictionary of lists with Series
constructor, last add empty lists by DataFrame.reindex
and join to original ID
column by DataFrame.join
:
import ast
#if nan are missing values
m = df['random_column'].notna()
#if nan are strings 'nan'
m = df['random_column'].eq('nan')
need = ['key1','key3']
def f(x):
D = ast.literal_eval(x)
return pd.Series({k: [dic[k] for dic in D] for k in D[0] if k in need}).add_suffix('_list')
df = df[['ID']].join(df.loc[m, 'random_column'].apply(f).reindex(df.index, fill_value=[]))
print (df)
ID key1_list key3_list
0 1 [abc, abc2, abc3] [pqr, pqr2, pqr3]
1 2 [abc] [pqr]
2 3 [abc, abc] [pqr, pqr]
3 4 [] []
4 5 [] []
5 6 [] []
Another solution:
need = ['key1','key3']
suffix = '_list'
def f(x):
try:
D = ast.literal_eval(x)
return pd.Series({k: [dic[k] for dic in D] for k in D[0] if k in need})
except:
return pd.Series([[]] * len(need), index=need)
df = df[['ID']].join(df['random_column'].apply(f).add_suffix(suffix))
print (df)
ID key1_list key3_list
0 1 [abc, abc2, abc3] [pqr, pqr2, pqr3]
1 2 [abc] [pqr]
2 3 [abc, abc] [pqr, pqr]
3 4 [] []
4 5 [] []
5 6 [] []
CodePudding user response:
You can use ast.literal_eval
to parse the strings, then use dict.get
method to get the "key1" and "key3" values for each dict, unpack and zip
, map to lists and finally cast to a DataFrame constructor:
import ast
s = df['random_column'].apply(lambda x: list(map(list, zip(*[(d.get('key1', []), d.get('key3', []))
for d in ast.literal_eval(x)])))
if pd.notna(x) else [[], []])
df[['key1_list', 'key3_list']] = pd.DataFrame(s.tolist()).astype(str)
df = df.drop(columns='random_column')
Output:
ID key1_list key3_list
0 1 ['abc', 'abc2', 'abc3'] ['pqr', 'pqr2', 'pqr3']
1 2 ['abc'] ['pqr']
2 3 ['abc', 'abc'] ['pqr', 'pqr']
3 4 [] []
4 5 [] []
5 6 [] []
Note that what you have is not valid json since json strings begin and end with quotation mark "
.