I have a dataframe column that contain dictionary like strings. it comes in two ways. first option is dictionary within string such as '{"d":11,"g":0.8,"r":45}'. second option is like that: '{d:18, g:0.1, r:75, f:6}'. The data frame is several millions rows. I do not know in which row appear first or second option.
df_initial = a b c kind
0 0.50 bibi 23 '{"d":11,"g":0.8,"r":45}'
1 0.80 cici 140 '{d:18, g:0.1, r:75, f:6}'
2 0.01 didi 320 '{"d":101,"g":0.05,"r":32}'
3 0.12 mimi 3 '{d:41,g:0.26,r:64}'
the desired dataframe
df_final = a b c kind d g r f
0 0.50 bibi 23 '{'d':11, 'g':0.8, 'r':45}' 11 0.80 45 NaN
1 0.80 cici 140 '{d:18, g:0.1, r:75, f:6}' 18 0.10 75 6.0
2 0.01 didi 320 '{'d':101, 'g':0.05, 'r':32}' 101 0.05 32 NaN
3 0.12 mimi 3 '{d:41, g:0.26, r:64}' 41 0.26 64 NaN
CodePudding user response:
Because in sample data missing ''
in keys like {d:18, g:0.1, r:75, f:6}
for me json.loads
and ast.literal_eval
failed, so used:
L=[dict(y.split(':') for y in x.strip("'{} ").replace('"','').replace(', ',',').split(','))
for x in df['kind']]
df = df.join(pd.DataFrame(L, index=df.index))
print (df)
a b c kind d g r f
0 0.50 bibi 23 '{"d":11,"g":0.8,"r":45}' 11 0.8 45 NaN
1 0.80 cici 140 '{d:18, g:0.1, r:75, f:6}' 18 0.1 75 6
2 0.01 didi 320 '{"d":101,"g":0.05,"r":32}' 101 0.05 32 NaN
3 0.12 mimi 3 '{d:41,g:0.26,r:64}' 41 0.26 64 NaN
CodePudding user response:
You could pandas.json_normalize to convert the kind
(has been json
parsed) column to a dataframe, keep in mind that you might have many different keys in the dictionary.
import json
df = pd.DataFrame([('a', '{"d":11,"g":0.8,"r":45}'),], columns=['a', 'kind'])
In [6]: pd.json_normalize(df['kind'].apply(json.loads))
Out[6]:
d g r
0 11 0.8 45
you can concat this new dataframe to the org along axis=columns/0
to get what you want
In [11]: pd.concat([df, pd.json_normalize(df['kind'].apply(json.loads))], axis='columns')
Out[11]:
a kind d g r
0 a {"d":11,"g":0.8,"r":45} 11 0.8 45