I have a dataframe which has one column with rows as json, and I am able to parse them correctly as long as a particular key is removed.
id | email | phone no | details
-------------------------------------------------
0 10 | [email protected] | 123 | {"a" : "hello", "b" : {"x": "whatever"....}, "c": "check"}
1 12 | [email protected] | 789 | {"a" : "bye", "b" : {"x": "ignore"....}, "c": "cool"}
The column details has a key called - "b", which has many key value pairs inside it and some are corrupted because there is comma or inverted commas missing. I don't care about it because I don't need it. Can I just removed that portion of the JSON, please?
I want it like below:
id | email | phone no | details
-------------------------------------------------
0 10 | [email protected] | 123 | {"a" : "hello", "c": "check"}
1 12 | [email protected] | 789 | {"a" : "bye", "c": "cool"}
I will need to spit that key/value in details into rows and columns for "details" and I am abae to do that if I remove that one corrupted key. I have millions of records so a way to ignore that key for all the rows in the "details" column is what I need.
Thanks.
CodePudding user response:
Try a regular expression with str.replace
:
PAT = re.compile(r',\s*"b"\s*:\s*{.*?}\s*,\s*')
df['details'] = df['details'].str.replace(PAT, ', ')
print(df)
# Output:
id email phone no details
0 10 [email protected] 123 {"a" : "hello", "c": "check"}
1 12 [email protected] 789 {"a" : "bye", "c": "cool"}
CodePudding user response:
This is gross and ugly, but may work if "b" always contains a dictionary, or at least curly braces filled with stuff that is not a closing curly brace:
import re
import json
# attempt to delete dictionaries associated with "b" key:
fixed = [re.sub(r'"b" ?: ?{[^}] }, ?', '', s) for s in df['details']]
try:
# test for valid JSON
[json.loads(f) for f in fixed]
df['details'] = fixed
except json.JSONDecodeError:
print('whoops, this ugly hack failed')