How to delete part of the JSON that is corrupted in a dataframe?-CodePudding

I have a dataframe which has one column with rows as json, and I am able to parse them correctly as long as a particular key is removed.

   id | email     | phone no | details
-------------------------------------------------
0  10 | [email protected] |   123   | {"a" : "hello", "b" : {"x": "whatever"....}, "c": "check"}
1  12 | [email protected] |   789   | {"a" : "bye", "b" : {"x": "ignore"....}, "c": "cool"}

The column details has a key called - "b", which has many key value pairs inside it and some are corrupted because there is comma or inverted commas missing. I don't care about it because I don't need it. Can I just removed that portion of the JSON, please?

I want it like below:

   id | email     | phone no | details
-------------------------------------------------
0  10 | [email protected] |   123   | {"a" : "hello", "c": "check"}
1  12 | [email protected] |   789   | {"a" : "bye", "c": "cool"}

I will need to spit that key/value in details into rows and columns for "details" and I am abae to do that if I remove that one corrupted key. I have millions of records so a way to ignore that key for all the rows in the "details" column is what I need.

Thanks.

CodePudding user response：

Try a regular expression with str.replace:

PAT = re.compile(r',\s*"b"\s*:\s*{.*?}\s*,\s*')
df['details'] = df['details'].str.replace(PAT, ', ')
print(df)

# Output:
   id      email  phone no                        details
0  10  [email protected]       123  {"a" : "hello", "c": "check"}
1  12  [email protected]       789     {"a" : "bye", "c": "cool"}

CodePudding user response：

This is gross and ugly, but may work if "b" always contains a dictionary, or at least curly braces filled with stuff that is not a closing curly brace:

import re
import json

# attempt to delete dictionaries associated with "b" key:
fixed = [re.sub(r'"b" ?: ?{[^}] }, ?', '', s) for s in df['details']]

try:
    # test for valid JSON
    [json.loads(f) for f in fixed]
    df['details'] = fixed
except json.JSONDecodeError:
    print('whoops, this ugly hack failed')