Replace single quotes with doubles to turn contents of a file into a nested JSON and normalize it af-CodePudding

I have 70k files all of which look similar to this:

{'id': 24, 'name': None, 'city': 'City', 'region_id': 19,
'story_id': 1, 'description': 'text', 'uik': None, 'ustatus': 'status',
'wuiki_tik_name': '', 'reaction': None, 'reaction_official': '',
'created_at': '2011-09-07T07:24:44.420Z', 'lat': 54.7, 'lng': 20.5,
'regions': {'id': 19, 'name': 'name'}, 'stories': {'id': 1, 'name': '2011-12-04'}, 'assets': [], 'taggings': [{'tags': {'id': 6, 'name': 'name',
'tag_groups': {'id': 3, 'name': 'Violation'}}},
{'tags': {'id': 8, 'name': 'name', 'tag_groups': {'id': 5, 'name': 'resource'}}},
{'tags': {'id': 1, 'name': '01. Federal', 'tag_groups': {'id': 1, 'name': 'Level'}}},
{'tags': {'id': 3, 'name': '03. Local', 'tag_groups': {'id': 1, 'name': 'stuff'}}},
{'tags': {'id': 2, 'name': '02. Regional', 'tag_groups':
{'id': 1, 'name': 'Level'}}}], 'message_id': None, '_count': {'assets': 0, 'other_messages': 0, 'similars': 0, 'taggings': 5}}

The ultimate goal is to export it into a single CSV file. It can be successfully done without flattening. But since it has a lot of nested values, I would like to flatten it, and this is where I began facing problems related to data types. Here's the code:

import json
from pandas.io.json import json_normalize
import glob

path = glob.glob("all_messages/*.json")
for file in path:
    with open(file, "r") as filer:
        content = json.loads(json.dumps(filer.read()))
        if content != 404:
            df_main = json_normalize(content)
            df_regions = json_normalize(content, record_path=['regions'], record_prefix='regions.', meta=['id'])
            df_stories = json_normalize(content, record_path=['stories'], record_prefix='stories.', meta=['id'])
            #... More code related to normalization

df_out.to_csv('combined_json.csv')

This code occasionally throws: AttributeError: 'str' object has no attribute 'values' or ValueError: DataFrame constructor not properly called!. I realise that this is caused by json.dumps() JSON string output. However, I have failed to turn it into anything useable.

Any possible solutions to this?

CodePudding user response：

If you only need to change ' to ":

...
for file in path:
    with open(file, "r") as filer:
        filer.replace("\'", "\"")
...

Making copies and using grep would be easier

CodePudding user response：

While it is not the solution I was initially expecting, this approach worked as well. I kept getting error messages related to the structure of the dict literals that were reluctant to become json, so I took the csv file that I wanted to normalise and worked with each column one by one:

df = pd.read_csv("combined_json.csv")
df['regions'] = df['regions'].apply(lambda x: x.replace("'", '"'))
regions = pd.json_normalize(df['regions'].apply(json.loads).tolist()).rename(
    columns=lambda x: x.replace('regions.', ''))
df['regions'] = regions['name']

Or, if it had more nested levels:

df['taggings'] = df['taggings'].apply(lambda x: x.replace("'", '"'))
taggings = pd.concat([pd.json_normalize(json.loads(j)) for j in df['taggings']])
df = df.reset_index(drop=True)
taggings = taggings.reset_index(drop=True)
df[['tags_id', 'nametag', 'group_tag', 'group_tag_name']] = taggings[['tags.id', 'tags.name', 'tags.tag_groups.id', 'tags.tag_groups.name']]

Which was eventually df.to_csv().