I have 70k files all of which look similar to this:
{'id': 24, 'name': None, 'city': 'City', 'region_id': 19,
'story_id': 1, 'description': 'text', 'uik': None, 'ustatus': 'status',
'wuiki_tik_name': '', 'reaction': None, 'reaction_official': '',
'created_at': '2011-09-07T07:24:44.420Z', 'lat': 54.7, 'lng': 20.5,
'regions': {'id': 19, 'name': 'name'}, 'stories': {'id': 1, 'name': '2011-12-04'}, 'assets': [], 'taggings': [{'tags': {'id': 6, 'name': 'name',
'tag_groups': {'id': 3, 'name': 'Violation'}}},
{'tags': {'id': 8, 'name': 'name', 'tag_groups': {'id': 5, 'name': 'resource'}}},
{'tags': {'id': 1, 'name': '01. Federal', 'tag_groups': {'id': 1, 'name': 'Level'}}},
{'tags': {'id': 3, 'name': '03. Local', 'tag_groups': {'id': 1, 'name': 'stuff'}}},
{'tags': {'id': 2, 'name': '02. Regional', 'tag_groups':
{'id': 1, 'name': 'Level'}}}], 'message_id': None, '_count': {'assets': 0, 'other_messages': 0, 'similars': 0, 'taggings': 5}}
The ultimate goal is to export it into a single CSV file. It can be successfully done without flattening. But since it has a lot of nested values, I would like to flatten it, and this is where I began facing problems related to data types. Here's the code:
import json
from pandas.io.json import json_normalize
import glob
path = glob.glob("all_messages/*.json")
for file in path:
with open(file, "r") as filer:
content = json.loads(json.dumps(filer.read()))
if content != 404:
df_main = json_normalize(content)
df_regions = json_normalize(content, record_path=['regions'], record_prefix='regions.', meta=['id'])
df_stories = json_normalize(content, record_path=['stories'], record_prefix='stories.', meta=['id'])
#... More code related to normalization
df_out.to_csv('combined_json.csv')
This code occasionally throws:
AttributeError: 'str' object has no attribute 'values'
or ValueError: DataFrame constructor not properly called!
. I realise that this is caused by json.dumps() JSON string output. However, I have failed to turn it into anything useable.
Any possible solutions to this?
CodePudding user response:
If you only need to change '
to "
:
...
for file in path:
with open(file, "r") as filer:
filer.replace("\'", "\"")
...
Making copies and using grep would be easier
CodePudding user response:
While it is not the solution I was initially expecting, this approach worked as well. I kept getting error messages related to the structure of the dict literals that were reluctant to become json, so I took the csv file that I wanted to normalise and worked with each column one by one:
df = pd.read_csv("combined_json.csv")
df['regions'] = df['regions'].apply(lambda x: x.replace("'", '"'))
regions = pd.json_normalize(df['regions'].apply(json.loads).tolist()).rename(
columns=lambda x: x.replace('regions.', ''))
df['regions'] = regions['name']
Or, if it had more nested levels:
df['taggings'] = df['taggings'].apply(lambda x: x.replace("'", '"'))
taggings = pd.concat([pd.json_normalize(json.loads(j)) for j in df['taggings']])
df = df.reset_index(drop=True)
taggings = taggings.reset_index(drop=True)
df[['tags_id', 'nametag', 'group_tag', 'group_tag_name']] = taggings[['tags.id', 'tags.name', 'tags.tag_groups.id', 'tags.tag_groups.name']]
Which was eventually df.to_csv()
.