nested json from s3 to dataframe with pandas


I'm struggling to unnest this JSON, which I pull from S3, and store only parts of it in a DataFrame.

Here is how I load the file, followed by the structure of the parsed JSON:

import boto3
import json

s3 = boto3.resource('s3')
dat = []
content_object = s3.Object(FROM_BUCKET, key['Key'])
file_content = content_object.get()['Body'].read().decode('utf-8')
json_content = json.loads(file_content)

{'twts': {'101861193645447': {'aiScrs': [{'lfeEvtId': 5,
     'orgScr': 0.779,
     'adjScr': 0.3865,
     'lstScrUtc': '2021-02-24T22:14:17.8420665Z',
     'lstScrYmd': '2021-02-24'}]},
  '100300192097235': {'aiScrs': [{'lfeEvtId': 5,
     'orgScr': 0.765,
     'adjScr': 0.365,
     'lstScrUtc': '2021-02-24T22:14:17.8420665Z',
     'lstScrYmd': '2021-02-24'}]},
  '100179311336977': {'aiScrs': [{'lfeEvtId': 5,
     'orgScr': 0.732,
     'adjScr': 0.332,
     'lstScrUtc': '2021-02-24T22:14:17.8420665Z',
     'lstScrYmd': '2021-02-24'}]}}}

Here is my attempt, followed by what df_dat contains:

dat =[]
response = s3_c.get_object(Bucket=FROM_BUCKET, Key=key['Key'])
df_dat = pd.read_json(response['Body'],convert_axes=False)
dat = pd.json_normalize(data=df_dat)


100179311336977 {'aiScrs': [{'lfeEvtId': 5, 'orgScr': 0.732, 'adjScr': 0.332, 'lstScrUtc': '2021-02-24T22:14:17.8420665Z', 'lstScrYmd': '2021-02-24'}]}
100300192097235 {'aiScrs': [{'lfeEvtId': 5, 'orgScr': 0.765, 'adjScr': 0.365, 'lstScrUtc': '2021-02-24T22:14:17.8420665Z', 'lstScrYmd': '2021-02-24'}]}
101861193645447 {'aiScrs': [{'lfeEvtId': 5, 'orgScr': 0.779, 'adjScr': 0.3865, 'lstScrUtc': '2021-02-24T22:14:17.8420665Z', 'lstScrYmd': '2021-02-24'}]}

The last line (the pd.json_normalize call) raises an error:

AttributeError                            Traceback (most recent call last)
<ipython-input-83-0d22f901897d> in <module>
      4 df_dat = pd.read_json(response['Body'],convert_axes=False)
      5 df_dat
----> 6 dat = pd.json_normalize(data=df_dat)
      7 # dat = pd.json_normalize(data=df_dat, record_path=['aiScrs'])
      8 dat

~/anaconda3/envs/amazonei_tensorflow2_p36/lib/python3.6/site-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
    269     if record_path is None:
--> 270         if any([isinstance(x, dict) for x in y.values()] for y in data):
    271             # naive normalization, this is idempotent for flat records
    272             # and potentially will inflate the data considerably for

~/anaconda3/envs/amazonei_tensorflow2_p36/lib/python3.6/site-packages/pandas/io/json/_normalize.py in <genexpr>(.0)
    269     if record_path is None:
--> 270         if any([isinstance(x, dict) for x in y.values()] for y in data):
    271             # naive normalization, this is idempotent for flat records
    272             # and potentially will inflate the data considerably for

AttributeError: 'str' object has no attribute 'values'

It also errors out when I try to manipulate it in any other way, including:

dat = pd.json_normalize(data=df_dat, record_path=['aiScrs'])

I'm trying to get 3 rows, one per ID, with all of the columns below:

ID   lfeEvtId orgScr adjScr lstScrUtc lstScrYmd

I cannot seem to find a way to do this (a solution using json_normalize would be preferable).

CodePudding user response:

First, use a list comprehension to reshape json_content into a more usable structure: a list of records, each carrying its ID. After that, pd.json_normalize is simple to use:

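# Turn {'<id>': {'aiScrs': [...]}} into [{'id': '<id>', 'aiScrs': [...]}, ...]
tweet_json_list = [{'id': k, **v} for k, v in json_content['twts'].items()]
# One row per entry in 'aiScrs', with the parent 'id' carried along as a column
df = pd.json_normalize(tweet_json_list, record_path='aiScrs', meta=['id'])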


>>> df
   lfeEvtId  orgScr  adjScr                     lstScrUtc   lstScrYmd               id
0         5   0.779  0.3865  2021-02-24T22:14:17.8420665Z  2021-02-24  101861193645447
1         5   0.765  0.3650  2021-02-24T22:14:17.8420665Z  2021-02-24  100300192097235
2         5   0.732  0.3320  2021-02-24T22:14:17.8420665Z  2021-02-24  100179311336977
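
For anyone who wants to try this without S3 access, here is a minimal self-contained sketch of the same approach. The payload is copied from the question and hard-coded, and the names records, column_labels and wanted are only illustrative. It also shows the likely cause of the original error: iterating over a DataFrame yields its column labels, so json_normalize ends up looking at the string 'twts' rather than at a dict. The final rename and reorder to put ID first is my assumption about the desired layout.

```python
import io
import json

import pandas as pd

# Sample payload copied from the question; in the real code this is the
# result of json.loads() on the S3 object body.
json_content = {
    "twts": {
        "101861193645447": {"aiScrs": [{
            "lfeEvtId": 5, "orgScr": 0.779, "adjScr": 0.3865,
            "lstScrUtc": "2021-02-24T22:14:17.8420665Z",
            "lstScrYmd": "2021-02-24"}]},
        "100300192097235": {"aiScrs": [{
            "lfeEvtId": 5, "orgScr": 0.765, "adjScr": 0.365,
            "lstScrUtc": "2021-02-24T22:14:17.8420665Z",
            "lstScrYmd": "2021-02-24"}]},
        "100179311336977": {"aiScrs": [{
            "lfeEvtId": 5, "orgScr": 0.732, "adjScr": 0.332,
            "lstScrUtc": "2021-02-24T22:14:17.8420665Z",
            "lstScrYmd": "2021-02-24"}]},
    }
}

# Rebuild the intermediate DataFrame from the question's attempt.
df_dat = pd.read_json(io.StringIO(json.dumps(json_content)), convert_axes=False)

# Iterating a DataFrame gives column labels (strings), not row dicts.
column_labels = list(df_dat)
print(column_labels)

# Reshape the parsed JSON instead: one record per ID, keeping the ID as a field.
records = [{"id": k, **v} for k, v in json_content["twts"].items()]

# Expand each entry of 'aiScrs' into a row and carry 'id' along as metadata.
df = pd.json_normalize(records, record_path="aiScrs", meta=["id"])

# Assumed layout from the question: ID first, then the score columns.
wanted = ["ID", "lfeEvtId", "orgScr", "adjScr", "lstScrUtc", "lstScrYmd"]
df = df.rename(columns={"id": "ID"})[wanted]
print(df)
```

In the real script, json_content would come from json.loads(file_content) as in the question, so the pd.read_json step is not needed at all.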