Home > Enterprise >  Convert JSON with dictionaries into pandas Dataframe (AWS)
Convert JSON with dictionaries into pandas Dataframe (AWS)

Time:05-11

I have an Amazon serverless Aurora SQL database instance with some debt installments data. I was trying to connect on the DB with AWS Lambda (python 3.7) and found this method:

import boto3
rds_client = boto3.client('rds-data')

database_name = 'dbname'
db_cluster_arn = 'arn:aws:rds:us-east-1:xxxx:cluster:xxxx'
db_credentials_secrets_store_arn = 'arn:aws:secretsmanager:us-east-1:xxxx:secret:rds-db-credentials/cluster-xxxx/'

def lambda_handler(event, context):
    response = execute_statement('SELECT * FROM focafidc.estoque');
    json_string = str(response)
    return response

def execute_statement(sql):
    response = rds_client.execute_statement(
        secretArn=db_credentials_secrets_store_arn,
        database=database_name,
        resourceArn=db_cluster_arn,
        sql=sql
    )
    return response;

The response returns something like a JSON with data nested in dictionaries:

{
  "ResponseMetadata": {
    "RequestId": "f7df6de2-8144-4b7b-9cf0-c828454b4a0d",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amzn-requestid": "f7df6de2-8144-4b7b-9cf0-c828454b4a0d",
  "content-type": "application/json",
  "content-length": "324685",
  "date": "Tue, 10 May 2022 13:51:57 GMT"
},
"RetryAttempts": 0
  },
  "numberOfRecordsUpdated": 0,
  "records": [
[
  {
    "stringValue": "2022-05-02"
  },
  {
    "longValue": 1
  },
  {
    "longValue": 1
  },
  {
    "stringValue": "a3789"
  },
  {
    "stringValue": "519.60"
  },
  {
    "stringValue": "2023-05-02"
  },
  {
    "stringValue": "2598.00"
  },
  {
    "longValue": 666000002
  },
  {
    "stringValue": "1.88"
  },
  {
    "stringValue": "b190"
  },
  {
    "stringValue": "1996-03-25"
  },
  {
    "stringValue": "Brasileiro"
  },
  {
    "stringValue": "masculino"
  },
  {
    "stringValue": "false"
  },
  {
    "stringValue": "SP"
  },
  {
    "stringValue": "São Paulo"
  },
  {
    "longValue": 111111111
  },
  {
    "longValue": 1111111111
  },
  {
    "booleanValue": true
  },
  {
    "stringValue": "LOJAS S.A."
  },
  {
    "stringValue": "99999999999999"
  }
],
[
  {
    "stringValue": "2022-05-02"
  },
  {
    "longValue": 1
  },
  {
    "longValue": 2
  },
  {
    "stringValue": "a3789"
  },
  {
    "stringValue": "519.60"
  },
  {
    "stringValue": "2024-05-01"
  },
  {
    "stringValue": "2598.00"
  },
  {
    "longValue": 666000002
  },
  {
    "stringValue": "1.88"
  },
  {
    "stringValue": "b190"
  },
  {
    "stringValue": "1996-03-25"
  },
  {
    "stringValue": "Brasileiro"
  },
  {
    "stringValue": "masculino"
  },
  {
    "stringValue": "false"
  },
  {
    "stringValue": "SP"
  },
  {
    "stringValue": "São Paulo"
  },
  {
    "longValue": 111111111
  },
  {
    "longValue": 1111111111
  },
  {
    "booleanValue": true
  },
  {
    "stringValue": "LOJAS S.A."
  },
  {
    "stringValue": "99999999999999"
  }
],
[
  {
    "stringValue": "2022-05-02"
  },
  {
    "longValue": 1
  },
  {
    "longValue": 3
  },
  {
    "stringValue": "a3789"
  },
  {
    "stringValue": "519.60"
  },
  {
    "stringValue": "2025-05-01"
  },
  {
    "stringValue": "2598.00"
  },
  {
    "longValue": 666000002
  },
  {
    "stringValue": "1.88"
  },
  {
    "stringValue": "b190"
  },
  {
    "stringValue": "1996-03-25"
  },
  {
    "stringValue": "Brasileiro"
  },
  {
    "stringValue": "masculino"
  },
  {
    "stringValue": "false"
  },
  {
    "stringValue": "SP"
  },
  {
    "stringValue": "São Paulo"
  },
  {
    "longValue": 111111111
  },
  {
    "longValue": 1111111111
  },
  {
    "booleanValue": true
  },
  {
    "stringValue": "LOJAS S.A."
  },
  {
    "stringValue": "99999999999999"
  }
],
[
  {
    "stringValue": "2022-05-02"
  },
  {
    "longValue": 1
  },
  {
    "longValue": 4
  },
  {
    "stringValue": "a3789"
  },
  {
    "stringValue": "519.60"
  },
  {
    "stringValue": "2026-05-01"
  },
  {
    "stringValue": "2598.00"
  },
  {
    "longValue": 666000002
  },
  {
    "stringValue": "1.88"
  },
  {
    "stringValue": "b190"
  },
  {
    "stringValue": "1996-03-25"
  },
  {
    "stringValue": "Brasileiro"
  },
  {
    "stringValue": "masculino"
  },
  {
    "stringValue": "false"
  },
  {
    "stringValue": "SP"
  },
  {
    "stringValue": "São Paulo"
  },
  {
    "longValue": 111111111
  },
  {
    "longValue": 1111111111
  },
  {
    "booleanValue": true
  },
  {
    "stringValue": "LOJAS S.A."
  },
  {
    "stringValue": "99999999999999"
  }
],
[
  {
    "stringValue": "2022-05-02"
  },
  {
    "longValue": 1
  },
  {
    "longValue": 5
  },
  {
    "stringValue": "a3789"
  },
  {
    "stringValue": "519.60"
  },
  {
    "stringValue": "2027-05-01"
  },
  {
    "stringValue": "2598.00"
  },
  {
    "longValue": 666000002
  },
  {
    "stringValue": "1.88"
  },
  {
    "stringValue": "b190"
  },
  {
    "stringValue": "1996-03-25"
  },
  {
    "stringValue": "Brasileiro"
  },
  {
    "stringValue": "masculino"
  },
  {
    "stringValue": "false"
  },
  {
    "stringValue": "SP"
  },
  {
    "stringValue": "São Paulo"
  },
  {
    "longValue": 111111111
  },
  {
    "longValue": 1111111111
  },
  {
    "booleanValue": true
  },
  {
    "stringValue": "LOJAS S.A."
  },
  {
    "stringValue": "99999999999999"
  }
    ]
  ]
}

I need this data to be a pandas dataframe, so I tried to json_normalize the response JSON and got the following result:

bd1 = pd.json_normalize(response,['records'])
print(bd1)

0   ...                                 20
0    {'stringValue': '2022-05-02'}  ...  {'stringValue': '99999999999999'}
1    {'stringValue': '2022-05-02'}  ...  {'stringValue': '99999999999999'}
2    {'stringValue': '2022-05-02'}  ...  {'stringValue': '99999999999999'}
3    {'stringValue': '2022-05-02'}  ...  {'stringValue': '99999999999999'}
4    {'stringValue': '2022-05-02'}  ...  {'stringValue': '99999999999999'}

Can you guys suggest any method to create or convert this to an only values Dataframe?

CodePudding user response:

How about we first parsing the records to standard python objects, and then we handle the JSON-like python structure to dataframe. Assume that you've already parsed the records to nested-list of dicts like the following:

true, false, null = True, False, None

records = [
    [
        {
            "stringValue": "2022-05-02"
        },
        {
            "longValue": 1
        },
        {
            "longValue": 1
        },
        {
            "stringValue": "a3789"
        },
        {
            "stringValue": "519.60"
        },
        {
            "stringValue": "2023-05-02"
        },
        {
            "stringValue": "2598.00"
        },
        {
            "longValue": 666000002
        },
        {
            "stringValue": "1.88"
        },
        {
            "stringValue": "b190"
        },
        {
            "stringValue": "1996-03-25"
        },
        {
            "stringValue": "Brasileiro"
        },
        {
            "stringValue": "masculino"
        },
        {
            "stringValue": "false"
        },
        {
            "stringValue": "SP"
        },
        {
            "stringValue": "São Paulo"
        },
        {
            "longValue": 111111111
        },
        {
            "longValue": 1111111111
        },
        {
            "booleanValue": true
        },
        {
            "stringValue": "LOJAS S.A."
        },
        {
            "stringValue": "99999999999999"
        }
    ],
    [
        {
            "stringValue": "2022-05-03"
        },
        {
            "longValue": 1
        },
        {
            "longValue": 2
        },
        {
            "stringValue": "a3789"
        },
        {
            "stringValue": "519.60"
        },
        {
            "stringValue": "2024-05-01"
        },
        {
            "stringValue": "2598.00"
        },
        {
            "longValue": 666000002
        },
        {
            "stringValue": "1.88"
        },
        {
            "stringValue": "b190"
        },
        {
            "stringValue": "1996-03-25"
        },
        {
            "stringValue": "Brasileiro"
        },
        {
            "stringValue": "masculino"
        },
        {
            "stringValue": "false"
        },
        {
            "stringValue": "SP"
        },
        {
            "stringValue": "São Paulo"
        },
        {
            "longValue": 111111111
        },
        {
            "longValue": 1111111111
        },
        {
            "booleanValue": true
        },
        {
            "stringValue": "LOJAS S.A."
        },
        {
            "stringValue": "99999999999999"
        }
    ],
]

Here we start to extract the values you focus on:

def first(seq):
    return next(iter(seq))


import pandas as pd

records_values = [[first(item.values()) for item in record] for record in records]
df = pd.DataFrame(records_values)
print(df)

The Output is:

           0   1   2      3   ...          17    18          19              20
0  2022-05-02   1   1  a3789  ...  1111111111  True  LOJAS S.A.  99999999999999
1  2022-05-03   1   2  a3789  ...  1111111111  True  LOJAS S.A.  99999999999999

[2 rows x 21 columns]

And if you want to keep the correspond value types, you can extract the value types from one item of the records like this, and do the related type-casting later in pandas:

candidate = records[0]
value_types = [first(item.keys()) for item in candidate]
# ['stringValue', 'longValue', 'longValue', 'stringValue', 'stringValue', 'stringValue', 'stringValue',
# 'longValue', 'stringValue', 'stringValue', 'stringValue', 'stringValue', 'stringValue', 'stringValue',
# 'stringValue', 'stringValue', 'longValue', 'longValue', 'booleanValue', 'stringValue', 'stringValue']
  • Related