Home > OS >  Normalize Semi-Structured JSON with Python
Normalize Semi-Structured JSON with Python

Time:08-11

I have been trying to normalize a JSON file to succesfully convert it to a CSV later, but i just cant get to work.

Im trying normalize it with pandas pd.json_normalize but it seems to be a very nested JSON.

This is the json structure:

    {
            "documentReport": {
                "documentId": "62f0f78a5e1428137492451c",
                "flowName": "sdsf",
                "flowStateName": "sent",
                "userName": "20304050607",
                "shortUrl": "vHppw8dm",
                "uploadDateUtc": "08/08/2022 08:46:18",
                "appId": "tucarpeta.com",
                "claro": {
                    "ctl_Id": "15836272",
                    "numberSds": "00441426TAE",
                    "numberOT": "12047839",
                    "entity": "193811",
                    "entityType": "432",
                    "channel": "1",
                    "presencial": 1,
                    "customer": {
                        "documentNumber": "23763407",
                        "name": "PAULA RAQUEL",
                        "lastName": "VELASQUEZ",
                        "cuit": null,
                        "sex": "F",
                        "email": "[email protected]"
                    }
                }
            },
            "claroEventKpiReportModel": {
                "deviceData": {
                    "operativeSystem": "Android 10 undefined",
                    "model": "SAMSUNG SM-J400M",
                    "device": "mobile",
                    "browser": "Samsung Internet for Android"
                },
                "eventData": {
                    "eventDateUtc": "08/08/2022 08:49:50",
                    "eventName": "IDOC_DOCUMENT_OPENED_BY_AVA",
                    "userName": "system"
                },
                "result": "Ok"
            }
        }

and so it goes on with the next record.

The output i need is the following:

documentId flowName flowStateName userName uploadDateUtc numberSds
62f0f78a5e1428137492451c sdsf sent 20304050607 08/08/2022 08:46:18 00441426TAE

And so with other records.

Can anyone help me how to use json_normalize to do so? Or how to get it done anyway. Thanks

CodePudding user response:

After converting literal null value that is not a string to np.nan, use

pd.DataFrame.from_dict(dict).T

CodePudding user response:

pd.json_normalize can't parse every JSON schema out there. You can preprocess the data before turning it to a data frame:

df = pd.DataFrame([{
    **{
        key: record["documentReport"][key]
        for key in ["documentId", "flowName", "flowStateName", "userName", "uploadDateUtc"]
    },
    "numberSds": record["documentReport"]["claro"]["numberSds"]
} for record in json_array])
  • Related