Python: problems correctly reading a nested JSON file

I'm having trouble reading a nested JSON file into a dataframe correctly. This is a sample of the JSON file with pharmaceutical products that I'm working on:

[
    [
        {
            "ScrapingOriginIdentifier": "N",
            "ActiveSubstances": [
                "A.C.T.H. pour préparations homéopathiques"
            ],
            "ATC": null,
            "Name": "A.C.T.H. BOIRON, degré de dilution compris entre 4CH et 30CH ou entre 8DH et 60DH",
            "OtherFields": [
                {
                    "Name": null,
                    "Value": "CIS: 6 499 638 6",
                    "Type": "string"
                },
                {
                    "Name": null,
                    "Value": "MA Holder since: 06/10/2021",
                    "Type": "string"
                }
            ],
            "Package": "1 tube de 4 g de granules",
            "PharmaceuticalForm": "Granules"
        },
        {
            "ScrapingOriginIdentifier": "N",
            "ActiveSubstances": [
                "A.C.T.H. pour préparations homéopathiques"
            ],
            "ATC": null,
            "Name": "A.C.T.H. BOIRON, degré de dilution compris entre 4CH et 30CH ou entre 8DH et 60DH",
            "OtherFields": [
                {
                    "Name": null,
                    "Value": "CIS: 6 499 638 6",
                    "Type": "string"
                },
                {
                    "Name": null,
                    "Value": "MA Holder since: 06/10/2021",
                    "Type": "string"
                }
            ],
            "Package": "1 tube de 20 g de pommade",
            "PharmaceuticalForm": "Granules"
        }
    ],
    [
        {
            "ScrapingOriginIdentifier": "34009 341 687 6 5",
            "ActiveSubstances": [],
            "ATC": null,
            "Name": "17 B ESTRADIOL BESINS-ISCOVESCO 0,06 POUR CENT, gel pour application cutanée en tube",
            "OtherFields": [
                {
                    "Name": null,
                    "Value": "CIS: 6 858 620 3",
                    "Type": "string"
                },
                {
                    "Name": null,
                    "Value": "Codes: 34009 341 687 6 5 or 341 687-6",
                    "Type": "string"
                }
            ],
            "Package": "1 tube(s) aluminium verni de 80 g avec applicateur polystyrène",
            "PharmaceuticalForm": "Gel"
        }
    ]
]

I can see the problem is that it's nested by ScrapingOriginIdentifier. I read the file using:

dataset = pd.read_json('data.json', orient='records')

And tried to 'shape' it correctly using:

dataset = pd.json_normalize(dataset)

This still did not work. How can I read the file correctly so that I get all the records into a single dataframe?

Answer:

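First, the unquoted null values are not a problem: null is a valid JSON literal and is read as None in Python, whereas quoting it would turn it into the string "null". The real issue is that the structure of your JSON is not suitable for creating a dataframe directly, because it is a list of lists of records. The structure is the following: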

[
  [
   { "ScrapingOriginIdentifier": "...", ...},
   { "ScrapingOriginIdentifier": "...", ...},
  ],
  [
   { "ScrapingOriginIdentifier": "...", ...},
  ]
]

While it should be a single flat list of records, like this:

[
  { "ScrapingOriginIdentifier": "...", ...},
  { "ScrapingOriginIdentifier": "...", ...},
  { "ScrapingOriginIdentifier": "...", ...}
]

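Please consider flattening your list this way:

import json
import pandas as pd

with open('data.json', encoding='utf-8') as f:
    nested = json.load(f)  # list of lists of dicts

# Collect every record from every inner list into one flat list
records = []
for group in nested:
    for item in group:
        records.append(item)
df = pd.DataFrame(records)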
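
For reference, here is a minimal self-contained sketch of the same flattening idea. It assumes a trimmed-down, made-up stand-in for data.json (the raw string below is hypothetical, not your real file). It uses itertools.chain.from_iterable as one possible way to flatten a single level, and it also shows how pd.json_normalize with record_path and meta might be used to get one row per OtherFields entry. The choice of meta columns and the "Field." prefix are assumptions for illustration only.

```python
import json
from itertools import chain

import pandas as pd

# Hypothetical, trimmed-down stand-in for data.json: a list of lists of records.
raw = """
[
  [
    {"Name": "A", "ATC": null, "Package": "tube 4 g",
     "OtherFields": [{"Name": null, "Value": "CIS: 1", "Type": "string"}]},
    {"Name": "A", "ATC": null, "Package": "tube 20 g",
     "OtherFields": [{"Name": null, "Value": "CIS: 1", "Type": "string"}]}
  ],
  [
    {"Name": "B", "ATC": null, "Package": "tube 80 g",
     "OtherFields": [{"Name": null, "Value": "CIS: 2", "Type": "string"},
                     {"Name": null, "Value": "Codes: 3", "Type": "string"}]}
  ]
]
"""

nested = json.loads(raw)                     # JSON null becomes Python None
records = list(chain.from_iterable(nested))  # flatten one level: list of dicts

# One row per product; OtherFields stays a list inside a single column.
products = pd.DataFrame(records)

# One row per OtherFields entry, carrying the product's Name and Package along.
# record_prefix avoids a clash between the product "Name" and the field "Name".
fields = pd.json_normalize(
    records,
    record_path="OtherFields",
    meta=["Name", "Package"],
    record_prefix="Field.",
)

print(products)
print(fields)
```

If you only need one row per product, the products frame is enough. The fields frame is useful when you want to filter or parse the individual OtherFields values.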