Parse a multi-layered JSON from INSEE API


I'm sending a request to the INSEE API, the official French register of companies.

Here is my request:

import requests

headers = {
    'Accept': 'application/json',
    'Authorization': 'xxx'
}

params = (
    ('q', 'siren:530085802'),
    ('date', '2021-10-01'),
    ('champs', 'siret, denominationUniteLegale, codePostalEtablissement, libelleCommuneEtablissement, denominationUsuelleEtablissement'),
    ('debut', 1)
    # ('nombre', 1)
)

response = requests.get('https://api.insee.fr/entreprises/sirene/V3/siret', headers=headers, params=params)

For this example, I'm requesting Facebook's legal entity by its number, siren:530085802 (this is open data, nothing here is confidential).

Now I get a response:

response.text

This:

'{"header":{"statut":200,"message":"OK","total":3,"debut":1,"nombre":2},"etablissements":[{"siret":"53008580200011","uniteLegale":{"denominationUniteLegale":"FACEBOOK FRANCE"},"adresseEtablissement":{"codePostalEtablissement":"75116","libelleCommuneEtablissement":"PARIS 16"},"periodesEtablissement":[{"dateFin":null,"dateDebut":"2012-06-14","denominationUsuelleEtablissement":null},{"dateFin":"2012-06-13","dateDebut":"2011-02-17","denominationUsuelleEtablissement":null},{"dateFin":"2011-02-16","dateDebut":"2011-01-01","denominationUsuelleEtablissement":null}]},{"siret":"53008580200037","uniteLegale":{"denominationUniteLegale":"FACEBOOK FRANCE"},"adresseEtablissement":{"codePostalEtablissement":"75002","libelleCommuneEtablissement":"PARIS 2"},"periodesEtablissement":[{"dateFin":null,"dateDebut":"2018-06-26","denominationUsuelleEtablissement":null},{"dateFin":"2018-06-25","dateDebut":"2016-04-18","denominationUsuelleEtablissement":null}]}]}'

As per the official documentation, there are multiple layers:

  • Legal entity: identified by a siren.
  • Establishment: identified by a siret, which is the SIREN plus 5 extra characters. Basically I'm getting a list of establishments here, since I'm requesting a single SIREN.
  • Periods: archived/change tracking of several attributes, each with a dateDebut (start date), a dateFin (end date) and attributes, among which denominationUsuelleEtablissement, whose current value I'd like to have.

Despite my putting a date in the request, the API responded with every period, not only the one that includes the date I provided. Hence the extra layer.
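For reference, selecting the period that covers a given date can be sketched in plain Python (a minimal sketch; `period_at` is a hypothetical helper, and the field names are taken from the response above):

```python
from datetime import date

def period_at(periods, on):
    """Return the period covering `on`; an open-ended period has dateFin == None."""
    for p in periods:
        debut = date.fromisoformat(p["dateDebut"])
        fin = date.fromisoformat(p["dateFin"]) if p["dateFin"] else date.max
        if debut <= on <= fin:
            return p
    return None

# Truncated copy of one establishment's periodesEtablissement from the response above.
periods = [
    {"dateFin": None, "dateDebut": "2012-06-14",
     "denominationUsuelleEtablissement": None},
    {"dateFin": "2012-06-13", "dateDebut": "2011-02-17",
     "denominationUsuelleEtablissement": None},
]

print(period_at(periods, date(2021, 10, 1)))  # the open-ended (current) period
```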

What I'm trying to do is to flatten the response and convert it to a dataframe for easier use.

What I've done is the following, using pd.json_normalize:

import json
import pandas as pd

contenuReponse = json.loads(response.text)
etablissements = contenuReponse['etablissements']
 
df = pd.json_normalize(etablissements)
df

I get this:

siret periodesEtablissement uniteLegale.denominationUniteLegale
0 53008580200011 [{'dateFin': None, 'dateDebut': '2012-06-14', ... FACEBOOK FRANCE
1 53008580200037 [{'dateFin': None, 'dateDebut': '2018-06-26', ... FACEBOOK FRANCE

json_normalize does not descend into the list stored under periodesEtablissement.

Is there a way to successfully flatten the whole response? Or to use only the periodesEtablissement entry relevant to my request, meaning the one with no dateFin?

CodePudding user response:

You could try using this function:

def flatten_nested_json_df(df):
    df = df.reset_index()

    # Find columns whose every value is a list / a dict
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()

    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()

    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []

        for col in dict_columns:
            # Expand each dict column horizontally into prefixed columns
            horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            horiz_exploded.index = df.index
            df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
            new_columns.extend(horiz_exploded.columns)

        for col in list_columns:
            # Explode each list column vertically (one row per element)
            print(f"exploding: {col}")
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            new_columns.append(col)

        # The newly created columns may themselves hold lists or dicts: repeat
        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()

        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()

    return df

and do this:

flatten_nested_json_df(df)

which gives:

  index           siret uniteLegale.denominationUniteLegale  \
0      0  53008580200011                     FACEBOOK FRANCE   
0      0  53008580200011                     FACEBOOK FRANCE   
0      0  53008580200011                     FACEBOOK FRANCE   
1      1  53008580200037                     FACEBOOK FRANCE   
1      1  53008580200037                     FACEBOOK FRANCE   

  adresseEtablissement.codePostalEtablissement  \
0                                        75116   
0                                        75116   
0                                        75116   
1                                        75002   
1                                        75002   

  adresseEtablissement.libelleCommuneEtablissement  \
0                                         PARIS 16   
0                                         PARIS 16   
0                                         PARIS 16   
1                                          PARIS 2   
1                                          PARIS 2   

  periodesEtablissement.dateFin periodesEtablissement.dateDebut  \
0                          None                      2012-06-14   
0                    2012-06-13                      2011-02-17   
0                    2011-02-16                      2011-01-01   
1                          None                      2018-06-26   
1                    2018-06-25                      2016-04-18   

  periodesEtablissement.denominationUsuelleEtablissement  
0                                               None      
0                                               None      
0                                               None      
1                                               None      
1                                               None      
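Alternatively, a minimal sketch using pd.json_normalize's record_path and meta arguments can flatten periodesEtablissement in a single call, and the current period can then be filtered by its null dateFin (the payload below is a truncated copy of the response above):

```python
import pandas as pd

# Truncated copy of the API response above (one establishment, two periods).
payload = {
    "etablissements": [
        {
            "siret": "53008580200011",
            "uniteLegale": {"denominationUniteLegale": "FACEBOOK FRANCE"},
            "adresseEtablissement": {
                "codePostalEtablissement": "75116",
                "libelleCommuneEtablissement": "PARIS 16",
            },
            "periodesEtablissement": [
                {"dateFin": None, "dateDebut": "2012-06-14",
                 "denominationUsuelleEtablissement": None},
                {"dateFin": "2012-06-13", "dateDebut": "2011-02-17",
                 "denominationUsuelleEtablissement": None},
            ],
        }
    ]
}

# record_path explodes the nested list (one row per period);
# meta copies the parent-level fields onto each of those rows.
df = pd.json_normalize(
    payload["etablissements"],
    record_path="periodesEtablissement",
    meta=[
        "siret",
        ["uniteLegale", "denominationUniteLegale"],
        ["adresseEtablissement", "codePostalEtablissement"],
        ["adresseEtablissement", "libelleCommuneEtablissement"],
    ],
)

# Keep only the current period, i.e. the open-ended one with a null dateFin.
current = df[df["dateFin"].isna()]
print(current)
```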
