Parse a multi-layered JSON from INSEE API


I'm sending a request to the INSEE API, the official French register of companies.

Here is my request:

import requests

headers = {
    'Accept': 'application/json',
    'Authorization': 'xxx'
}

params = (
    ('q', 'siren:530085802'),
    ('date', '2021-10-01'),
    ('champs', 'siret, denominationUniteLegale, codePostalEtablissement, libelleCommuneEtablissement, denominationUsuelleEtablissement'),
    ('debut', 1)
    # ('nombre', 1)
)

response = requests.get('https://api.insee.fr/entreprises/sirene/V3/siret', headers=headers, params=params)

For this example, I'm requesting Facebook's legal entity by its number, siren:530085802 (this is open data, nothing here is confidential).

Now I get a response:

response.text

This:

'{"header":{"statut":200,"message":"OK","total":3,"debut":1,"nombre":2},"etablissements":[{"siret":"53008580200011","uniteLegale":{"denominationUniteLegale":"FACEBOOK FRANCE"},"adresseEtablissement":{"codePostalEtablissement":"75116","libelleCommuneEtablissement":"PARIS 16"},"periodesEtablissement":[{"dateFin":null,"dateDebut":"2012-06-14","denominationUsuelleEtablissement":null},{"dateFin":"2012-06-13","dateDebut":"2011-02-17","denominationUsuelleEtablissement":null},{"dateFin":"2011-02-16","dateDebut":"2011-01-01","denominationUsuelleEtablissement":null}]},{"siret":"53008580200037","uniteLegale":{"denominationUniteLegale":"FACEBOOK FRANCE"},"adresseEtablissement":{"codePostalEtablissement":"75002","libelleCommuneEtablissement":"PARIS 2"},"periodesEtablissement":[{"dateFin":null,"dateDebut":"2018-06-26","denominationUsuelleEtablissement":null},{"dateFin":"2018-06-25","dateDebut":"2016-04-18","denominationUsuelleEtablissement":null}]}]}'

As per the official documentation, there are multiple layers:

  • Legal entity: identified by a siren.
  • Establishment: identified by a siret, which is the SIREN plus 5 extra characters. Basically I'm getting a list of establishments here, since I'm requesting a single SIREN.
  • Periods: archived/change tracking of several attributes, each with a dateDebut (start date), a dateFin (end date) and attributes, among which denominationUsuelleEtablissement, whose current value I'd like to have.

Despite my putting a date in the request, the API responded with every period, not only the one that includes the date I provided. Hence the extra layer.
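For reference, selecting the period that covers a given date can be sketched in plain Python (a minimal sketch; `period_at` is a hypothetical helper, and the field names are taken from the response above):

```python
from datetime import date

def period_at(periods, on):
    """Return the period covering `on`; an open-ended period has dateFin == None."""
    for p in periods:
        debut = date.fromisoformat(p["dateDebut"])
        fin = date.fromisoformat(p["dateFin"]) if p["dateFin"] else date.max
        if debut <= on <= fin:
            return p
    return None

# Truncated copy of one establishment's periodesEtablissement from the response above.
periods = [
    {"dateFin": None, "dateDebut": "2012-06-14",
     "denominationUsuelleEtablissement": None},
    {"dateFin": "2012-06-13", "dateDebut": "2011-02-17",
     "denominationUsuelleEtablissement": None},
]

print(period_at(periods, date(2021, 10, 1)))  # the open-ended (current) period
```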

What I'm trying to do is to flatten the response and convert it to a dataframe for easier use.

What I've done is the following, using pd.json_normalize:

import json
import pandas as pd

contenuReponse = json.loads(response.text)
etablissements = contenuReponse['etablissements']
 
df = pd.json_normalize(etablissements)
df

I get this:

siret periodesEtablissement uniteLegale.denominationUniteLegale
0 53008580200011 [{'dateFin': None, 'dateDebut': '2012-06-14', ... FACEBOOK FRANCE
1 53008580200037 [{'dateFin': None, 'dateDebut': '2018-06-26', ... FACEBOOK FRANCE

json_normalize does not descend into the list stored under periodesEtablissement.

Is there a way to successfully flatten the whole response? Or to use only the periodesEtablissement entry relevant to my request, meaning the one with no dateFin?

CodePudding user response:

You could try using this function:

def flatten_nested_json_df(df):
    df = df.reset_index()

    # Find columns whose every value is a list / a dict
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()

    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()

    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []

        for col in dict_columns:
            # Expand each dict column horizontally into prefixed columns
            horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            horiz_exploded.index = df.index
            df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
            new_columns.extend(horiz_exploded.columns)

        for col in list_columns:
            # Explode each list column vertically (one row per element)
            print(f"exploding: {col}")
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            new_columns.append(col)

        # The newly created columns may themselves hold lists or dicts: repeat
        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()

        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()

    return df

and do this:

flatten_nested_json_df(df)

which gives:

  index           siret uniteLegale.denominationUniteLegale  \
0      0  53008580200011                     FACEBOOK FRANCE   
0      0  53008580200011                     FACEBOOK FRANCE   
0      0  53008580200011                     FACEBOOK FRANCE   
1      1  53008580200037                     FACEBOOK FRANCE   
1      1  53008580200037                     FACEBOOK FRANCE   

  adresseEtablissement.codePostalEtablissement  \
0                                        75116   
0                                        75116   
0                                        75116   
1                                        75002   
1                                        75002   

  adresseEtablissement.libelleCommuneEtablissement  \
0                                         PARIS 16   
0                                         PARIS 16   
0                                         PARIS 16   
1                                          PARIS 2   
1                                          PARIS 2   

  periodesEtablissement.dateFin periodesEtablissement.dateDebut  \
0                          None                      2012-06-14   
0                    2012-06-13                      2011-02-17   
0                    2011-02-16                      2011-01-01   
1                          None                      2018-06-26   
1                    2018-06-25                      2016-04-18   

  periodesEtablissement.denominationUsuelleEtablissement  
0                                               None      
0                                               None      
0                                               None      
1                                               None      
1                                               None      
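Alternatively, a minimal sketch using pd.json_normalize's record_path and meta arguments can flatten periodesEtablissement in a single call, and the current period can then be filtered by its null dateFin (the payload below is a truncated copy of the response above):

```python
import pandas as pd

# Truncated copy of the API response above (one establishment, two periods).
payload = {
    "etablissements": [
        {
            "siret": "53008580200011",
            "uniteLegale": {"denominationUniteLegale": "FACEBOOK FRANCE"},
            "adresseEtablissement": {
                "codePostalEtablissement": "75116",
                "libelleCommuneEtablissement": "PARIS 16",
            },
            "periodesEtablissement": [
                {"dateFin": None, "dateDebut": "2012-06-14",
                 "denominationUsuelleEtablissement": None},
                {"dateFin": "2012-06-13", "dateDebut": "2011-02-17",
                 "denominationUsuelleEtablissement": None},
            ],
        }
    ]
}

# record_path explodes the nested list (one row per period);
# meta copies the parent-level fields onto each of those rows.
df = pd.json_normalize(
    payload["etablissements"],
    record_path="periodesEtablissement",
    meta=[
        "siret",
        ["uniteLegale", "denominationUniteLegale"],
        ["adresseEtablissement", "codePostalEtablissement"],
        ["adresseEtablissement", "libelleCommuneEtablissement"],
    ],
)

# Keep only the current period, i.e. the open-ended one with a null dateFin.
current = df[df["dateFin"].isna()]
print(current)
```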
