Can't scrape API with dynamic value in the URL

Time:06-17

I am trying to scrape SIREN information from the INSEE database using a dynamic value in the URL. The status code has to be in the 200 to 299 range. The result I get is None, None.

import pandas as pd
import requests

def extract_siren_code(siren):
    siren_recup, features = None, None
    base_url = "https://api.insee.fr/entreprises/sirene/V3/siren/"
    endpoint = f"{base_url}{siren}"
    headers = {"Authorization": "Bearer <my bearer token>", "Accept": "application/json"}
    response = requests.get(endpoint, headers=headers)

    if response.status_code not in range(200, 299):
        return None, None
    try:
        '''
        This try block is here in case any of our inputs are invalid. This is done
        instead of actually writing out handlers for all kinds of responses.
        '''
        results = response.json()['uniteLegale'][0]
        print(results)
        siren_recup = results['siren']
        features = ['uniteLegale']
    except:
        pass
    return siren_recup, features

siren_recup, features = extract_siren_code('824239214')
print(siren_recup, features)

CodePudding user response:

Here:

if response.status_code not in range(200, 299):
    return None, None

You say that the status code has to be between 200 and 299, and that the result you get is None, None.

It's possible that your code is returning from here, due to some 3xx, 4xx or 5xx HTTP status code.

Check response.status_code, for example with:

print(f"response.status_code: {response.status_code}")
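As a minimal sketch of that check, the helper below labels a status code using only the standard library, so it runs without calling the API. The name describe_status is hypothetical; in the real function you would pass response.status_code to it. The last line also shows that range(200, 299) stops at 298, so range(200, 300) is what covers the whole 2xx class.

```python
from http import HTTPStatus


def describe_status(status_code: int) -> str:
    """Return a short, printable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Codes that the standard library does not register, such as 299.
        phrase = "Unknown"
    kind = "success" if status_code in range(200, 300) else "failure"
    return f"{status_code} {phrase} ({kind})"


print(describe_status(200))
print(describe_status(401))
print(299 in range(200, 299))  # the end of a range is exclusive
```

Printing this right before the early return tells you whether the request failed (for example with an expired token) or whether the problem is further down, in the parsing.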

Also, there was no need to post your bearer token. Now that it is public, you need to regenerate it.
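One way to keep the token out of posted code is to read it from an environment variable. This is only a sketch: the variable name INSEE_TOKEN and the helper build_headers are assumptions made for illustration, not something the API requires.

```python
import os


def build_headers(env_var: str = "INSEE_TOKEN") -> dict:
    """Build the request headers from a token stored in the environment."""
    token = os.environ.get(env_var)
    if not token:
        # Fail loudly instead of sending a request with no credentials.
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"Authorization": f"Bearer {token}", "Accept": "application/json"}
```

The returned dict can then be passed as headers=build_headers() to requests.get, and the code can be shared without exposing the secret.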

CodePudding user response:

Actually, some companies have only one uniteLegale, and in that case the API replies with a dict instead of a list containing one dict, so you need to add a condition for this case:

import requests

def extract_siren_code(siren):
    siren_recup, features = None, None
    base_url = "https://api.insee.fr/entreprises/sirene/V3/siren/"
    endpoint = f"{base_url}{siren}"
    headers = {"Authorization": "Bearer <my bearer token>", "Accept": "application/json"}
    response = requests.get(endpoint, headers=headers)

    # range() excludes its end value, so 300 is needed to cover 200 to 299.
    if response.status_code not in range(200, 300):
        return None, None
    try:
        # This try block covers invalid inputs and unexpected payloads, instead
        # of writing out handlers for every kind of response.
        unite_legale = response.json()['uniteLegale']
        # The API may return a single dict or a list of dicts.
        results = unite_legale[0] if isinstance(unite_legale, list) else unite_legale
        siren_recup = results['siren']
        features = ['uniteLegale']
    except (KeyError, IndexError, TypeError, ValueError):
        # Missing key, empty list, wrong type, or a body that is not valid JSON.
        pass
    return siren_recup, features
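The dict-or-list condition can be tried without calling the API. The sketch below pulls it into a small helper; the name first_unite_legale and both sample payloads are made up for illustration and only mimic the two shapes described above.

```python
def first_unite_legale(payload: dict) -> dict:
    """Return the uniteLegale record whether the API sent a dict or a list."""
    unite_legale = payload["uniteLegale"]
    if isinstance(unite_legale, list):
        return unite_legale[0]
    return unite_legale


# Hypothetical payloads: one with a single dict, one with a list of one dict.
as_dict = {"uniteLegale": {"siren": "824239214"}}
as_list = {"uniteLegale": [{"siren": "824239214"}]}

print(first_unite_legale(as_dict)["siren"])
print(first_unite_legale(as_list)["siren"])
```

Both calls return the same record, so the rest of the function can read results['siren'] without caring which shape came back.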