Extract part of json with beautifulsoup using python

Time:01-13

I need to extract part of a JSON response.

Part of my code:

r = scraper.get('https://nsa.gob.ye/ha/api/scar-doc/01/09090909/', json=payload, headers=headers, cookies=cookies)

Part of the response, as printed by print(r.text):

<div  style="clear: both" aria-label="request info">
                <pre ><b>GET</b> /ha/api/scar-doc/01/09090909/</pre>
              </div>

              <div  aria-label="response info">
                <pre ><span ><b>HTTP 200 OK</b>
<b>Allow:</b> <span >GET, HEAD, OPTIONS</span>
<b>Content-Type:</b> <span >application/json</span>
<b>Vary:</b> <span >Accept</span>

</span>{
    'datos': {
        'data': {
            'tipo_documento': '01',
            'numero_documento': '09090909',
            'apellido_paterno': 'SHREK',
            'apellido_materno': 'SHREK',
            'nombres': 'SHREK',
            'edad_anios': 111,
            'str_fecha_nacimiento': '00/00/0000'
        },
        'resultado': 'Enc'
    }
}</pre>
              </div>
            </div>

I need to get the value of 'str_fecha_nacimiento' using BeautifulSoup. Thanks.

Answer:

The problem is that the JSON sits as plain text inside an incomplete HTML fragment.

So I take the text of the response div, split it into lines, and keep only the JSON part by discarding the header lines that come before it.

Here is the code:

sample_data = """
<div  style="clear: both" aria-label="request info">
   <pre ><b>GET</b> /ha/api/scar-doc/01/09090909/</pre>
</div>
<div  aria-label="response info">
   <pre ><span ><b>HTTP 200 OK</b>
<b>Allow:</b> <span >GET, HEAD, OPTIONS</span>
<b>Content-Type:</b> <span >application/json</span>
<b>Vary:</b> <span >Accept</span>

</span>{
    'datos': {
        'data': {
            'tipo_documento': '01',
            'numero_documento': '09090909',
            'apellido_paterno': 'SHREK',
            'apellido_materno': 'SHREK',
            'nombres': 'SHREK',
            'edad_anios': 111,
            'str_fecha_nacimiento': '00/00/0000'
        },
        'resultado': 'Enc'
    }
}</pre>
</div>
</div>
"""

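from bs4 import BeautifulSoup

# Get the soup:
soup = BeautifulSoup(sample_data, "html.parser")

# Get only the JSON data by discarding the first six lines of the div's text
# (a blank line, the status line, three header lines and another blank line).
# The div shown above has no class attribute, so it is located by its aria-label.
# The text is split on the line break "\n" and then joined again into a single string:
response_div = soup.find("div", attrs={"aria-label": "response info"})
js_data = "\n".join(response_div.get_text().split("\n")[6:])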

# Print the JSON data obtained: 
print(js_data)

Result:

{
    'datos': {
        'data': {
            'tipo_documento': '01',
            'numero_documento': '09090909',
            'apellido_paterno': 'SHREK',
            'apellido_materno': 'SHREK',
            'nombres': 'SHREK',
            'edad_anios': 111,
            'str_fecha_nacimiento': '00/00/0000'
        },
        'resultado': 'Enc'
    }
}
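
Counting lines works for this exact response, but it breaks if the server adds or removes a header line. A minimal sketch of an alternative, assuming the payload is the only place where braces appear in the response block: slice the text from the first "{" to the last "}". The helper name extract_payload_text and the shortened SAMPLE_HTML are hypothetical and only stand in for r.text.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for r.text
SAMPLE_HTML = """
<div aria-label="response info">
   <pre><span><b>HTTP 200 OK</b>
<b>Content-Type:</b> <span>application/json</span>

</span>{
    'datos': {
        'data': {'str_fecha_nacimiento': '00/00/0000'},
        'resultado': 'Enc'
    }
}</pre>
</div>
"""


def extract_payload_text(html):
    """Return the text from the first '{' to the last '}' in the response block."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find("div", attrs={"aria-label": "response info"})
    if block is None:
        raise ValueError("response info block not found")
    text = block.get_text()
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no braces found in response block")
    return text[start:end + 1]


payload_text = extract_payload_text(SAMPLE_HTML)
print(payload_text)
```

This does not depend on how many header lines precede the payload, but it would pick the wrong span if a header value ever contained a brace.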

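Note that js_data is still a string at this point. Because it uses single quotes, it is a Python literal rather than strict JSON, so ast.literal_eval (not json.loads) turns it into a dictionary, from which you can read 'str_fecha_nacimiento':

Code:

import ast

# Convert the single-quoted text into a Python dict
json_data = ast.literal_eval(js_data)
print(json_data)

# Read the requested field
print(json_data['datos']['data']['str_fecha_nacimiento'])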
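
If it is not known in advance whether the extracted text is strict JSON or a single-quoted Python literal, one option is to try json.loads first and fall back to ast.literal_eval. The following is a minimal sketch under that assumption; parse_payload and the shortened payload_text are hypothetical names and data used only for illustration.

```python
import ast
import json

# Shortened, hypothetical copy of the text extracted from the response
payload_text = """{
    'datos': {
        'data': {
            'numero_documento': '09090909',
            'edad_anios': 111,
            'str_fecha_nacimiento': '00/00/0000'
        },
        'resultado': 'Enc'
    }
}"""


def parse_payload(text):
    """Parse strict JSON first, then fall back to a Python literal."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Single-quoted keys are not valid JSON,
        # but they are a valid Python dict literal.
        return ast.literal_eval(text)


data = parse_payload(payload_text)
birth_date = data["datos"]["data"]["str_fecha_nacimiento"]
print(birth_date)  # 00/00/0000
```

The fallback only covers Python-style literals. Single-quoted text that also contains JSON keywords such as true, false or null would fail in both parsers and would need separate handling.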