BeautifulSoup does not return all HTML data

Time:06-03

I'm trying to load a table from a website. The table is within this part of the html code.

However, using BeautifulSoup (code below):

from bs4 import BeautifulSoup
import requests

url = "https://www.blackrock.com/br/products/251816/ishares-ibovespa-fundo-de-ndice-fund"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
table = soup.find_all("table")

table[6]

The 7th element of the list table stores the following content.

From what I can tell, BeautifulSoup is not getting the <tbody> content from the page's HTML, which is where all the table data is stored. Has anyone experienced a similar issue?

CodePudding user response:

If you disable JavaScript and load the page, the tables will also be empty, so it seems the page loads its content dynamically through JavaScript running in the browser. BeautifulSoup won't help you here, as it only receives the static HTML structure; you'll need some other tool capable of dealing with dynamic content.
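A quick way to confirm this without a browser is to inspect the static HTML yourself. The sketch below is only an illustration: it uses the standard library's html.parser on a made-up snippet (STATIC_HTML, the TableProbe class, and the /hypothetical/holdings.ajax path are all assumptions, not taken from the real page) to count the rows inside <tbody> and to list any data-ajaxuri attributes that point at a separate data endpoint.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the static HTML that requests.get() returns:
# the table shell is present, but the rows are filled in later by JavaScript.
STATIC_HTML = """
<div id="allHoldingsTab" data-ajaxuri="/hypothetical/holdings.ajax">
  <table id="allHoldingsTable">
    <thead><tr><th>Ticker</th><th>Name</th></tr></thead>
    <tbody></tbody>
  </table>
</div>
"""


class TableProbe(HTMLParser):
    """Count <tr> rows inside <tbody> and collect data-ajaxuri attributes."""

    def __init__(self):
        super().__init__()
        self.in_tbody = False
        self.body_rows = 0
        self.ajax_uris = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Any element may advertise an endpoint that serves the real data.
        if "data-ajaxuri" in attrs:
            self.ajax_uris.append(attrs["data-ajaxuri"])
        if tag == "tbody":
            self.in_tbody = True
        elif tag == "tr" and self.in_tbody:
            self.body_rows += 1

    def handle_endtag(self, tag):
        if tag == "tbody":
            self.in_tbody = False


probe = TableProbe()
probe.feed(STATIC_HTML)
print("rows in <tbody>:", probe.body_rows)
print("ajax endpoints:", probe.ajax_uris)
```

If the row count is zero while the browser shows a populated table, the rows are most likely being injected after page load, and any endpoint listed is a good candidate to request directly.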

CodePudding user response:

You will see that the parent div lists an AJAX endpoint that returns the data of interest. You can retrieve that endpoint dynamically from the original URL, call it, and parse the response.

import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd

with requests.Session() as s:
    s.headers = {'User-Agent':'Mozilla/5.0'}
    r = s.get('https://www.blackrock.com/br/products/251816/ishares-ibovespa-fundo-de-ndice-fund')
    soup = bs(r.content, 'lxml')
    # build the absolute endpoint URL from the relative path in data-ajaxuri
    r = s.get('https://www.blackrock.com' +
              soup.select_one('#allHoldingsTab')['data-ajaxuri'])
data = json.loads(r.text.replace('\ufeff',''))['aaData']
df = pd.DataFrame(data)
# take the 'display' value of every cell in columns 2-6 (cells there are dicts)
df.iloc[:, 2:7] = df.iloc[:, 2:7].apply(lambda col: col.map(lambda cell: cell['display']))
print(df)
# name columns appropriately
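
To make the parsing step easier to follow without hitting the site, here is a minimal offline sketch of the same idea. The payload is invented: RAW_TEXT, its values, the "raw" key, and the flatten_rows helper are assumptions for illustration, and only the 'aaData' and 'display' names come from the code above. It strips the leading byte-order mark, parses the JSON, and replaces each dict cell with its display value.

```python
import json

# Hypothetical payload shaped like the response parsed above: a leading
# byte-order mark, an "aaData" list of rows, and some cells stored as
# {"display": ..., "raw": ...} dicts. All values here are made up.
RAW_TEXT = (
    '\ufeff{"aaData": ['
    '["AAA1", "Example A", {"display": "10.50", "raw": 10.5}], '
    '["BBB2", "Example B", {"display": "3.20", "raw": 3.2}]]}'
)


def flatten_rows(text, key="display"):
    """Strip a BOM, parse the JSON, and replace dict cells with cell[key]."""
    payload = json.loads(text.replace("\ufeff", ""))
    return [
        [cell[key] if isinstance(cell, dict) else cell for cell in row]
        for row in payload["aaData"]
    ]


rows = flatten_rows(RAW_TEXT)
for row in rows:
    print(row)
```

The resulting list of lists can be passed straight to pd.DataFrame(rows), after which you can name the columns as needed.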