BeautifulSoup: Scraping a Table with Pagination

Time:03-24

I'm trying to scrape https://statusinvest.com.br/fundos-imobiliarios/urpr11 to get the dividend information for this specific REIT from a table (I'll generalize this later). This is the table that contains the info:

[Image: dividends table]

I was able to get the dates and values from the table, but only for the first page. When I switch to another page of the table, the website URL does not change, so I don't know how to reach the remaining rows. Any help would be appreciated.

Note: Ideally the solution should not depend on the number of pages, because some REITs have more than 2 pages of data.

This is how I'm currently getting the info from the first page:

from bs4 import BeautifulSoup
import requests


URL = "https://statusinvest.com.br/fundos-imobiliarios/urpr11"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all("tr", class_="")

rows = []
for r in test:
  if not r.find("td", title="Rendimento"):
    continue
  row = []
  for child in r.findChildren():
    if child.text.lower()=="rendimento":
      continue
    print(child.text)
    row.append(child.text)
  rows.append(row)

Answer:

The table is populated dynamically by JavaScript, which requests does not execute, so you won't get all the data by parsing the rendered rows.

How to fix?

You could use Selenium to interact with the website the way a person would in a browser, but that is better kept for later, more complicated cases.

In this case the solution is much simpler and does not need Selenium. Just grab the JSON data that the JavaScript uses to populate the table:

data = json.loads(soup.select_one('#results')['value'])
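To see why this one-liner works without rendering any JavaScript, here is a minimal offline sketch. It assumes the `#results` element is a hidden input whose `value` attribute carries an HTML-escaped JSON array; the sample markup, the two rows, and the helper name `extract_rows` are illustrative stand-ins, not the site's actual source.

```python
import html
import json

from bs4 import BeautifulSoup

# Hypothetical rows standing in for the JSON that the page embeds.
sample_rows = [
    {"et": "Rendimento", "ed": "25/02/2022", "pd": "15/03/2022", "v": 1.635},
    {"et": "Rendimento", "ed": "31/01/2022", "pd": "14/02/2022", "v": 1.63},
]

# Assumed markup: JSON stored, HTML-escaped, in the value attribute.
SAMPLE_PAGE = '<input type="hidden" id="results" value="{}">'.format(
    html.escape(json.dumps(sample_rows))
)


def extract_rows(page_source):
    """Return the decoded JSON rows, or an empty list if the element is missing."""
    soup = BeautifulSoup(page_source, "html.parser")
    node = soup.select_one("#results")
    if node is None or not node.has_attr("value"):
        return []
    # BeautifulSoup has already unescaped the attribute, so this is plain JSON.
    return json.loads(node["value"])


rows = extract_rows(SAMPLE_PAGE)
print(len(rows), rows[0]["ed"])
```

Checking for a missing element before indexing avoids a `TypeError` if the page layout ever changes.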

Convert it into a DataFrame, adjust it to your needs, and save it to CSV, JSON, etc.:

pd.DataFrame(data).to_csv('yourFile.csv', index=False)

The data contains more columns than are displayed on the website; take a look at the output of the example. The following adjustments give you the expected ones by reading only specific keys and renaming the column headers:

df = pd.DataFrame(data, columns=['et','ed', 'pd', 'v'])
df.columns = ['TIPO','DATA COM','PAGAMENTO','VALOR']
df.to_csv('yourFile.csv', index=False)

Output

TIPO        DATA COM    PAGAMENTO   VALOR
Rendimento  25/02/2022  15/03/2022  1.635
Rendimento  31/01/2022  14/02/2022  1.63
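The dates arrive as day-first strings, so sorting or filtering them as text would give misleading results. A possible follow-up, sketched here with two records copied from the output of this answer (the extra `adj` key only shows that unlisted keys are dropped), is to parse them into real datetimes:

```python
import pandas as pd

# Two records taken from the output shown in this answer.
data = [
    {"et": "Rendimento", "ed": "25/02/2022", "pd": "15/03/2022", "v": 1.635, "adj": False},
    {"et": "Rendimento", "ed": "31/01/2022", "pd": "14/02/2022", "v": 1.63, "adj": False},
]

# Passing columns= keeps only the listed keys, in that order.
df = pd.DataFrame(data, columns=["et", "ed", "pd", "v"])
df.columns = ["TIPO", "DATA COM", "PAGAMENTO", "VALOR"]

# Parse the dd/mm/yyyy strings explicitly instead of relying on inference.
for col in ["DATA COM", "PAGAMENTO"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

# Oldest payment first.
df = df.sort_values("DATA COM").reset_index(drop=True)
print(df)
```

Giving an explicit `format` avoids pandas guessing month-first for ambiguous dates such as 03/04/2021.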

Example

from bs4 import BeautifulSoup
import requests, json
import pandas as pd


URL = "https://statusinvest.com.br/fundos-imobiliarios/urpr11"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

# The element with id "results" holds the table data as JSON
data = json.loads(soup.select_one('#results')['value'])
df = pd.DataFrame(data)
print(df)

# or, with the adjustments mentioned above:
# df = pd.DataFrame(data, columns=['et','ed', 'pd', 'v'])
# df.columns = ['TIPO','DATA COM','PAGAMENTO','VALOR']
# df.to_csv('yourFile.csv', index=False)

Output

y  m  d  ad  ed          pd          et          etd         v         ov  sv          sov  adj
0  0  0      25/02/2022  15/03/2022  Rendimento  Rendimento  1.635         1,63500000  -    False
0  0  0      31/01/2022  14/02/2022  Rendimento  Rendimento  1.63          1,63000000  -    False
0  0  0      30/12/2021  14/01/2022  Rendimento  Rendimento  1.67          1,67000000  -    False
0  0  0      30/11/2021  14/12/2021  Rendimento  Rendimento  1.869         1,86900000  -    False
0  0  0      29/10/2021  16/11/2021  Rendimento  Rendimento  1.37          1,37000000  -    False
0  0  0      30/09/2021  15/10/2021  Rendimento  Rendimento  2.17          2,17000000  -    False
0  0  0      31/08/2021  15/09/2021  Rendimento  Rendimento  2.01          2,01000000  -    False
0  0  0      30/07/2021  13/08/2021  Rendimento  Rendimento  1.48          1,48000000  -    False
0  0  0      30/06/2021  14/07/2021  Rendimento  Rendimento  2.4           2,40000000  -    False
0  0  0      31/05/2021  15/06/2021  Rendimento  Rendimento  2.06          2,06000000  -    False
0  0  0      30/04/2021  14/05/2021  Rendimento  Rendimento  1.185         1,18500000  -    False
0  0  0      31/03/2021  15/04/2021  Rendimento  Rendimento  2.87          2,87000000  -    False
0  0  0      26/02/2021  12/03/2021  Rendimento  Rendimento  2.09          2,09000000  -    False
0  0  0      29/01/2021  12/02/2021  Rendimento  Rendimento  2.25          2,25000000  -    False
0  0  0      30/12/2020  15/01/2021  Rendimento  Rendimento  2.01          2,01000000  -    False
0  0  0      30/11/2020  14/12/2020  Rendimento  Rendimento  2.03668       2,03668260  -    False
0  0  0      30/10/2020  13/11/2020  Rendimento  Rendimento  3.24          3,24000000  -    False
0  0  0      30/09/2020  15/10/2020  Rendimento  Rendimento  2.15          2,15000000  -    False
0  0  0      31/08/2020  15/09/2020  Rendimento  Rendimento  1.35          1,35000000  -    False
0  0  0      31/07/2020  14/08/2020  Rendimento  Rendimento  0.814098      0,81409811  -    False
0  0  0      30/06/2020  15/07/2020  Rendimento  Rendimento  1.56063       1,56063128  -    False
0  0  0      29/05/2020  15/06/2020  Rendimento  Rendimento  0.778074      0,77807445  -    False
0  0  0      30/04/2020  11/05/2020  Rendimento  Rendimento  0.615445      0,61544523  -    False
0  0  0      14/04/2020  15/04/2020  Rendimento  Rendimento  0.189474      0,18947368  -    False
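Since the question mentions generalizing to other REITs later, the same idea could be wrapped in small functions. This is only a sketch under two assumptions: that every REIT page follows the URL pattern of the urpr11 page, and that each page embeds its data in `#results` the same way. The names `parse_dividends`, `fetch_dividends`, and `fetch_many` are hypothetical helpers.

```python
import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Assumption: other REIT pages share the URL pattern of the urpr11 page.
BASE_URL = "https://statusinvest.com.br/fundos-imobiliarios/{ticker}"
COLUMNS = {"et": "TIPO", "ed": "DATA COM", "pd": "PAGAMENTO", "v": "VALOR"}


def parse_dividends(page_source, ticker):
    """Build a DataFrame from the JSON stored in the #results element."""
    soup = BeautifulSoup(page_source, "html.parser")
    node = soup.select_one("#results")
    has_data = node is not None and node.has_attr("value")
    records = json.loads(node["value"]) if has_data else []
    df = pd.DataFrame(records, columns=list(COLUMNS)).rename(columns=COLUMNS)
    df.insert(0, "TICKER", ticker.upper())  # keep track of the source REIT
    return df


def fetch_dividends(ticker, timeout=30):
    """Download one REIT page and parse its dividend rows."""
    response = requests.get(BASE_URL.format(ticker=ticker.lower()), timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return parse_dividends(response.text, ticker)


def fetch_many(tickers):
    """Concatenate the dividend tables of several REITs."""
    return pd.concat([fetch_dividends(t) for t in tickers], ignore_index=True)
```

Calling `fetch_many(["urpr11"])` would return one table with a TICKER column; further tickers can be added to the list. Because the JSON holds every row, no page count is involved.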