I'm trying to scrape multiple web pages that share similar HTML. I can already fetch the HTML of each page, and I can manually find the part of the string that holds the information I need. I just don't know how to extract it properly. I think a regex might solve this, but I don't know how to write one.
I'm using Python 3.
This is how I fetch the page's HTML:
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://statusinvest.com.br/fundos-imobiliarios/knri11", headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.content, features="html.parser")
Below is the HTML as a string (str(soup)). I want to extract the list between the two pink brackets. This block always comes right after the line in blue parentheses (the text in green is different on each page).
(Image: part of the page's HTML code I want to extract)
CodePudding user response:
You can use BeautifulSoup to find the correct tag and the json module to parse the values:
import json
import requests
from bs4 import BeautifulSoup
resp = requests.get(
"https://statusinvest.com.br/fundos-imobiliarios/knri11",
headers={"User-Agent": "Mozilla/5.0"},
)
soup = BeautifulSoup(resp.content, "html.parser")
data = json.loads(soup.select_one("#results")["value"])
print(data)
Prints:
[
{
"y": 0,
"m": 0,
"d": 0,
"ad": None,
"ed": "31/10/2022",
"pd": "16/11/2022",
"et": "Rendimento",
"etd": "Rendimento",
"v": 0.91,
"ov": None,
"sv": "0,91000000",
"sov": "-",
"adj": False,
},
{
"y": 0,
"m": 0,
"d": 0,
"ad": None,
"ed": "30/09/2022",
"pd": "17/10/2022",
"et": "Rendimento",
"etd": "Rendimento",
"v": 0.91,
"ov": None,
"sv": "0,91000000",
"sov": "-",
"adj": False,
},
...and so on.
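Since the question is about several pages with the same layout, the lookup above can be wrapped in a small helper. This is only a minimal sketch, assuming every page keeps the JSON in the value attribute of an element with id="results", as the knri11 page does. The name extract_results is hypothetical, and SAMPLE_HTML is made up to mimic that structure so the sketch runs without a network call.

```python
import json

from bs4 import BeautifulSoup


def extract_results(html):
    """Return the list stored as JSON in the value attribute of #results.

    Returns an empty list when the tag or its value attribute is missing.
    """
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one("#results")
    if tag is None or not tag.get("value"):
        return []
    return json.loads(tag["value"])


# Made-up HTML that mimics the assumed page structure (not copied from the site).
SAMPLE_HTML = """
<div>
  <input id="results" type="hidden"
         value='[{"ed": "31/10/2022", "pd": "16/11/2022", "v": 0.91}]'>
</div>
"""

rows = extract_results(SAMPLE_HTML)
print(rows[0]["ed"], rows[0]["v"])

# A page without the tag yields an empty list instead of raising.
print(extract_results("<p>no results tag here</p>"))
```

For the real pages, you would pass resp.content from each requests.get call to extract_results, one URL per page.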
CodePudding user response:
import json
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://statusinvest.com.br/fundos-imobiliarios/knri11", headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.content, features="html.parser")
# The JSON list is stored in the value attribute of the <input id="results"> tag
data = json.loads(soup.find("input", {"id": "results"}).get("value"))
print(data)
To get the first value:
print(data[0]["y"])
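Each item in data is a plain dict, so the other fields are read the same way. Here is a minimal sketch, assuming "ed" and "pd" are day/month/year strings and "v" is a number, as the printed output in the first answer suggests. The two records are copied (shortened) from that output, and parse_date and summarize are hypothetical helper names.

```python
from datetime import datetime

# Two records copied (shortened) from the printed output in the first answer.
data = [
    {"ed": "31/10/2022", "pd": "16/11/2022", "et": "Rendimento", "v": 0.91},
    {"ed": "30/09/2022", "pd": "17/10/2022", "et": "Rendimento", "v": 0.91},
]


def parse_date(text):
    """Convert a 'dd/mm/yyyy' string into a datetime.date."""
    return datetime.strptime(text, "%d/%m/%Y").date()


def summarize(records):
    """Return (ed date, pd date, v) tuples sorted by the 'pd' date, oldest first."""
    rows = [(parse_date(r["ed"]), parse_date(r["pd"]), r["v"]) for r in records]
    return sorted(rows, key=lambda row: row[1])


rows = summarize(data)
for ed_date, pd_date, value in rows:
    print(ed_date.isoformat(), pd_date.isoformat(), value)

# Sum of the "v" field, rounded to avoid float noise in the display.
total = round(sum(value for _, _, value in rows), 2)
print("total v:", total)
```

Converting the strings to date objects makes sorting and filtering by date reliable, which plain string comparison on "dd/mm/yyyy" would not be.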