I started programming not long ago and ran into this problem. I want to collect stock data from this website: https://statusinvest.com.br/acoes/petr4. Apparently the values are rendered with JavaScript, so BeautifulSoup does not pick them up. Any help is appreciated.
(Screenshots in the original post: my soup code, and an example of the information loaded with JavaScript.)
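Since the screenshots are not reproduced here, a typical attempt that runs into this problem (just a guess at what the posted code looks like) would be something like:
from bs4 import BeautifulSoup
import requests

r = requests.get('https://statusinvest.com.br/acoes/petr4')
soup = BeautifulSoup(r.text, 'html.parser')
# JS-hydrated fields such as this one are only a '-' placeholder in the raw HTML
print(soup.select_one('strong[data-item="avg_F"]'))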
CodePudding user response:
Hoping that OP's next questions will contain a minimal, reproducible example, here is one way of getting some data from that page using Requests and BeautifulSoup:
from bs4 import BeautifulSoup as bs
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
r = requests.get('https://statusinvest.com.br/acoes/petr4', headers=headers)
soup = bs(r.text, 'html.parser')

# these values are present in the static HTML, so Requests + BeautifulSoup is enough
valor_atual = soup.select_one('h3:-soup-contains("Valor atual")').find_next('strong').text
min_52_semanas = soup.select_one('h3:-soup-contains("Min. 52 semanas")').find_next('strong').text
print('Valor atual:', valor_atual)
print('Min. 52 semanas:', min_52_semanas)
### and now some values hydrated into the page by JavaScript, fetched directly from an API endpoint:
api_url = 'https://statusinvest.com.br/acao/payoutresult?code=petr4&companyid=408&type=0'
api_headers = {
    'referer': 'https://statusinvest.com.br/acoes/petr4',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
r = requests.get(api_url, headers=api_headers)
print(r.json())
Result in terminal:
Valor atual: 26,54
Min. 52 semanas: 15,85
{'actual': 124.12623323305537, 'avg': 83.32096287339556, 'avgDifference': 48.97359434223362, 'minValue': 26.353309862919502, 'minValueRank': 2019, 'maxValue': 144.51093035368598, 'maxValueRank': 2020, 'actual_F': '124,13%', 'avg_F': '83,32%', 'avgDifference_F': '48,97% acima da média', 'minValue_F': '26,35%', 'minValueRank_F': '2019', 'maxValue_F': '144,51%', 'maxValueRank_F': '2020', 'chart': {'categoryUnique': True, 'category': ['2018', '2019', '2020', '2021', '2022'], 'series': {'percentual': [{'value': 27.189302754606462, 'value_F': '27,19%'}, {'value': 26.353309862919502, 'value_F': '26,35%'}, {'value': 144.51093035368598, 'value_F': '144,51%'}, {'value': 94.42503816271046, 'value_F': '94,43%'}, {'value': 124.12623323305537, 'value_F': '124,13%'}], 'proventos': [{'value': 7009130357.11, 'value_F': 'R$ 7.009.130.357,11', 'valueSmall_F': '7,01 B'}, {'value': 10577427979.68, 'value_F': 'R$ 10.577.427.979,68', 'valueSmall_F': '10,58 B'}, {'value': 10271836929.54, 'value_F': 'R$ 10.271.836.929,54', 'valueSmall_F': '10,27 B'}, {'value': 100721299707.4, 'value_F': 'R$ 100.721.299.707,40', 'valueSmall_F': '100,72 B'}, {'value': 179966901777.61, 'value_F': 'R$ 179.966.901.777,61', 'valueSmall_F': '179,97 B'}], 'lucroLiquido': [{'value': 25779000000.0, 'value_F': 'R$ 25.779.000.000,00', 'valueSmall_F': '25,78 B'}, {'value': 40137000000.0, 'value_F': 'R$ 40.137.000.000,00', 'valueSmall_F': '40,14 B'}, {'value': 7108000000.0, 'value_F': 'R$ 7.108.000.000,00', 'valueSmall_F': '7,11 B'}, {'value': 106668000000.0, 'value_F': 'R$ 106.668.000.000,00', 'valueSmall_F': '106,67 B'}, {'value': 144987000000.0, 'value_F': 'R$ 144.987.000.000,00', 'valueSmall_F': '144,99 B'}]}}}
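If you need the scraped strings as numbers, keep in mind they use Brazilian formatting (comma as decimal separator, dot as thousands separator); a small illustrative helper could convert them:
def br_to_float(s: str) -> float:
    # illustrative helper: '26,54' -> 26.54 and '1.234,56' -> 1234.56
    return float(s.replace('.', '').replace(',', '.'))

print(br_to_float(valor_atual))      # 26.54
print(br_to_float(min_52_semanas))   # 15.85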
BeautifulSoup documentation can be found here: https://beautiful-soup-4.readthedocs.io/en/latest/
CodePudding user response:
This section not only requires JavaScript; it will not load at all until you scroll down to it. You could try to figure out which request and/or bit of JS renders that section and then replicate it with Python, but I think it is easier to use Selenium. I even have this function to make it more convenient to automate some of the simpler/common interactions before scraping the HTML:
#### FIRST PASTE [or DOWNLOAD&IMPORT] FUNCTION DEF from https://pastebin.com/kEC9gPC8 ####
soup = linkToSoup_selenium(
    'https://statusinvest.com.br/acoes/petr4',
    clickFirst='//strong[@data-item="avg_F"]',  # it actually just has to scroll, not click [but I haven't added an option for that yet]
    ecx='//strong[@data-item="avg_F"][text()!="-"]'  # waits till this loads
)
if soup is not None:
    print({
        t.find_previous_sibling().get_text(' ').strip(): t.get_text(' ').strip()
        for t in soup.select('div#payout-section span.title strong.value')
    })
prints
{'MÉDIA': '83,32%', 'ATUAL': '124,13% \n ( 48,97% acima da média )', 'MENOR\xa0VALOR': '26,35% \n ( 2019 )', 'MAIOR\xa0VALOR': '144,51% \n \n( 2020 )'}
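If you would rather not depend on the pastebin helper, a rough plain-Selenium equivalent (a sketch assuming Selenium 4+ and Chrome; the XPaths are the same ones used above) would be:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get('https://statusinvest.com.br/acoes/petr4')
    target = driver.find_element(By.XPATH, '//strong[@data-item="avg_F"]')
    # scrolling the element into view is what triggers the lazy-loaded payout section
    driver.execute_script('arguments[0].scrollIntoView();', target)
    # wait until the "-" placeholder has been replaced by an actual value
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.XPATH, '//strong[@data-item="avg_F"][text()!="-"]'))
    )
    soup = BeautifulSoup(driver.page_source, 'html.parser')
finally:
    driver.quit()
After that, the same soup.select(...) dict comprehension from above works on this soup.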
EDIT: I ended up noticing the API used for fetching the data after all (https://statusinvest.com.br/acao/payoutresult?code=petr4&companyid=408&type=0). You can actually build that URL from the HTML that is available before the JS loading happens:
soup.select_one('#payout-section[data-company][data-code]').attrs
should return
{'id': 'payout-section', 'data-company': '408', 'data-code': 'petr4', 'data-category': '1'}
so the URL can then be formed with:
payout = soup.select_one('#payout-section[data-company][data-code]')
if payout:
    compId, dCode = payout.get('data-company'), payout.get('data-code')
    apiUrl = 'https://statusinvest.com.br/acao'
    apiUrl = f'{apiUrl}/payoutresult?code={dCode}&companyid={compId}&type=0'
[I think the type param selects the time window: 0 for 5 years, 1 for 10 years, and 2 for the maximum window.] Then, using browser-like headers like the ones in the other answer,
requests.get(apiUrl, headers=headers).json()
should return something like
{
"actual": 124.12623323305537,
"avg": 83.32096287339556,
"avgDifference": 48.97359434223362,
"minValue": 26.353309862919502,
"minValueRank": 2019,
"maxValue": 144.51093035368598,
"maxValueRank": 2020,
"actual_F": "124,13%",
"avg_F": "83,32%",
"avgDifference_F": "48,97% acima da m\u00e9dia",
"minValue_F": "26,35%",
"minValueRank_F": "2019",
"maxValue_F": "144,51%",
"maxValueRank_F": "2020",
"chart": {
"categoryUnique": true,
"category": [
"2018",
"2019",
"2020",
"2021",
"2022"
],
"series": {
"percentual": [
{
"value": 27.189302754606462,
"value_F": "27,19%"
},
{
"value": 26.353309862919502,
"value_F": "26,35%"
},
{
"value": 144.51093035368598,
"value_F": "144,51%"
},
{
"value": 94.42503816271046,
"value_F": "94,43%"
},
{
"value": 124.12623323305537,
"value_F": "124,13%"
}
],
"proventos": [
{
"value": 7009130357.11,
"value_F": "R$ 7.009.130.357,11",
"valueSmall_F": "7,01 B"
},
{
"value": 10577427979.68,
"value_F": "R$ 10.577.427.979,68",
"valueSmall_F": "10,58 B"
},
{
"value": 10271836929.54,
"value_F": "R$ 10.271.836.929,54",
"valueSmall_F": "10,27 B"
},
{
"value": 100721299707.4,
"value_F": "R$ 100.721.299.707,40",
"valueSmall_F": "100,72 B"
},
{
"value": 179966901777.61,
"value_F": "R$ 179.966.901.777,61",
"valueSmall_F": "179,97 B"
}
],
"lucroLiquido": [
{
"value": 25779000000.0,
"value_F": "R$ 25.779.000.000,00",
"valueSmall_F": "25,78 B"
},
{
"value": 40137000000.0,
"value_F": "R$ 40.137.000.000,00",
"valueSmall_F": "40,14 B"
},
{
"value": 7108000000.0,
"value_F": "R$ 7.108.000.000,00",
"valueSmall_F": "7,11 B"
},
{
"value": 106668000000.0,
"value_F": "R$ 106.668.000.000,00",
"valueSmall_F": "106,67 B"
},
{
"value": 144987000000.0,
"value_F": "R$ 144.987.000.000,00",
"valueSmall_F": "144,99 B"
}
]
}
}
}
and then you can get the values you want from there. (I think it includes the chart data as well.)
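For example, a minimal sketch of pairing each year with its payout percentage from that response (assuming the structure shown above) could be:
data = requests.get(apiUrl, headers=headers).json()
# pair each year in the chart with its formatted payout percentage
for year, item in zip(data['chart']['category'], data['chart']['series']['percentual']):
    print(year, item['value_F'])
# 2018 27,19% ... 2022 124,13%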