Extract content from a page that renders it with javascript using Beautifulsoup


I started programming not long ago and ran into this problem: I want to collect stock data from https://statusinvest.com.br/acoes/petr4, but apparently the values are rendered with JavaScript and BeautifulSoup does not pick them up. Any help is appreciated.

[Screenshots: my soup code; example of the information loaded with JavaScript]

CodePudding user response:

Hoping that OP's next questions will contain a minimal, reproducible example, here is one way of getting some data from that page using Requests and BeautifulSoup:

from bs4 import BeautifulSoup as bs
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

r = requests.get('https://statusinvest.com.br/acoes/petr4', headers=headers)
soup = bs(r.text, 'html.parser')
valor_atual = soup.select_one('h3:-soup-contains("Valor atual")').find_next('strong').text
min_52_semanas = soup.select_one('h3:-soup-contains("Min. 52 semanas")').find_next('strong').text
print('Valor atual:', valor_atual)
print('Min. 52 semanas:', min_52_semanas)

### and now some values hydrated into the page by JavaScript, from an API endpoint:

api_url = 'https://statusinvest.com.br/acao/payoutresult?code=petr4&companyid=408&type=0'
api_headers = {
    'referer': 'https://statusinvest.com.br/acoes/petr4',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
r = requests.get(api_url, headers=api_headers)
print(r.json())

Result in terminal:

Valor atual: 26,54
Min. 52 semanas: 15,85
{'actual': 124.12623323305537, 'avg': 83.32096287339556, 'avgDifference': 48.97359434223362, 'minValue': 26.353309862919502, 'minValueRank': 2019, 'maxValue': 144.51093035368598, 'maxValueRank': 2020, 'actual_F': '124,13%', 'avg_F': '83,32%', 'avgDifference_F': '48,97% acima da média', 'minValue_F': '26,35%', 'minValueRank_F': '2019', 'maxValue_F': '144,51%', 'maxValueRank_F': '2020', 'chart': {'categoryUnique': True, 'category': ['2018', '2019', '2020', '2021', '2022'], 'series': {'percentual': [{'value': 27.189302754606462, 'value_F': '27,19%'}, {'value': 26.353309862919502, 'value_F': '26,35%'}, {'value': 144.51093035368598, 'value_F': '144,51%'}, {'value': 94.42503816271046, 'value_F': '94,43%'}, {'value': 124.12623323305537, 'value_F': '124,13%'}], 'proventos': [{'value': 7009130357.11, 'value_F': 'R$ 7.009.130.357,11', 'valueSmall_F': '7,01 B'}, {'value': 10577427979.68, 'value_F': 'R$ 10.577.427.979,68', 'valueSmall_F': '10,58 B'}, {'value': 10271836929.54, 'value_F': 'R$ 10.271.836.929,54', 'valueSmall_F': '10,27 B'}, {'value': 100721299707.4, 'value_F': 'R$ 100.721.299.707,40', 'valueSmall_F': '100,72 B'}, {'value': 179966901777.61, 'value_F': 'R$ 179.966.901.777,61', 'valueSmall_F': '179,97 B'}], 'lucroLiquido': [{'value': 25779000000.0, 'value_F': 'R$ 25.779.000.000,00', 'valueSmall_F': '25,78 B'}, {'value': 40137000000.0, 'value_F': 'R$ 40.137.000.000,00', 'valueSmall_F': '40,14 B'}, {'value': 7108000000.0, 'value_F': 'R$ 7.108.000.000,00', 'valueSmall_F': '7,11 B'}, {'value': 106668000000.0, 'value_F': 'R$ 106.668.000.000,00', 'valueSmall_F': '106,67 B'}, {'value': 144987000000.0, 'value_F': 'R$ 144.987.000.000,00', 'valueSmall_F': '144,99 B'}]}}}
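
If you only need a few of those numbers, you can index into the JSON instead of printing the whole dict. A minimal sketch, using only field names visible in the response above (r is the response from the API call in the snippet):

data = r.json()
# formatted headline figures
print('Payout atual:', data['actual_F'])   # '124,13%'
print('Payout médio:', data['avg_F'])      # '83,32%'
# per-year percentages from the chart series
for year, item in zip(data['chart']['category'], data['chart']['series']['percentual']):
    print(year, item['value_F'])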

BeautifulSoup documentation can be found here: https://beautiful-soup-4.readthedocs.io/en/latest/

CodePudding user response:

This section not only requires JS to load, it actually will not load until you scroll to it. You could try to figure out which request and/or bit of JS is used to render that section and then replicate it with Python, but I think it would be easier to use Selenium. I even have this function to make it more convenient to automate some of the simpler/common interactions before scraping the HTML:

#### FIRST PASTE [or DOWNLOAD&IMPORT] FUNCTION DEF from https://pastebin.com/kEC9gPC8 ####
soup = linkToSoup_selenium(
    'https://statusinvest.com.br/acoes/petr4',
    clickFirst='//strong[@data-item="avg_F"]',  # it actually just has to scroll, not click [but I haven't added an option for that yet]
    ecx='//strong[@data-item="avg_F"][text()!="-"]'  # waits till this loads
)
if soup is not None:
    print({
        t.find_previous_sibling().get_text(' ').strip(): t.get_text(' ').strip()
        for t in soup.select('div#payout-section span.title + strong.value')
    })

prints

{'MÉDIA': '83,32%', 'ATUAL': '124,13% \n ( 48,97% acima da média )', 'MENOR\xa0VALOR': '26,35% \n ( 2019 )', 'MAIOR\xa0VALOR': '144,51% \n \n( 2020 )'}
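
If you would rather not depend on the pastebin helper, here is a rough plain-Selenium sketch of the same scroll-and-wait idea (assumes Selenium 4 with a working Chrome driver; the XPath is the one from the call above, the rest is illustrative and not the helper's actual implementation):

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
try:
    driver.get('https://statusinvest.com.br/acoes/petr4')
    target = driver.find_element(By.XPATH, '//strong[@data-item="avg_F"]')
    # scrolling the element into view is what triggers the lazy-loaded payout section
    driver.execute_script('arguments[0].scrollIntoView();', target)
    # wait until the value has been hydrated, i.e. is no longer the "-" placeholder
    WebDriverWait(driver, 15).until(lambda d: target.text.strip() not in ('', '-'))
    soup = bs(driver.page_source, 'html.parser')
finally:
    driver.quit()

From there the same select and dict comprehension as above should work on soup.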

EDIT: I ended up noticing the API used for fetching the data after all (https://statusinvest.com.br/acao/payoutresult?code=petr4&companyid=408&type=0). You can actually form that URL yourself from the HTML that is available before the JS loading happens:

soup.select_one('#payout-section[data-company][data-code]').attrs

should return

{'id': 'payout-section', 'data-company': '408', 'data-code': 'petr4', 'data-category': '1'}

so then the url can be formed with

payout = soup.select_one('#payout-section[data-company][data-code]')
if payout:
    compId, dCode = payout.get('data-company'), payout.get('data-code')
    apiUrl = f'https://statusinvest.com.br/acao'
    apiUrl = f'{apiUrl}/payoutresult?code={dCode}&companyid={compId}&type=0'

[I think the type param is for the time window: 0 for 5 years, 1 for 10 years, and 2 for the max window.] requests.get(apiUrl, headers=headers).json() (reusing the browser-style headers from the first answer) should return something like

{
    "actual": 124.12623323305537,
    "avg": 83.32096287339556,
    "avgDifference": 48.97359434223362,
    "minValue": 26.353309862919502,
    "minValueRank": 2019,
    "maxValue": 144.51093035368598,
    "maxValueRank": 2020,
    "actual_F": "124,13%",
    "avg_F": "83,32%",
    "avgDifference_F": "48,97% acima da m\u00e9dia",
    "minValue_F": "26,35%",
    "minValueRank_F": "2019",
    "maxValue_F": "144,51%",
    "maxValueRank_F": "2020",
    "chart": {
        "categoryUnique": true,
        "category": [
            "2018",
            "2019",
            "2020",
            "2021",
            "2022"
        ],
        "series": {
            "percentual": [
                {
                    "value": 27.189302754606462,
                    "value_F": "27,19%"
                },
                {
                    "value": 26.353309862919502,
                    "value_F": "26,35%"
                },
                {
                    "value": 144.51093035368598,
                    "value_F": "144,51%"
                },
                {
                    "value": 94.42503816271046,
                    "value_F": "94,43%"
                },
                {
                    "value": 124.12623323305537,
                    "value_F": "124,13%"
                }
            ],
            "proventos": [
                {
                    "value": 7009130357.11,
                    "value_F": "R$ 7.009.130.357,11",
                    "valueSmall_F": "7,01 B"
                },
                {
                    "value": 10577427979.68,
                    "value_F": "R$ 10.577.427.979,68",
                    "valueSmall_F": "10,58 B"
                },
                {
                    "value": 10271836929.54,
                    "value_F": "R$ 10.271.836.929,54",
                    "valueSmall_F": "10,27 B"
                },
                {
                    "value": 100721299707.4,
                    "value_F": "R$ 100.721.299.707,40",
                    "valueSmall_F": "100,72 B"
                },
                {
                    "value": 179966901777.61,
                    "value_F": "R$ 179.966.901.777,61",
                    "valueSmall_F": "179,97 B"
                }
            ],
            "lucroLiquido": [
                {
                    "value": 25779000000.0,
                    "value_F": "R$ 25.779.000.000,00",
                    "valueSmall_F": "25,78 B"
                },
                {
                    "value": 40137000000.0,
                    "value_F": "R$ 40.137.000.000,00",
                    "valueSmall_F": "40,14 B"
                },
                {
                    "value": 7108000000.0,
                    "value_F": "R$ 7.108.000.000,00",
                    "valueSmall_F": "7,11 B"
                },
                {
                    "value": 106668000000.0,
                    "value_F": "R$ 106.668.000.000,00",
                    "valueSmall_F": "106,67 B"
                },
                {
                    "value": 144987000000.0,
                    "value_F": "R$ 144.987.000.000,00",
                    "valueSmall_F": "144,99 B"
                }
            ]
        }
    }
}

and then you can get the values you want from there (the chart data is included as well, under the 'chart' key).
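
To pull all three time windows in one go, you could loop over the type values, using the 0 = 5yr / 1 = 10yr / 2 = max guess above (an assumption, not something the site documents). A sketch reusing compId, dCode and the headers dict from the snippets above:

import requests

# assumes compId, dCode and headers are already defined as in the snippets above
base = 'https://statusinvest.com.br/acao/payoutresult'
labels = {0: '5 anos (assumed)', 1: '10 anos (assumed)', 2: 'máximo (assumed)'}
for t, label in labels.items():
    params = {'code': dCode, 'companyid': compId, 'type': t}
    data = requests.get(base, params=params, headers=headers).json()
    print(label, '| média:', data['avg_F'], '| atual:', data['actual_F'])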
