How to wait for full page load using python beautifulsoup


I'm trying to scrape a site using Python with BeautifulSoup, but the site takes a long time to load, so my request finishes before the content I need appears. I would like to know how to wait 5 seconds before retrieving the source code with BeautifulSoup.

My code looks like this:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

url = 'https://www.edocente.com.br/pnld/2020/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

req = Request(url, headers = headers)
response = urlopen(req)
html = response.read()
soup = BeautifulSoup(html, 'html.parser')
soup.findAll('a', class_="btn bold mt-4 px-5")

I can't retrieve the whole source code because the site is slow to load, so my tags aren't found. How can I wait until the site's full source code is available?

I would like to get only the href values of the tags, as below:

<a href="/pnld/2020/obra/companhia-das-ciencias-6-ano-saraiva" >Ver Obra </a>
<a href="/pnld/2020/obra/companhia-das-ciencias-7-ano-saraiva" >Ver Obra </a>
<a href="/pnld/2020/obra/companhia-das-ciencias-8-ano-saraiva" >Ver Obra </a>

I'd like to retrieve:

/pnld/2020/obra/companhia-das-ciencias-6-ano-saraiva
/pnld/2020/obra/companhia-das-ciencias-7-ano-saraiva
/pnld/2020/obra/companhia-das-ciencias-8-ano-saraiva
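
If I already had the complete HTML, I know I could pull the values out like this (a sketch, parsing the three sample tags above):

```python
from bs4 import BeautifulSoup

# The three sample anchor tags shown above
html = """
<a href="/pnld/2020/obra/companhia-das-ciencias-6-ano-saraiva">Ver Obra </a>
<a href="/pnld/2020/obra/companhia-das-ciencias-7-ano-saraiva">Ver Obra </a>
<a href="/pnld/2020/obra/companhia-das-ciencias-8-ano-saraiva">Ver Obra </a>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect the href attribute of every anchor that has one
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```

My problem is that these tags are missing from the HTML I actually receive.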

How can I do it? Thanks.

CodePudding user response:

I guess that this URL (https://www.edocente.com.br/pnld/2020/) serves a dynamic website, meaning the content is rendered by JavaScript after the initial response, so you can't load it fully with requests or urllib.

To load a dynamic website and then hand it to Beautiful Soup, you need a browser to render the page in the background. There are libraries for doing that.

Here is a snippet that loads a dynamic website with Playwright:

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def get_dynamic_soup(url: str) -> BeautifulSoup:
    """Render the page in a headless browser and return the parsed soup."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)  # waits for the page's load event by default
        soup = BeautifulSoup(page.content(), "html.parser")
        browser.close()
        return soup

Install the Python package:

pip install playwright

then install the browser binaries (in your terminal):

playwright install

and you are ready to scrape dynamic websites.
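
To get the links the question asks for, you can combine the helper with the original selector. A sketch; the class string is taken from the question's code and may have changed on the live site:

```python
from bs4 import BeautifulSoup

def extract_links(soup: BeautifulSoup) -> list:
    # The class string comes from the question's original findAll call
    return [a.get("href") for a in soup.find_all("a", class_="btn bold mt-4 px-5")]

# Usage (uncomment with a working Playwright install):
# soup = get_dynamic_soup("https://www.edocente.com.br/pnld/2020/")
# print(extract_links(soup))
```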

CodePudding user response:

You could try to fetch the page asynchronously using aiohttp and asyncio. Pass the URL and the headers as parameters to a ClientSession request; you get back a ClientResponse object, and from it you can read all the information you need.

Install the modules with: pip install cchardet aiodns "aiohttp[speedups]"

import aiohttp
import asyncio
from bs4 import BeautifulSoup

url = 'https://www.edocente.com.br/pnld/2020/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

async def main():
    async with aiohttp.ClientSession() as session:
        # ssl=False disables certificate verification
        async with session.get(url, headers=headers, ssl=False) as response:
            print("Status:", response.status)
            print("Content-type:", response.headers['content-type'])
            html = await response.text()
            soup = BeautifulSoup(html, 'html.parser')
            print(soup.findAll('a', class_="btn bold mt-4 px-5"))

asyncio.run(main())

Output (note the :href in the result is a client-side template binding, so the real links are only filled in by JavaScript):

Status: 200
Content-type: text/html; charset=UTF-8
[<a :href="linkPrefixo   obra.tituloSeo" >Ver {{ (current_edition=='2021-objeto-2') ? 'Coleção' : 'Obra' }} </a>, <a >AGUARDE</a>, <a >AGUARDE</a>, <a >AGUARDE</a>]