I want to download some specifcs files from this page: https://www.gov.br/ans/pt-br/assuntos/consumidor/o-que-o-seu-plano-de-saude-deve-cobrir-1/o-que-e-o-rol-de-procedimentos-e-evento-em-saude
This exact four files: Files Image
How i am supposed to iterate through the page using selenium in a way i can mantain a good programming pratices.
Is there a library better than selenium to do it?
I really just need some clarifying ideas. Thanks in advance
CodePudding user response:
Selenium is not lightweight, it is the last resort. It mimics the browser, so things like event handling (clicking some element, captcha submission, etc.). Also, if you're trying to scrape a page that uses JavaScript ( dynamically generated data that can not be found when you check the source code of the webpage), Selenium can be a good choice.
For any web scraping project, first, search your desired texts in the source code of the web page (press Ctrl U
when you visit the page). If the desired element (texts/links etc.) can be found in the source code then you don't need to use a heavyweight library like selenium.
For this case, the texts you're trying to parse can be found in the source code.
so you can use requests
library and a simple parser library like bs4
For the selectors, check this page - W3Schools Css Selectors
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'}
url = 'https://www.gov.br/ans/pt-br/assuntos/consumidor/o-que-o-seu-plano-de-saude-deve-cobrir-1/o-que-e-o-rol-de-procedimentos-e-evento-em-saude'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
soup.select('div#parent-fieldname-text > h3 ~ p') # select all p element which comes after h3 tag inside div with "parent-fieldname-text" id
Output -
[<p ><a href="http://www.ans.gov.br/component/legislacao/?view=legislacao&task=TextoLei&format=raw&id=NDAzMw==" target="_blank">Resolução Normativa n°465/2021</a></p>,
<p ><a data-tippreview-enabled="true" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/Anexo_I_Rol_2021RN_465.2021_RN473_RN478_RN480_RN513_RN536.pdf" target="_self" title="">Anexo I - Lista completa de procedimentos (.pdf)</a></p>,
<p ><a data-tippreview-enabled="true" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/Anexo_I_Rol_2021RN_465.2021_RN473_RN478_RN480_RN513_RN536.xlsx" target="_self" title="">Anexo I - Lista completa de procedimentos (.xlsx)</a></p>,
<p ><a data-tippreview-enabled="true" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/Anexo_II_DUT_2021_RN_465.2021_tea.br_RN473_RN477_RN478_RN480_RN513_RN536.pdf" target="_self" title="">Anexo II - Diretrizes de utilização (.pdf)</a></p>,
<p ><a data-tippreview-enabled="false" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/Anexo_III_DC_2021_RN_465.2021.v2.pdf" target="_self" title="">Anexo III - Diretrizes clínicas (.pdf)</a></p>,
<p ><a data-tippreview-enabled="false" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/Anexo_IV_PROUT_2021_RN_465.2021.v2.pdf" target="_self" title="">Anexo IV - Protocolo de utilização (.pdf)</a></p>,
<p ><a data-tippreview-enabled="true" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/nota13_geas_ggras_dipro_17012013.pdf" target="_blank" title="">Nota sobre as Terminologias</a><br/> Rol de Procedimentos e Eventos em Saúde, Terminologia Unificada da Saúde Suplementar - TUSS e Classificação Brasileira Hierarquizada de Procedimentos Médicos - CBHPM</p>,
<p ><a data-tippreview-enabled="true" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/CorrelacaoTUSS.2021Rol.2021_RN478_RN480_RN513_FU_RN536_20220506.xlsx" target="_self" title="">Correlação TUSS X Rol<br/></a> Correlação entre o Rol de Procedimentos e Eventos em Saúde e a Terminologia Unificada da Saúde Suplementar – TUSS</p>]
You're looking for the first few elements from this output list