How to split find() result in BeautifulSoup?

I'm a complete beginner in Python (also in programming) and I'm trying to scrape some data from this site (https://www1.dnit.gov.br/editais/consulta/resumo.asp?NUMIDEdital=9109).

I want to create a list with the name of the documents ("Despacho Homologatório", "DOU Resultado de Julgamento" etc) in panel "Arquivos de Licitação". The code I'm using:

from urllib.request import urlopen
from bs4 import BeautifulSoup
page = urlopen("https://www1.dnit.gov.br/editais/consulta/resumo.asp?NUMIDEdital=9109")
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "lxml")
link_panel = soup.find("ul", {"class": "links"})
links = link_panel.find_all('li')

However, each item in the resulting ResultSet contains several tags nested together (the text below is only part of links[0]):

print(links[0])
<li><a href="/anexo/outros/outros_edital0420_22-12_4.pdf" target="_blank"><li><a href="/anexo/outros/outros_edital0420_22-12_4.pdf" target="_blank">Publicação D.O.U. - Resultado de Julgamento PE nº 0420/2022-12</a></li>
<!--  <font color="#FF0000" size="1">(17/11/2022)</font> </font> </td>-->
<li><a href="/anexo/outros/Homologação_edital0420_22-12_0.pdf" target="_blank"><li><a href="/anexo/outros/Homologação_edital0420_22-12_0.pdf" target="_blank">Termo de Homologação - Pregão Eletrônico nº 0420/2022-12</a></li>
<!--  <font color="#FF0000" size="1">(16/11/2022)</font> </font> </td>-->
<li><a href="/anexo/outros/outros_edital0420_22-12_3.pdf" target="_blank"><li><a href="/anexo/outros/outros_edital0420_22-12_3.pdf" target="_blank">Termo de Adjudicação - Pregão Eletrônico nº 0420/2022-12</a></li>
<!--  <font color="#FF0000" size="1">(11/11/2022)</font> </font> </td>-->
<li><a href="/anexo/Ata/Ata_edital0420_22-12_2.pdf" target="_blank"><li><a href="/anexo/Ata/Ata_edital0420_22-12_2.pdf" target="_blank">Ata de Realização do Pregão Eletrônico nº 0420/2022-12</a></li>
<!--  <font color="#FF0000" size="1">(11/11/2022)</font> </font> </td>-->
<li><a href="/anexo/Ata/Ata_edital0420_22-12_0.pdf" target="_blank"><li><a href="/anexo/Ata/Ata_edital0420_22-12_0.pdf" target="_blank">Ata de Realização do Pregão Eletrônico nº 0420/2022-12</a></li>
<!--  <font color="#FF0000" size="1">(11/11/2022)</font> </font> </td>-->
<li><a href="/anexo/Relatório/Relatório_edital0420_22-12_0.pdf" target="_blank"><li><a href="/anexo/Relatório/Relatório_edital0420_22-12_0.pdf" target="_blank">Relatório de Análise da Proposta de Preços e Doc. de Habilitação (Pregoeira) - ETHOS ENGENHARIA</a></li>

How can I use find_all() to find each document separately?

Besides the code above, I've tried calling find_all() on the first result (links[0]), to no avail.

CodePudding user response：

You're almost there, but every other li element is invisible and still contains an a element with an href attribute, so some of these hrefs are duplicated. You could try this:

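import requests
from bs4 import BeautifulSoup

url = "https://www1.dnit.gov.br/editais/consulta/resumo.asp?NUMIDEdital=9109"

# Collect the anchors inside the "links" list (only those that carry an href).
anchors = (
    BeautifulSoup(requests.get(url).content, "lxml")
    .select_one("ul[class='links']")
    .select(".content a[href]")
)
# Keep only document links (.pdf/.zip) and make them absolute.
links = [
    f'https://www1.dnit.gov.br{a["href"]}' for a in anchors
    if a["href"].endswith((".pdf", ".zip"))
]
# set() drops the duplicated hrefs (and does not preserve order).
print("\n".join(set(links)))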

This prints something like the following (a set is unordered, so the order may vary):

https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_6.zip
https://www1.dnit.gov.br/anexo/Ata/Ata_edital0286_22-15_0.pdf
https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_8.pdf
https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_4.zip
https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_10.pdf
https://www1.dnit.gov.br/anexo/Edital/Edital_edital0286_22-15_0.zip
https://www1.dnit.gov.br/anexo/Errata/Errata_edital0286_22-15_0.pdf
https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_1.pdf
https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_9.pdf
https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_3.pdf
https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_2.zip
https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_0.pdf
https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_5.pdf
https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_7.zip
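
Because set() does not keep insertion order, the links above come out shuffled. If the page order matters, one option is to de-duplicate with dict.fromkeys, which keeps the first occurrence of each item. This is only a minimal sketch: the sample list of hrefs is hard-coded for illustration, and the helper name unique_in_order is made up here, not part of any library.

```python
def unique_in_order(items):
    """Return items without duplicates, keeping first-seen order."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


# Hypothetical sample: duplicated hrefs as they might come out of the scrape
sample_hrefs = [
    "/anexo/outros/outros_edital0286_22-15_10.pdf",
    "/anexo/outros/outros_edital0286_22-15_10.pdf",
    "/anexo/Ata/Ata_edital0286_22-15_0.pdf",
    "/anexo/outros/outros_edital0286_22-15_7.zip",
    "/anexo/Ata/Ata_edital0286_22-15_0.pdf",
    "/editais/consulta/lista.asp",  # hypothetical non-document link, filtered out
]

# Keep only document links, then drop repeats without losing the order
file_hrefs = [h for h in sample_hrefs if h.endswith((".pdf", ".zip"))]
unique_links = [f"https://www1.dnit.gov.br{h}" for h in unique_in_order(file_hrefs)]
print("\n".join(unique_links))
```

With real data, the list of hrefs would come from the scraped a tags rather than from a literal.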

CodePudding user response：

> I want to create a list with the name of the documents ("Despacho Homologatório", "DOU Resultado de Julgamento" etc) in panel "Arquivos de Licitação".

You can use a list comprehension with .find_all to get the a tags and filter out the container a tags that wrap other li elements (like the first one), or use .select with CSS selectors.

With List Comprehension

link_panel = soup.find("ul", {"class": "links"})
links = [l.a for l in link_panel.find_all('li') if l.a and not l.a.li]

With .select

# link_panel is no longer necessary
links = soup.select('ul.links li>a:not(:has(li))')

The find_all equivalent would be

links = soup.find_all(
    lambda l: l.name == 'a' and l.parent.name == 'li' 
    and l.find_parent("ul", {"class": "links"}) and not l.find('li')
)

so you might see why I much prefer using CSS selectors.
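
To check that the approaches agree, they can be run against a small piece of markup that imitates the nesting problem. The HTML below is a simplified, hypothetical stand-in for the real page (the file names and "Doc A"/"Doc B" labels are invented), and it uses the built-in "html.parser" so the sketch does not depend on lxml; the real page parsed with lxml may nest slightly differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: an outer <a> wraps the real <li> items
sample_html = """
<ul class="links">
  <li><a href="/anexo/a.pdf" target="_blank">
    <li><a href="/anexo/a.pdf" target="_blank">Doc A</a></li>
    <li><a href="/anexo/b.pdf" target="_blank">Doc B</a></li>
  </a></li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Approach 1: list comprehension, skipping <a> tags that contain an <li>
link_panel = soup.find("ul", {"class": "links"})
by_comprehension = [
    li.a for li in link_panel.find_all("li") if li.a and not li.a.li
]

# Approach 2: CSS selector with the same "no <li> inside" condition
by_selector = soup.select("ul.links li>a:not(:has(li))")

names_a = [a.get_text(strip=True) for a in by_comprehension]
names_b = [a.get_text(strip=True) for a in by_selector]
print(names_a)
print(names_b)
```

Both lists should contain only the inner anchors, since the wrapping a tag is rejected by the "not l.a.li" test in one case and by :not(:has(li)) in the other.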

You can get the names with .get_text

docNames = [l.get_text() for l in links]

and docNames would look something like

['Despacho Homologatório',  'DOU Resultado de Julgamento', 'Termo de Ajudicação',  
 'Ata da sessão publica',  'Proposta de preço adequada CSR',  
 'Documentos de Habilitação CSR',  'Análise da proposta e Doc. de habilitação',  
 'Resposta a 1ª diligencia - CSR',  '1ª Diligência - CSR',  '1ª Errata',  
 'Anexo da 1ª Errata',  'Instrução para acesso ao processo administrativo',  
 'Edital 0286/2022-15',  'DOU aviso de licitação']
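
If the anchors contain nested tags or stray whitespace, plain .get_text() returns that whitespace too. A small sketch of the difference, using an invented anchor (the markup is an assumption for illustration, not copied from the page) and get_text's separator and strip arguments:

```python
from bs4 import BeautifulSoup

# Hypothetical anchor with nested markup and stray whitespace
sample = '<a href="/anexo/x.pdf">\n  Termo de <b>Homologação</b>\n</a>'
a = BeautifulSoup(sample, "html.parser").a

raw_name = a.get_text()                   # keeps newlines and indentation
clean_name = a.get_text(" ", strip=True)  # strips each piece, joins with a space
print(repr(raw_name))
print(repr(clean_name))
```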

or you could get the links with the names as a list of dictionaries:

import urllib.parse

pgUrl = "https://www1.dnit.gov.br/editais/consulta/resumo.asp?NUMIDEdital=9109"
# One dict per document: visible name, file name, and absolute URL
linkDocs = [{
    'name': l.get_text(' ').strip(),
    'filename': l.get('href', '').split('/')[-1],
    'link': urllib.parse.urljoin(pgUrl, l.get('href')) if l.get('href') else ''
} for l in soup.select('ul.links li>a:not(:has(li))')]

and linkDocs looks something like

[{'name': 'Despacho Homologatório', 'filename': 'outros_edital0286_22-15_10.pdf', 'link': 'https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_10.pdf'},
 {'name': 'DOU Resultado de Julgamento', 'filename': 'outros_edital0286_22-15_9.pdf', 'link': 'https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_9.pdf'},
 {'name': 'Termo de Ajudicação', 'filename': 'outros_edital0286_22-15_8.pdf', 'link': 'https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_8.pdf'},
 {'name': 'Ata da sessão publica', 'filename': 'Ata_edital0286_22-15_0.pdf', 'link': 'https://www1.dnit.gov.br/anexo/Ata/Ata_edital0286_22-15_0.pdf'},
 {'name': 'Proposta de preço adequada CSR', 'filename': 'outros_edital0286_22-15_7.zip', 'link': 'https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_7.zip'},
 {'name': 'Documentos de Habilitação CSR', 'filename': 'outros_edital0286_22-15_6.zip', 'link': 'https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_6.zip'},
 {'name': 'Análise da proposta e Doc. de habilitação', 'filename': 'outros_edital0286_22-15_5.pdf', 'link': 'https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_5.pdf'},
 {'name': 'Resposta a 1ª diligencia - CSR', 'filename': 'outros_edital0286_22-15_4.zip', 'link': 'https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_4.zip'},
 {'name': '1ª Diligência - CSR', 'filename': 'outros_edital0286_22-15_3.pdf', 'link': 'https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_3.pdf'},
 {'name': '1ª Errata', 'filename': 'Errata_edital0286_22-15_0.pdf', 'link': 'https://www1.dnit.gov.br/anexo/Errata/Errata_edital0286_22-15_0.pdf'},
 {'name': 'Anexo da 1ª Errata', 'filename': 'outros_edital0286_22-15_2.zip', 'link': 'https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_2.zip'},
 {'name': 'Instrução para acesso ao processo administrativo', 'filename': 'outros_edital0286_22-15_1.pdf', 'link': 'https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_1.pdf'},
 {'name': 'Edital 0286/2022-15', 'filename': 'Edital_edital0286_22-15_0.zip', 'link': 'https://www1.dnit.gov.br/anexo/Edital/Edital_edital0286_22-15_0.zip'},
 {'name': 'DOU aviso de licitação', 'filename': 'outros_edital0286_22-15_0.pdf', 'link': 'https://www1.dnit.gov.br/anexo/outros/outros_edital0286_22-15_0.pdf'}]
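
The reason for using urljoin instead of gluing the domain onto the href is that it resolves each kind of href against the page URL. A short sketch with standard-library calls only; the second and third hrefs are hypothetical examples, not links found on the page:

```python
from urllib.parse import urljoin

pgUrl = "https://www1.dnit.gov.br/editais/consulta/resumo.asp?NUMIDEdital=9109"

# Root-relative href (the form used on the page): resolved against the host
root_relative = urljoin(pgUrl, "/anexo/Ata/Ata_edital0286_22-15_0.pdf")

# Hypothetical page-relative href: resolved against the page's directory
page_relative = urljoin(pgUrl, "arquivo.pdf")

# Hypothetical absolute href: returned unchanged
already_absolute = urljoin(pgUrl, "https://example.com/doc.pdf")

for link in (root_relative, page_relative, already_absolute):
    print(link)
```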

Btw, about

page = urlopen("https://www1.dnit.gov.br/editais/consulta/resumo.asp?NUMIDEdital=9109")
html = page.read().decode("utf-8") 
soup = BeautifulSoup(html, "lxml")

You don't really need the html variable here. Passing the response straight to the constructor, soup = BeautifulSoup(page, "lxml"), should work too, since BeautifulSoup accepts an open file-like object as well as a string.
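
This can be tried without touching the network by standing in an in-memory byte stream for the HTTP response. The sketch below is an illustration under assumptions: io.BytesIO plays the role of the object returned by urlopen, the markup is invented, "html.parser" is used instead of lxml to avoid the extra dependency, and from_encoding="utf-8" is passed so the bytes are not left to encoding detection.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical page content, encoded as the server would send it
html_bytes = (
    '<ul class="links"><li>'
    '<a href="/anexo/x.pdf">Termo de Homologação</a>'
    "</li></ul>"
).encode("utf-8")

# BytesIO has a .read() method, just like the response from urlopen
fake_page = io.BytesIO(html_bytes)

# No manual .read().decode(...) step: the file-like object is passed directly
soup = BeautifulSoup(fake_page, "html.parser", from_encoding="utf-8")
first_name = soup.a.get_text()
print(first_name)
```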