Home > OS >  searching beautiful soup html with variable
searching beautiful soup html with variable

Time:12-04

For each species in a list, I am searching a webpage which all should contain the same text box with dictionary style information <dt> english name </dt> <dd> water shrew </dd> , <dt> status </dt> <dd> endangered </dd> etc. This information I want, as mentioned, is in a text box with a title before it: <h2 id="_02"> COSEWIC assessment aumary</h2>. Here is what it actually looks like. enter image description here

I am ultimately trying to extract the "endangered" string from this box in particular which later I would like to input to a dictionary including the species name etc. For every species that I loop over the URL will be slightly different although the pages should be structured the same way but with information regarding a different species.

Since the answer to "status" and "english name" are going to be different for every species, I can't look up those texts themselves, additionally, I cant use an if-else statement because it's not the only place on the page where the keywords "endangered" or "threatened" show up. So is there a way of selecting only the elements within that text box and then further searching? (also not the only text box on the page). Or of searching by the dt and retrieving the corresponding dd?

For reference: https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/pacific-water-shrew-appraisal-summary-2016.html

Thanks for your time!!!

CodePudding user response:

Assuming I understand you correctly, this should do it:

from bs4 import BeautifulSoup as bs
import requests

url = """https://www.canada.ca/en/environment-climate-change/services/species-risk-public-registry/cosewic-assessments-status-reports/pacific-water-shrew-appraisal-summary-2016.html"""
req = requests.get(url)

soup = bs(req.text, 'html.parser')
sumr = soup.select_one('div:has(> h2:-soup-contains-own("COSEWIC assessment aummary")) div[] .dl-horizontal')
targets = sumr.select('dt:has(strong)')
for target in targets:
    print(target.text.strip(),":", target.find_next('dd').text.strip())  

Output:

Common name : Pacific Water Shrew
Scientific name : Sorex bendirii
Status : Endangered
Reason for designation : This shrew is restricted to British Columbia’s Lower Mainland and adjacent low valleys. It is rare there, associated with freshwater streams and adjacent wet habitats. Urban development, agriculture, and forestry have reduced the amount and quality of habitat. There is an inferred and projected ongoing decline in habitat and subpopulations in much of its range in Canada.
Occurrence : British Columbia
Status history : Designated Threatened in April 1994 and in May 2000. Status re-examined and designated Endangered in April 2006. Status re-examined and confirmed in April 2016.
  • Related