I'm new to Python web scraping. I'm trying to build a script that fetches only the normal (non-bold) text under the bold headings on this page: https://www.state.gov/cuba-restricted-list/list-of-restricted-entities-and-subentities-associated-with-cuba-effective-january-8-2021/
For example, I want only the entries "MINFAR — Ministerio de las Fuerzas Armadas Revolucionarias" and "MININT — Ministerio del Interior" under "Ministries", and likewise for every section through the end of "Additional Subentities of Habaguanex", all stored in a list. I tried the following code, but I can't isolate those normal text values.
Here is my code:
import requests
import re
from bs4 import BeautifulSoup
URL = "https://www.state.gov/cuba-restricted-list/list-of-restricted-entities-and-subentities-associated-with-cuba-effective-january-8-2021/"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "lxml")
content = soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['entry-content'])
print(content)
Any ideas are heartily welcome, friends. Please feel free to share your thoughts. Thank you in advance :)
CodePudding user response:
I looked at the site's HTML to see how it is structured. All the items appear to be wrapped in a div with the class entry-content, as you found yourself.
I also found that all the text is wrapped in <p> tags, and the headers we want to exclude are additionally wrapped in <b> tags inside the <p> tag. This means we can filter out any <p> tag whose content starts with a <b> tag. It is important to filter out only the ones that start with <b>, because some valid entries, such as <p>Gran Hotel Bristol Kempinski <b><i>Effective</i></b><b><i>November 15</i></b><b><i>, 2019</i></b></p>, belong in the list but contain bold tags later in the wrapping <p> tag.
In the script I use p.encode_contents() to get the tag's inner HTML and check whether it starts with a <b> tag. Note that this function returns a bytestring, so it must be compared against another bytestring, written with the b"" prefix.
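To illustrate the bytestring point, here is a minimal sketch that needs no network access. The sample values are made up for illustration and simply stand in for what p.encode_contents() would return for a header paragraph.

```python
# Stand-in for the bytes that encode_contents() returns (illustrative value).
html_bytes = b"<b>Ministries</b>"

# Comparing bytes with bytes works as expected.
bytes_match = html_bytes[:3] == b"<b>"

# Comparing bytes with str is always False in Python 3, so this check
# would silently let every header through.
mixed_match = html_bytes[:3] == "<b>"

# Alternative: decode to str first, then use ordinary string methods.
decoded_match = html_bytes.decode("utf-8").startswith("<b>")

print(bytes_match, mixed_match, decoded_match)
```

If you would rather work with str throughout, BeautifulSoup also offers p.decode_contents(), which returns the inner HTML as a regular string.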
One more thing: the script skips the first two <p> tags, because they belong to the description of the page.
import requests
from bs4 import BeautifulSoup
URL = "https://www.state.gov/cuba-restricted-list/list-of-restricted-entities-and-subentities-associated-with-cuba-effective-january-8-2021/"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "lxml")
content = soup.find_all("div", {"class": "entry-content"})[0]
results = []
for p in content.find_all('p')[2:]:
    # Keep only non-empty paragraphs whose inner HTML does not start with <b>
    if not p.encode_contents()[:3] == b"<b>" and p.text:
        results.append(p.text)
print(results)
This code goes over all <p> tags inside the .entry-content div and checks whether each one starts with a <b> tag. It saves only the text of the ones that don't, and finally prints the list with all the names.
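If you want to try the filter without hitting the site, the same logic can be run against an inline snippet. This is only a sketch: SAMPLE_HTML is made-up markup imitating the structure described above (not the real page), extract_entries is a hypothetical helper name, and it uses the built-in "html.parser" instead of lxml so nothing extra needs to be installed beyond bs4.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating the page structure (assumption, not the real page).
SAMPLE_HTML = """
<div class="entry-content">
<p>Intro paragraph one.</p>
<p>Intro paragraph two.</p>
<p><b>Ministries</b></p>
<p>MINFAR — Ministerio de las Fuerzas Armadas Revolucionarias</p>
<p>MININT — Ministerio del Interior</p>
<p><b>Hotels in Havana</b></p>
<p>Gran Hotel Bristol Kempinski <b><i>Effective November 15, 2019</i></b></p>
<p></p>
</div>
"""


def extract_entries(html, skip=2):
    """Return the text of <p> tags that do not start with a <b> tag.

    `skip` is the number of leading description paragraphs to ignore.
    """
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="entry-content")
    entries = []
    for p in content.find_all("p")[skip:]:
        # encode_contents() returns bytes, so compare against a bytes prefix.
        starts_bold = p.encode_contents().startswith(b"<b>")
        text = p.get_text().strip()
        if not starts_bold and text:
            entries.append(text)
    return entries


print(extract_entries(SAMPLE_HTML))
```

On this sample the two headers and the empty paragraph are dropped, and the hotel entry is kept even though it contains bold text later in the paragraph.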