How to fetch the specific data from website using regular expression?

Time:11-20

I'm new to Python web scraping. I'm trying to build a script that fetches only the normal (non-bold) text under the bold headings from this website: https://www.state.gov/cuba-restricted-list/list-of-restricted-entities-and-subentities-associated-with-cuba-effective-january-8-2021/

For example, I want only the texts MINFAR — Ministerio de las Fuerzas Armadas Revolucionarias and MININT — Ministerio del Interior under the Ministries heading, and likewise for every heading through the last one, Additional Subentities of Habaguanex, all stored in a list. I tried the following code, but I can't isolate the normal-text values.

Here is my code:

import requests
import re
from bs4 import BeautifulSoup

URL = "https://www.state.gov/cuba-restricted-list/list-of-restricted-entities-and-subentities-associated-with-cuba-effective-january-8-2021/"
page = requests.get(URL)

soup = BeautifulSoup(page.text, "lxml")

content = soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['entry-content'])

print(content)

Any ideas are welcome. Please feel free to share your thoughts. Thank you in advance :)

CodePudding user response:

I looked at the site's HTML to see how it is structured. As you found yourself, all the items are wrapped in a div with the class entry-content.
All the text is wrapped in <p> tags, but the headings we want to exclude are additionally wrapped in <b> tags inside the <p> tag. This means we can filter out any <p> tag whose contents start with a <b> tag. It is important to filter out only the ones that start with <b>, because some valid entries, such as <p>Gran Hotel Bristol Kempinski <b><i>Effective</i></b><b><i>November 15</i></b><b><i>, 2019</i></b></p>, are part of the list and contain bold tags only later in the wrapping <p> tag.
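
To make the "starts with <b>" rule concrete, here is a small offline sketch. It is not the script from this answer: it checks the first child node of each <p> instead of the raw HTML, and the sample markup, the heading text, "Example Hotel One", and the helper name starts_with_bold are all made up for illustration. It uses the built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup modeled on the structure described above;
# only the Kempinski line is taken from the page itself.
SAMPLE_HTML = (
    '<div class="entry-content">'
    "<p><b>Hotels in Havana</b></p>"
    "<p>Example Hotel One</p>"
    "<p>Gran Hotel Bristol Kempinski <b><i>Effective</i></b>"
    "<b><i>November 15</i></b><b><i>, 2019</i></b></p>"
    "</div>"
)


def starts_with_bold(p):
    """Return True if the first child node of the <p> element is a <b> tag."""
    first = next(iter(p.contents), None)
    return isinstance(first, Tag) and first.name == "b"


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep paragraphs that do not begin with <b>; bold text later in a
# paragraph does not disqualify it.
entries = [p.get_text() for p in soup.find_all("p") if not starts_with_bold(p)]
print(entries)
```

The heading paragraph is dropped, while the Kempinski entry is kept even though it contains <b> tags further in.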

In the script I use p.encode_contents() to get the inner HTML of each <p> tag and check whether it starts with a <b> tag. Note that this function returns a bytestring, so it must be compared against another bytestring, written with the b"" prefix.
The script also skips the first two <p> tags, because these belong to the description of the page.
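
As a quick check of the bytes-versus-string point, the sketch below runs both encode_contents() and its string counterpart decode_contents() on a one-line, made-up snippet. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Made-up one-paragraph snippet; "html.parser" avoids needing lxml here.
soup = BeautifulSoup("<p><b>Ministries</b></p>", "html.parser")
p = soup.p

inner_bytes = p.encode_contents()  # inner HTML as bytes
inner_text = p.decode_contents()   # inner HTML as str

print(type(inner_bytes).__name__, inner_bytes[:3])
print(type(inner_text).__name__, inner_text[:3])

# Each form has to be compared with a literal of the same type.
is_header_bytes = inner_bytes.startswith(b"<b>")
is_header_text = inner_text.startswith("<b>")

# Mixing the types does not raise by default; it just never matches.
mismatched = inner_bytes[:3] == "<b>"
print(is_header_bytes, is_header_text, mismatched)
```

If you prefer to work with ordinary strings, decode_contents() should let you drop the b"" prefix.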

import requests
from bs4 import BeautifulSoup

URL = "https://www.state.gov/cuba-restricted-list/list-of-restricted-entities-and-subentities-associated-with-cuba-effective-january-8-2021/"
page = requests.get(URL)

soup = BeautifulSoup(page.text, "lxml")

content = soup.find_all("div", {"class": "entry-content"})[0]

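results = []
# Skip the first two <p> tags, which hold the page description.
for p in content.find_all('p')[2:]:
    # Keep non-empty paragraphs whose inner HTML does not begin with <b>.
    if p.encode_contents()[:3] != b"<b>" and p.text:
        results.append(p.text)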

print(results)

This code goes over all <p> tags in the .entry-content div and checks whether each one starts with a <b> tag, saving only the text of those that don't. Finally, it prints the list of all the names.
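
Since the question asks about regular expressions, here is one possible variation, offered as a sketch rather than a tested replacement for the script above. It swaps the fixed three-byte comparison for a compiled pattern that also tolerates leading whitespace and attributes on the <b> tag. The function name extract_entries, the skip parameter, and the sample markup are assumptions made up for this example; I have not checked whether the live page actually needs the extra tolerance.

```python
import re

from bs4 import BeautifulSoup

# Matches inner HTML that begins, after optional whitespace, with a <b> tag
# (with or without attributes). Requiring whitespace or ">" right after "b"
# keeps tags such as <br> from matching.
BOLD_START = re.compile(rb"\s*<b[\s>]")


def extract_entries(html, skip=0):
    """Return the text of non-empty <p> tags that do not start with <b>.

    `skip` is a hypothetical parameter: the number of leading <p> tags
    to ignore (the answer above skips two on the real page).
    """
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="entry-content")
    if content is None:
        return []
    entries = []
    for p in content.find_all("p")[skip:]:
        text = p.get_text().strip()
        if text and not BOLD_START.match(p.encode_contents()):
            entries.append(text)
    return entries


# Made-up sample markup for an offline check.
SAMPLE_HTML = (
    '<div class="entry-content">'
    "<p>Introductory paragraph.</p>"
    "<p> <b>Ministries</b></p>"
    "<p>Example Entity A</p>"
    '<p><b class="x">Hotels</b></p>'
    "<p><br>Example Entity B</p>"
    "<p></p>"
    "</div>"
)

sample_entries = extract_entries(SAMPLE_HTML, skip=1)
print(sample_entries)
```

To try it against the real page, you would pass page.text from the script above and skip=2.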
