Scraping <a> along with <p> using BeautifulSoup

Time:12-06

I am a journalist by profession and learned Python to scrape news articles. Using BeautifulSoup, I can get paragraphs from a news website; however, if a paragraph contains a hyperlink, that line of text is not scraped. Is there any way I can get that line of text too?

`!pip3 install requests
from bs4 import BeautifulSoup as BS
import requests as req
import pandas as pd  # needed for the DataFrame built at the end

url = "https://www.geo.tv/latest/456848-ronaldo-eyes-world-cup-quarters-as-morocco-dare-to-dream"
webpage = req.get(url)
trav = BS(webpage.content, "html.parser")
M = 1
attributes_container = []

for link in trav.find_all('p'):
    
    # PASTE THE CLASS TYPE THAT WE GET
    # FROM THE ABOVE CODE IN THIS
    if(str(type(link.string)) == "<class 'bs4.element.NavigableString'>"
    and len(link.string) > 35):
        x=str(link.string)
        print (x)
        attributes_container.append(x)
        
text_df = pd.DataFrame(attributes_container, columns=["Text"])
text_df

`

In this case, for example, the news article has "Cristiano Ronaldo" as a hyperlink, so that line does not get scraped.

CodePudding user response:

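Simply use .text instead of .string. When a tag has more than one child (for example, plain text plus an <a> tag), .string is None, whereas .text returns all the text under that tag, including the text of nested tags. Also, to keep irrelevant text out of your list, restrict the search to the div that holds the article (i.e., class content-area).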

content_area = trav.find("div", {"class": "content-area"})
attributes_container = []

for link in content_area.find_all('p'):
    text = link.text
    if len(text) > 35:
        print(text)
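        attributes_container.append(text)

# Build the DataFrame shown in the output below
text_df = pd.DataFrame(attributes_container, columns=["Text"])
text_df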

Output:

                                                 Text
0   DOHA: Cristiano Ronaldo will aim to fire Portu...
1   Just two last-eight slots remain to be filled ...
2   Ronaldo was hogging the headlines at the tourn...
3   Following an exit by "mutual agreement" he is ...
4   The 37-year-old superstar forward, who is appe...
5   After scoring a penalty in his team's opening ...
6                                                ....
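
To see why the original loop skipped the paragraph with the link, here is a minimal offline sketch. The HTML string is a made-up stand-in for the article markup (an assumption, not the real page), and the variable names are only illustrative. It compares .string and .text on a paragraph with and without a nested <a> tag:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the article body (not the real page)
html = (
    '<div class="content-area">'
    '<p>Plain paragraph with no nested tags at all.</p>'
    '<p>DOHA: <a href="#">Cristiano Ronaldo</a> will aim to fire Portugal.</p>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
plain, linked = soup.find("div", {"class": "content-area"}).find_all("p")

# A tag with exactly one text child: .string returns that text
print(plain.string)

# A tag with several children (text, <a>, text): .string is None
print(linked.string)

# .text (same as get_text()) joins the text of all descendants
print(linked.text)
```

Because .string is None for the second paragraph, the type check in the question's loop fails and the paragraph is dropped, while .text keeps it along with the link text.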