I am a journalist by profession and learned Python to scrape news articles. Using BeautifulSoup, I can get the
paragraphs from a news website. However, if a paragraph contains a hyperlink, that paragraph is not scraped. Is there any way I can get that text too?
`!pip3 install requests
from bs4 import BeautifulSoup as BS
import requests as req
import pandas as pd  # needed for pd.DataFrame below
import io

url = "https://www.geo.tv/latest/456848-ronaldo-eyes-world-cup-quarters-as-morocco-dare-to-dream"
webpage = req.get(url)
trav = BS(webpage.content, "html.parser")
M = 1
attributes_container = []
for link in trav.find_all('p'):
    # PASTE THE CLASS TYPE THAT WE GET
    # FROM THE ABOVE CODE IN THIS
    if (str(type(link.string)) == "<class 'bs4.element.NavigableString'>"
            and len(link.string) > 35):
        x = str(link.string)
        print(x)
        attributes_container.append(x)
text_df = pd.DataFrame(attributes_container, columns=["Text"])
text_df
`
In this case, for example, the news article has "Cristiano Ronaldo" as a hyperlink, so that line does not get scraped.
CodePudding user response:
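Simply use `.text`. It returns all the text under that tag, including text inside nested tags such as links, whereas `.string` returns `None` when a tag has more than one child. Also, to keep irrelevant text out of your list, restrict the search to the specific `div` that holds the article (i.e., class `content-area`).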
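The difference between `.string` and `.text` is easy to see on a small inline snippet. This is only a sketch: the HTML below is made up for illustration and is not copied from the article page.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; not the real geo.tv page.
html = """
<div class="content-area">
  <p>A plain paragraph with no child tags at all.</p>
  <p>A paragraph that mentions <a href="#">Cristiano Ronaldo</a> as a link.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
plain, linked = soup.find_all("p")

# .string only returns text when the tag has exactly one child node.
print(plain.string)
# Here <p> holds three children (text, <a>, text), so .string is None.
print(linked.string)
# .text (equivalent to get_text()) joins every text node under the tag.
print(linked.text)
```

Applied to the article page, the same idea looks like this: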
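content_area = trav.find("div", {"class": "content-area"})
attributes_container = []
for link in content_area.find_all('p'):
    # .text includes the text of nested tags such as <a>
    text = link.text
    if len(text) > 35:
        print(text)
        attributes_container.append(text)
# Build the DataFrame as in the question (assumes pandas is imported as pd)
text_df = pd.DataFrame(attributes_container, columns=["Text"])
text_df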
Output:
Text
0 DOHA: Cristiano Ronaldo will aim to fire Portu...
1 Just two last-eight slots remain to be filled ...
2 Ronaldo was hogging the headlines at the tourn...
3 Following an exit by "mutual agreement" he is ...
4 The 37-year-old superstar forward, who is appe...
5 After scoring a penalty in his team's opening ...
6 ....
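If you want to reuse this on other pages, the approach above could be wrapped in a small helper. This is a rough sketch, not part of the original answer: the function name `extract_paragraphs`, its parameters, and the sample HTML are all hypothetical, and the fallback to the whole document when the container is missing is an assumption about what you would want.

```python
from bs4 import BeautifulSoup


def extract_paragraphs(html, container_class="content-area", min_len=35):
    """Return paragraph texts longer than min_len characters.

    Hypothetical helper: searches only inside the first <div> with the
    given class, or the whole document if no such <div> exists.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=container_class)
    if container is None:
        container = soup  # assumption: fall back to the full page

    texts = []
    for p in container.find_all("p"):
        # get_text() includes text from nested tags such as <a>;
        # split/join collapses line breaks and repeated spaces.
        text = " ".join(p.get_text().split())
        if len(text) > min_len:
            texts.append(text)
    return texts


# Made-up sample markup, used only to exercise the helper.
sample = """
<div class="content-area">
  <p>Short one.</p>
  <p>This paragraph links to <a href="#">another page</a>
     and is long enough to keep.</p>
</div>
<p>Footer text outside the article body that should be ignored here.</p>
"""

print(extract_paragraphs(sample))
```

For the real page you would pass `webpage.content` instead of `sample`, and the returned list can go straight into `pd.DataFrame(..., columns=["Text"])` as in the question.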