BeautifulSoup: how to extract only the paragraphs from this page?

Time:09-22

This is what I have tried so far, but I am unable to get the exact content of the page (only the paragraphs, no extra text).

import requests
from bs4 import BeautifulSoup

link = 'https://trumpwhitehouse.archives.gov/briefings-statements/remarks-president-trump-farewell-address-nation/'

page = requests.get(link)
soup = BeautifulSoup(page.content,'lxml')
article = soup.find_all('p')
print(article)

CodePudding user response:

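import requests
from bs4 import BeautifulSoup

res = requests.get("https://trumpwhitehouse.archives.gov/briefings-statements/remarks-president-trump-farewell-address-nation/")
soup = BeautifulSoup(res.text, "html.parser")
# Search only inside the article container so paragraphs elsewhere on the page are skipped
data = soup.find("div", class_="page-content").find_all("p")
for d in data:
    print(d.get_text())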

Output:

The White House
THE PRESIDENT: My fellow Americans: Four years ago, we launched a great national effort to rebuild our country, to renew its spirit, and to restore the allegiance of this government to its citizens. In short, we embarked on a mission to make America great again — for all Americans.
....
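
To see why narrowing the search helps: find_all("p") on the whole document also returns paragraphs from the header and footer, while searching inside the container returns only the article text. Here is a minimal offline sketch of that difference. The HTML snippet is made up for illustration and is not the real page's markup; only the "page-content" class name is taken from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption, not the actual HTML)
SAMPLE_HTML = """
<html><body>
  <header><p>Site banner</p></header>
  <div class="page-content">
    <p>First paragraph of the article.</p>
    <p>Second paragraph of the article.</p>
  </div>
  <footer><p>Footer text</p></footer>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Searching the whole document picks up header and footer paragraphs too
all_paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

# Searching inside the container keeps only the article paragraphs
container = soup.find("div", class_="page-content")
article_paragraphs = [p.get_text(strip=True) for p in container.find_all("p")]

print(len(all_paragraphs), "paragraphs in the whole page")
print(len(article_paragraphs), "paragraphs in the container")
```

If the container class is different on another page, soup.find(...) returns None, so it is worth checking for that before calling find_all on the result.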

CodePudding user response:

If you want the text only, try this:

import requests
from bs4 import BeautifulSoup

link = 'https://trumpwhitehouse.archives.gov/briefings-statements/remarks-president-trump-farewell-address-nation/'
page = BeautifulSoup(requests.get(link).content, 'lxml')
# Select only the <p> tags inside the article body and keep their text
paragraphs = [p.get_text() for p in page.select(".page-content__content p")]
print("\n".join(paragraphs))