This is what I have tried so far, but I am unable to get only the exact content of the page (no extra text is wanted):
import requests
from bs4 import BeautifulSoup
link = 'https://trumpwhitehouse.archives.gov/briefings-statements/remarks-president-trump-farewell-address-nation/'
page = requests.get(link)
soup = BeautifulSoup(page.content,'lxml')
article= soup.findAll('p')
print(article)
CodePudding user response:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://trumpwhitehouse.archives.gov/briefings-statements/remarks-president-trump-farewell-address-nation/")
soup = BeautifulSoup(res.text, "html.parser")
# Search only inside the main content container instead of the whole page
data = soup.find("div", class_="page-content").find_all("p")
for d in data:
    print(d.get_text())
Output:
The White House
THE PRESIDENT: My fellow Americans: Four years ago, we launched a great national effort to rebuild our country, to renew its spirit, and to restore the allegiance of this government to its citizens. In short, we embarked on a mission to make America great again — for all Americans.
....
CodePudding user response:
If you want the text only, try this:
import requests
from bs4 import BeautifulSoup

link = 'https://trumpwhitehouse.archives.gov/briefings-statements/remarks-president-trump-farewell-address-nation/'
# Collect the text of every <p> inside the article body container
paragraphs = [
    paragraph.get_text() for paragraph in
    (
        BeautifulSoup(requests.get(link).content, 'lxml')
        .select(".page-content__content p")
    )
]
print("\n".join(paragraphs))
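To see why the choice of selector matters, here is a minimal offline sketch that compares the three approaches from this thread. The HTML string is a made-up, simplified stand-in for the real page (its actual markup is more complex and may differ), and the `texts` helper is hypothetical. It uses the built-in `html.parser` backend so that `lxml` is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; NOT the real page source.
HTML = """
<html><body>
  <header><p>Site navigation</p></header>
  <div class="page-content">
    <p class="page-header__section">The White House</p>
    <div class="page-content__content">
      <p>First paragraph of the speech.</p>
      <p>Second paragraph of the speech.</p>
    </div>
  </div>
  <footer><p>Footer text</p></footer>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")


def texts(tags):
    """Return the stripped text of each tag (illustrative helper)."""
    return [tag.get_text(strip=True) for tag in tags]


# 1. Every <p> on the page: includes header and footer text.
all_paragraphs = texts(soup.find_all("p"))

# 2. Every <p> inside the outer container: drops header/footer,
#    but still picks up the section label above the article.
outer = texts(soup.find("div", class_="page-content").find_all("p"))

# 3. Every <p> inside the inner article container: article text only.
inner = texts(soup.select(".page-content__content p"))

print(all_paragraphs)
print(outer)
print(inner)
```

On this sample, the narrower the container you select, the less surrounding text ends up in the result. On the live page, it is worth checking the class names in the browser's inspector first, since they can change.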