How to get all copyable text from a web page in Python

Time:10-16

I am working on a web scraping project where I need to copy all copyable text from a rendered web page.

I have tried a few methods, such as parsing the HTML for specific keywords, but for some reason they didn't work.

CodePudding user response:

Your question is not the clearest, but here's what I've got:

import bs4, requests

# The URL needs a scheme (http:// or https://), otherwise requests raises MissingSchema
itemurl = "https://www.google.com"
response = requests.get(itemurl, headers={'User-Agent': 'Mozilla/5.0'})
print(response.text)  # Raw HTML of the page (markup included, not just the text)
soup = bs4.BeautifulSoup(response.text, features="html.parser")

# Print the text of every <div> with the class `texts`
print([i.text for i in soup.find_all("div", class_="texts")])

CodePudding user response:

For parsing HTML from static websites, you could use something like BeautifulSoup to get the contents of specific HTML tags and attributes, such as <p>, <div>, <aside>, and so on.
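As a rough illustration of selecting specific tags, here is a minimal sketch that runs on an inline HTML string instead of a live page. The markup, the `texts` class, and the variable names are made up for the example; on a real site you would pass `response.text` and adjust the selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for response.text.
SAMPLE_HTML = """
<html><body>
  <div class="texts"><p>First paragraph.</p></div>
  <div class="menu">Home | About</div>
  <aside>Side note</aside>
  <div class="texts">Second block.</div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Text of every <div> whose class is "texts".
texts_divs = [
    div.get_text(" ", strip=True)
    for div in soup.find_all("div", class_="texts")
]

# Text of every <p> and <aside>, in document order.
p_and_aside = [
    tag.get_text(" ", strip=True)
    for tag in soup.find_all(["p", "aside"])
]

print(texts_divs)   # ['First paragraph.', 'Second block.']
print(p_and_aside)  # ['First paragraph.', 'Side note']
```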

But keep in mind that some websites require JavaScript to load their content. In that case I'd recommend either working with their API directly (look at the HTTP requests the page sends, using the browser's developer tools), or using Selenium, which drives a real web browser and can therefore execute JavaScript before you parse the resulting HTML.
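The Selenium route might look roughly like the sketch below. The function name `rendered_body_text` is hypothetical, and instead of re-parsing the HTML it reads the `<body>` element's `.text`, which Selenium reports as the element's visible text. Pages that keep loading content after the initial load may also need an explicit wait, which is not shown here.

```python
def rendered_body_text(driver, url: str) -> str:
    """Load `url` in a Selenium WebDriver and return the visible text of <body>."""
    driver.get(url)  # the browser loads the page and runs its JavaScript
    # "tag name" is the string value of selenium's By.TAG_NAME locator.
    body = driver.find_element("tag name", "body")
    return body.text


# Usage sketch (assumes `pip install selenium` and a locally installed Chrome):
#
#   from selenium import webdriver
#
#   driver = webdriver.Chrome()
#   try:
#       print(rendered_body_text(driver, "https://www.nytimes.com/"))
#   finally:
#       driver.quit()
```

Passing the driver in as an argument keeps the browser setup (Chrome, Firefox, headless options) separate from the text extraction.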

CodePudding user response:

To get all the human-readable text of the HTML <body>, you can use Beautiful Soup. Select the body and call get_text(' ', strip=True): strip=True trims the whitespace around each string, and the ' ' separator joins the strings with a single space:

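import bs4, requests

response = requests.get('https://www.nytimes.com/', headers={'User-Agent': 'Mozilla/5.0'})
# 'lxml' requires the lxml package; 'html.parser' is the built-in alternative
soup = bs4.BeautifulSoup(response.text, 'lxml')

# Join every text string inside <body> with a single space
soup.body.get_text(' ', strip=True)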

Output

'Continue reading the main story Sections SEARCH Skip to content Skip to site index U.S. International Canada Español 中文 Today’s Paper World U.S. Politics N.Y. Business Opinion Tech Science Health Sports Arts Books Style Food Travel Magazine T Magazine Real Estate Video World U.S. Politics N.Y. Business Opinion Tech Science Health Sports Arts Books Style Food Travel Magazine T Magazine Real Estate Video Key Part of Biden’s Climate Agenda Is Likely to Be Cut From Budget Bill President Biden’s clean electricity program is said to likely drop after Senator Joe Manchin told the White House that he strongly opposes it. White House staffers are now rewriting the legislation without that provision, and are trying to cobble together a mix of other policies to cut emissions. Senator Joe Manchin has told the White House he opposes the clean electricity program, several officials and lobbyists said. T.J. Kirkpatrick for The New York Times Funding Fight Threatens Plan to Pump Billions Into Affordable Housing A voucher program is at risk of being sharply scaled back as the White House seeks to slash its social policy package to appease two centrist senators. F.D.A. Panel Unanimously Recommends Johnson & Johnson Booster Shots But many panel members said they think J. & J. recipients might benefit from the option of a Pfizer or Moderna booster, something the agency is considering. Got the Johnson & Johnson vaccine? Here’s what to know about its boosters. The mayor of Chicago is fighting the city’s largest police union over vaccinations. Catch up on Covid news. Tracking the Coronavirus › United States\xa0› United States\xa0› United States Avg. on Oct. 15 14-day change New\xa0cases 84,245 –23% New\xa0deaths 1,587 –16% U.S. hot spots › Vaccinations › Global hot spots › Other trackers: Choose your own places to track Global vaccinations Alaska Minn. W.V. Mask mandates Hospitals Vaccine development Other trackers: Global vaccinations Alaska Minn. W.V. 
Mask mandates Hospitals Vaccine development Choose your own places to track Stabbing Death of Longtime U.K. Lawmaker Is Declared an Act of Terrorism David Amess, a Conservative member of Parliament, was holding a meeting at the time. He is the second politician killed in an attack in about five years. A vigil was held at Saint Peter’s Catholic Church after the stabbing of David Amess. Dan Kitwood/Getty Images Justice Dept. to Ask Supreme Court to Block Texas’ Near-Total Abortion Ban The Texas law bans abortions when cardiac activity is detected, at around six weeks of pregnancy, and makes no exceptions for rape or incest. Ex-Student to Plead Guilty to Parkland School Shooting He will plead guilty to 17 counts of premeditated murder and 17 counts of attempted murder for one of America’s deadliest school shootings. Adams, in a Rebuke of De Blasio, Commits to Keeping Gifted Education Eric Adams, the likely next mayor of New York City, rejected Mayor Bill de Blasio’s plan to end the gifted and talented program, but didn’t give details. Hilary Swift for The New York Times Netflix Star and Staff Pressure Executive Over Dave Chappelle’s Special In a company meeting on Friday, the co-chief executive Ted Sarandos faced criticism from employees over “The Closer,” which some called transphobic. Dave Chappelle Isn’t Canceled. He Just Likes to Talk About It. With his popularity partly built on courting outrage, it’s no surprise he’s doubling down, our columnist writes. Mathieu Bitton/Netflix Concert Halls Are Back. But Visa Backlogs Are Keeping Musicians Out. Visa delays are causing tumult in the classical music industry, leading to a wave of cancellations just as live performances are finally returning. The Danish conductor Thomas Dausgaard with the Seattle Symphony in 2019. James Holt Billie Eilish’s Secret Weapon Comes Out of the Shadows Finneas has won eight Grammys alongside his sister, Billie Eilish, and worked with some of the biggest stars. 
Now he is arriving as a solo artist. Chantal Anderson for The New York Times Read our Saturday profile: A woman is pushing to improve sex education in Australia — from 10,000 miles away. A patient was suddenly sick and shaking violently. Can you tell what was going on? Opinion Zeynep Tufekci The Unvaccinated May Not Be Who You Think Jill Abramson This Justice Is Taking Over the Supreme Court, and He Won’t Be Alone Michelle Cottle Is It Time for Kyrsten Sinema to Leave the Democratic Party? John McWhorter What I See in the Latest Blackface ‘Scandal’ Paul Krugman The Revolt of the American Worker Stephen Graham Jones You’re Anxious. You’re Afraid. And I Have Just the Solution. David Brooks Scorn and the American Story Kara Swisher Navigating the Dave Chappelle Fracas at Netflix ‘The Ezra Klein Show’ A Crypto Optimist Meets a Crypto Skeptic Attitudes Toward Masks Around the World Peter Coy Don’t Blame Workers for Inflation Roxane Gay Dave Chappelle’s Brittle Ego Jessica Nordell and Yaryna Serkez This Is How Everyday Sexism Could Stop You From Getting That Promotion Jay Caspian Kang How Homeowners’ Associations Get Their Way in California Advertisement Continue reading the main story Your Friday Evening Briefing Here’s what you need to know at the end of the day. Listen to Standout Times Journalism Five articles from around The Times, narrated just for you. Listen to ‘The Daily’ The pandemic’s supply chain crisis was supposed to be over by now. It’s not. Sign Up for the At Home and Away Newsletter Get our best suggestions for how to live a cultured life, wherever you are. More News U.S. Pledges to Pay Family of Those Killed in Botched Kabul Drone Strike The Pentagon offered unspecified amounts to relatives of civilians who died in the attack and agreed to help relocate those who want to move to the U.S. Pete Buttigieg Joins the Parental Leave Debate: ‘This Is Work.’ Some have applauded Mr. 
Buttigieg’s decision to take time off, but others questioned it as the country faces key transportation issues. Pete Buttigieg, the transportation secretary, has been away on family leave after welcoming baby twins to his family. Stefani Reynolds for The New York Times Watch the Launch of NASA’s Lucy Mission to Jupiter’s Trojan Asteroids The elaborate 12-year journey of the robotic spacecraft will offer close encounters with some of the solar system’s least understood objects. Analysis: How the Nobel Peace Prize Laid Bare the Schism in Russia’s Opposition Dmitri Muratov, a new laureate, engages with the Kremlin, while Aleksei Navalny resists compromise. The Kremlin capitalizes on the fault line. Blast at Afghan Mosque Kills 37 as Shiites Are Targeted Again Illinois Democrats’ Map Aims to Grab 2 G.O.P. Seats in Congress Timely Homer Lifts Astros Over Red Sox in A.L.C.S. Game 1 Lev Parnas Trial Testimony Offers Peek at His Place in Trump’s Orbit Rikers Death Pushes Toll in N.Y.C. Jails to 13 This Year Advertisement Continue reading the main story Mental Health Getty Images Do Those Stress-Relieving Drinks Really Work? Illustration by Mike McQuade; Photographs by Getty Images My Mental Health Issues Have a Name: Bruce Getty Images What Do You Like to Do During a Mental Health Day? Richard Chance New Sites Make It Easier to Find a Therapist of Color Tallulah Fontaine How to Recognize and Treat Postpartum Depression Culture and Lifestyle Carmen Mandato/Getty Images Carlos Correa Is OK With Being the Heel The Houston Astros shortstop is a leader on the field and the team’s spokesman off it, even when it comes to discussing the team’s scandalous past. Kalpesh Lathigra for The New York Times Recognition, at Last, After Decades Decolonizing Art Sutapa Biswas is the subject of two major exhibitions in Britain that explore the country’s imperial legacy. 
Mark Sommerfeld for The New York Times The Velvet Underground Meets Its Match in Todd Haynes In the director’s hands, music subjects are as much about their cultural moment as about their sound — a good description of the band led by Lou Reed. Sunset Boulevard/Corbis, via Getty Images Visconti’s Operatic Autopsy of German History, Restored Anew The trilogy of “The Damned,” “Death in Venice” and “Ludwig” is whole again, in editions that freshly reveal their conflicted queerness. Magnet Five Horror Films to Stream Now The month’s picks include a contagion film, an ’80s throwback, an unnerving tale of siblings, a faux documentary and a slow-burn thriller. New York Times Cooking David Malosh for The New York Times Skillet Chicken With Peppers and Green Olives Christopher Simpson for The New York Times Ginger-Dill Salmon David Malosh for The New York Times Mushrooms and Dumplings Linda Xiao for The New York Times Pumpkin Maple Cornbread David Malosh for The New York Times Tofu With Peanut Sauce and Coconut-Lime Rice Advertisement Continue reading the main story Recommendations From Wirecutter Men’s Jeans We Love A great pair of men’s jeans can be the foundation of any outfit. Though the jeans themselves may not draw attention, they often elevate whatever else you wear. Sarah Kobos How to Choose Flatware It’s a surprisingly weighty decision, especially since the average American buys only three sets in a lifetime. The Best Chicken Coop and Accessories Raising chickens is a joy. But anyone considering a small backyard flock should understand the good, the bad and the smelly before committing. Learn More About Wirecutter The Times’s product recommendation service tests thousands of items each year to help you find the best of everything. Play Spelling Bee How many words can you make with 7 letters? The Crossword Get clued in with wordplay, every day. New York Times Games Subscribe for full access to The Crossword, The Mini, Spelling Bee and more. 
Letter Boxed Create words using letters around the square. Tiles Match visual elements and keep your chain going. Vertex Connect the dots to reveal the hidden picture. We’d like your thoughts on the New York Times home page experience. Let us know what you think Site Information Navigation © 2021 The New York Times Company NYTCo Contact Us Accessibility Work with us Advertise T Brand Studio Your Ad Choices Privacy Policy Terms of Service Terms of Sale Site Map Canada International Help Subscriptions'
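The same get_text(' ', strip=True) call can be wrapped in a small helper and tried on an inline HTML string, without a network request. This is only a sketch: the name `copyable_text` and the sample markup are made up, and explicitly removing <script>, <style>, <noscript> and <template> is a precaution so that their contents never end up in the result, whichever parser is used.

```python
from bs4 import BeautifulSoup


def copyable_text(html: str) -> str:
    """Return the human-readable text of an HTML document as one string."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose contents are not shown to the reader.
    for tag in soup(["script", "style", "noscript", "template"]):
        tag.decompose()
    # Fall back to the whole document if there is no <body>.
    root = soup.body or soup
    return root.get_text(" ", strip=True)


# Hypothetical sample page standing in for response.text.
SAMPLE = (
    "<html><head><title>T</title><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><script>var x = 1;</script>"
    "<p>Hello <b>world</b>!</p></body></html>"
)

result = copyable_text(SAMPLE)
print(result)
```

Because the separator is inserted between every pair of strings, inline markup such as <b> can leave a space before punctuation in the result.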