How do I retrieve the text of a webpage without sentences being broken by newlines?

Time:10-24

I would like to retrieve the text from a webpage (my preferred language is Python) without sentences being broken mid-sentence by newlines, as happens here:

and then the community
decided to invest in
public parks for the
benefit of the citizens.

I have tried dumping web pages with lynx and w3m, but both break the sentences into lines.

I just tried using Beautiful Soup's .get_text() method, which should pull out unbroken strings from elements containing text, such as <p> tags, but to my surprise, it still splits sentences across lines. Maybe this is because the newlines are already there in the HTML source, or because the text has tags such as links embedded in it, breaking the flow of text. (I tried it on this webpage.)

I can open the webpage in a browser, select all and copy and paste into a text file, and this preserves the sentences as single lines, but this is not a programmatic solution.

I will try to show GPT-3 an example of the way I would like it to join broken sentences but not lines of code and see if it can replicate the example, but this is fixing the extraction afterwards, rather than before.

What would be a simple, programmatic way to obtain unbroken sentences from the source HTML?

I would really appreciate anybody who can help me with this.

CodePudding user response:

Here ya go:


from bs4 import BeautifulSoup

def cleanMe(html):
    # parse the loaded HTML into a bs4 object
    soup = BeautifulSoup(html, "html.parser")
    # remove all JavaScript and stylesheet code
    for script in soup(["script", "style"]):
        script.extract()
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # split phrases separated by a double space (e.g. multiple headlines) into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines and rejoin the rest with newlines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text

You can optionally add another line that removes the remaining \n, \r and \t characters, following the same pattern. For example:

text = text.replace("\n", " ")

can be added right before the return text line in the function. With that line added, the text in your example renders for me as "and then the community decided to invest in public parks for the benefit of the citizens.", which may also be what you want. Note that this joins the whole page into a single line, not just the lines within each sentence.
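
If you would rather keep one line per paragraph than flatten the whole page, here is a minimal sketch of another approach. It is not Beautiful Soup's API: it uses only the standard library's html.parser, and the names BlockTextExtractor, unbroken_text, BLOCK_TAGS and SKIP_TAGS are made up for this example. It assumes that the tags listed in BLOCK_TAGS mark paragraph boundaries, and it collapses every run of whitespace inside a block to a single space, so inline tags such as links do not break the line.

```python
import re
from html.parser import HTMLParser

# Tags treated as paragraph boundaries (an assumption; extend as needed).
BLOCK_TAGS = {"p", "div", "li", "br", "tr", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose contents are dropped entirely.
SKIP_TAGS = {"script", "style"}


class BlockTextExtractor(HTMLParser):
    """Collect text so that each block-level element becomes one line."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []     # finished lines, one per block
        self.current = []    # text pieces of the block being read
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def _flush(self):
        # Collapse every whitespace run (including \n, \r, \t) to one space.
        line = re.sub(r"\s+", " ", "".join(self.current)).strip()
        if line:
            self.blocks.append(line)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Inline tags (<a>, <em>, ...) are ignored, so their text simply
        # continues the current block.
        if not self.skip_depth:
            self.current.append(data)


def unbroken_text(html):
    """Return the page text with one line per block-level element."""
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any text left after the last tag
    return "\n".join(parser.blocks)
```

Calling unbroken_text(html) on the page source should give each paragraph as a single line. One caveat with this sketch: it collapses whitespace everywhere, so code inside <pre> tags would be joined up as well; if you need to keep code lines intact, those elements would need separate handling.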
