Beautiful Soup - Strip Page Content for NLP


I'm creating a news parser that summarizes news from different sites and generates keywords from the news content. Most news sources wrap their content in an article tag, so I'm extracting that tag from each site to get the content.

The problem is that Beautiful Soup returns the raw HTML inside the article tag, which sometimes contains images, links and tags like <b>. My question is: is there a simple way to get the written content of the page as a user sees it, ignoring everything that isn't text? The only approach I have is looping through every tag inside the article and checking its inner HTML for text content. The reasons I haven't already done that are:

  • there may be multiple tags inside tags which I'd need to parse;
  • there are tags which I'd need to ignore, such as script tags, which the browser doesn't display;
  • there may be a built-in way to do this in the Beautiful Soup library or another HTML-focused library.

For example, the following p tag

<p>
    hello <b>world</b> </br> <img src="world.png">. fine <a href="#"> day </a> isn't it?
</p>

would become

hello world. fine day isn't it?

So, is there a better way to extract the page's text content using Beautiful Soup or another HTML parsing library? Note: I don't care about rendering JS; script tags can be ignored.
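
For reference, here is a rough sketch of the manual walk I have in mind; the set of tags to skip is just a guess at what a browser doesn't display:

from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Tags the browser never renders as visible text (my guess at a reasonable set).
IGNORED = {'script', 'style', 'noscript'}

def visible_text(node):
    """Recursively collect the text a reader would actually see."""
    parts = []
    for child in node.children:
        if isinstance(child, Comment):
            continue  # HTML comments are never displayed
        if isinstance(child, NavigableString):
            parts.append(str(child))
        elif isinstance(child, Tag) and child.name not in IGNORED:
            parts.append(visible_text(child))
    # Collapse the whitespace gaps left behind by images, line breaks, etc.
    return ' '.join(''.join(parts).split())

html = "<article><p>hello <b>world</b> <img src='world.png'>. fine <a href='#'> day </a> isn't it?</p></article>"
soup = BeautifulSoup(html, 'html.parser')
print(visible_text(soup.article))  # hello world . fine day isn't it?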

CodePudding user response:

Use getText() to get only the 'text':

p = soup.find('p')
print(p.getText())

    hello world  . fine  day  isn't it?


To remove leading/trailing whitespace, add a strip():

print(p.getText().strip())
hello world  . fine  day  isn't it?

The extra space between world and the . is a leftover from the image. If you're sure every image will be after a space, you could technically remove those.
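
If those leftover spaces matter for the NLP step, one option (my addition, not something getText() does by itself) is to collapse runs of whitespace with a regex afterwards. A self-contained version using the <p> from the question:

import re

from bs4 import BeautifulSoup

html = """<p>
    hello <b>world</b> </br> <img src="world.png">. fine <a href="#"> day </a> isn't it?
</p>"""
soup = BeautifulSoup(html, 'html.parser')

text = soup.find('p').getText().strip()
text = re.sub(r'\s+', ' ', text)  # collapse the gaps left by the image and line break
print(text)  # hello world . fine day isn't it?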

CodePudding user response:

Try .get_text(' ', strip=True). The space separator keeps words apart, and strip=True drops the surrounding whitespace:

txt = soup.select_one('p').get_text(' ', strip=True).replace('.', '')
print(txt)

Output:

hello world  fine day isn't it?
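
For the whole-article case in the question, the same call can be combined with decompose() to drop non-visible tags first. A sketch (the URL is a placeholder, and the script/style/noscript set is an assumption about what to strip):

import requests  # only used to fetch a page; any HTML string works
from bs4 import BeautifulSoup

html = requests.get('https://example.com/some-news-story').text  # placeholder URL
soup = BeautifulSoup(html, 'html.parser')

article = soup.find('article')
if article is not None:
    # Remove tags that never show up as visible text before extracting.
    for tag in article(['script', 'style', 'noscript']):
        tag.decompose()
    print(article.get_text(' ', strip=True))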