Extract sentence from HTML using python

Time:12-20

I have extracted a component of interest from an HTML file using Python (BeautifulSoup). My code:

from bs4 import BeautifulSoup

# Read the saved page; the 'lxml' parser requires the lxml package to be installed.
with open("/home/kospsych/Desktop/projects/dark_web/file", "r") as HTMLFile:
    index = HTMLFile.read()

S = BeautifulSoup(index, 'lxml')

# Select the first element with class "inner".
Tag = S.select_one('.inner')

print(Tag)

This prints:

<div class="inner" id="msg_550811">Does anyone know if it takes a set length of time to be given verified vendor status by sending a signed PGP message to the admin (in stead of paying the vendor bond)?<br/><br/>I'm regularly on Agora but I want to join the Abraxas club as well.<br/><br/>Mindful-Shaman</div>

and its type is:

<class 'bs4.element.Tag'>

I would like to remove the div and br tags and end up with a plain string containing the text above. How can this be done efficiently?

Answer:

You can use the .text property or the .get_text() method:

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    """<div class="inner" id="msg_550811">Does anyone know if it takes a set length of time to be given verified vendor status by sending a signed PGP message to the admin (in stead of paying the vendor bond)?<br/><br/>I'm regularly on Agora but I want to join the Abraxas club as well.<br/><br/>Mindful-Shaman</div>""",
    "html.parser",
)

Tag = soup.select_one(".inner")
print(Tag.get_text(strip=True, separator=" "))

Prints:

Does anyone know if it takes a set length of time to be given verified vendor status by sending a signed PGP message to the admin (in stead of paying the vendor bond)? I'm regularly on Agora but I want to join the Abraxas club as well. Mindful-Shaman
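
To make the difference between the options clearer, here is a minimal sketch comparing .text, .get_text() with a separator, and .stripped_strings. The sample markup is a shortened stand-in for the post in the question, and it assumes the div carries class="inner" so that the selector matches. The helper extract_text is a hypothetical name used only for illustration; it is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (class="inner" is assumed).
html_doc = (
    '<div class="inner" id="msg_1">First sentence.<br/><br/>'
    "Second sentence.<br/><br/>Signature</div>"
)

soup = BeautifulSoup(html_doc, "html.parser")
tag = soup.select_one(".inner")

# .text joins the text nodes with nothing in between.
plain = tag.text

# A separator inserts a space between text fragments; strip=True trims
# each fragment and drops the empty ones.
spaced = tag.get_text(separator=" ", strip=True)

# stripped_strings yields each non-empty fragment separately.
parts = list(tag.stripped_strings)


def extract_text(markup, selector=".inner"):
    """Hypothetical helper: return the element's text, or None if nothing matches."""
    found = BeautifulSoup(markup, "html.parser").select_one(selector)
    if found is None:
        # select_one returns None when the selector matches nothing.
        return None
    return found.get_text(separator=" ", strip=True)


print(plain)
print(spaced)
print(parts)
print(extract_text("<p>no match here</p>"))
```

Checking for None before calling get_text avoids an AttributeError when the selector matches nothing, for example when the class name differs from what was expected.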