Extracting Messy, Untagged HTML text using Beautiful Soup in Python

Time:04-18

I am trying to parse a webpage that contains a lot of untagged text using BeautifulSoup. As the example below shows, the pattern is a name in <b> tags, followed by several lines of untagged text interleaved with line breaks. Each "group" of text ends with an <hr> tag, which marks the beginning of the next section.

I would like to put this information in a CSV file for the time being. My current plan is to use soup.find_all("b") to get all of the names. For each name retrieved, I would manually cycle through its siblings using something like next_sibling, adding the lines of text to my CSV file and ignoring the line breaks. On reaching an <hr> element, I would move to the next "name" from the soup.find_all("b") results and advance the CSV to the next row.

I am not sure whether this line of thinking will actually work. For one, I haven't yet figured out how to select each line of untagged text individually. The examples I have been able to find select all untagged text on a page at once, which doesn't do me much good. The other issue is that I am not sure my suggested method of "navigating" the page contents is logically correct. In the experiments I've done, calling next_sibling on an element returned by soup.find_all("b") gives None, and I haven't figured out why yet either.

I admittedly don't have much experience with Beautiful Soup and it has been a minute since I have worked with HTML in general. Looking forward to learning more about this!

<div class="maincontent">
    <b>Thing 1</b>
    <br>
    Text About Thing 1
    <br>
    More Text About Thing 1
    <br>
    Even More Text About Thing 1
    <br>
    Even MORE Text About Thing 1
    <br>
    <hr>
    <b>Thing 2</b>
    <br>
    Text About Thing 2
    <br>
    More Text About Thing 2
    <br>
    Even More Text About Thing 2
    <br>
    Even MORE Text About Thing 2
    <br>
    <hr>
    <b>Thing 3</b>
    <br>
    Text About Thing 3
    <br>
    More Text About Thing 3
    <br>
    Even More Text About Thing 3
    <br>
    Even MORE Text About Thing 3
    <br>
    <hr>
</div>

CodePudding user response:

Another version: you can replace every <hr> in the main section with a separator of your choice and then use itertools.groupby to get the separate blocks of text, for example:

from itertools import groupby
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser") # <-- html_doc is your HTML from the question

maincontent = soup.select_one(".maincontent")
for hr in maincontent.select("hr"):
    hr.replace_with("-" * 80)

text = maincontent.get_text(strip=True, separator="\n")

for is_separator, g in groupby(text.splitlines(), lambda k: k == "-" * 80):
    if not is_separator:
        print(" ".join(g))  # <-- or store it to file instead printing to screen

Prints:

Thing 1 Text About Thing 1 More Text About Thing 1 Even More Text About Thing 1 Even MORE Text About Thing 1
Thing 2 Text About Thing 2 More Text About Thing 2 Even More Text About Thing 2 Even MORE Text About Thing 2
Thing 3 Text About Thing 3 More Text About Thing 3 Even More Text About Thing 3 Even MORE Text About Thing 3
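Since the question asks for a CSV file, the same grouping can feed csv.writer directly, one row per group. The following is only a sketch under a few assumptions: the shortened html_doc stands in for the real page, extract_rows and SEPARATOR are names made up for this example, and the rows go to an in-memory buffer rather than a real file.

```python
import csv
import io
from itertools import groupby

from bs4 import BeautifulSoup

# Shortened stand-in for the HTML from the question (an assumption for this sketch).
html_doc = """
<div class="maincontent">
    <b>Thing 1</b>
    <br>
    Text About Thing 1
    <br>
    More Text About Thing 1
    <br>
    <hr>
    <b>Thing 2</b>
    <br>
    Text About Thing 2
    <br>
    <hr>
</div>
"""

SEPARATOR = "-" * 80  # hypothetical marker; any string absent from the page works


def extract_rows(html):
    """Return one list of strings per <hr>-delimited group."""
    soup = BeautifulSoup(html, "html.parser")
    maincontent = soup.select_one(".maincontent")
    # Swap each <hr> for the marker so it survives get_text().
    for hr in maincontent.select("hr"):
        hr.replace_with(SEPARATOR)
    lines = maincontent.get_text(strip=True, separator="\n").splitlines()
    return [
        list(group)
        for is_separator, group in groupby(lines, lambda line: line == SEPARATOR)
        if not is_separator
    ]


rows = extract_rows(html_doc)

# In-memory buffer for the demo; for a real file, use
# open("things.csv", "w", newline="") instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Each row starts with the name from the <b> tag, followed by that group's lines of text, so groups of different lengths simply produce rows with different numbers of columns.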

Or just use normal str.split:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")  # <-- html_doc is your HTML from the question

maincontent = soup.select_one(".maincontent")
for hr in maincontent.select("hr"):
    hr.replace_with("-" * 80)

text = maincontent.get_text(strip=True, separator="\n")

for group in map(str.strip, text.split("-" * 80)):
    if group:
        print(group)
        print()

Prints 3 blocks:

Thing 1
Text About Thing 1
More Text About Thing 1
Even More Text About Thing 1
Even MORE Text About Thing 1

Thing 2
Text About Thing 2
More Text About Thing 2
Even More Text About Thing 2
Even MORE Text About Thing 2

Thing 3
Text About Thing 3
More Text About Thing 3
Even More Text About Thing 3
Even MORE Text About Thing 3

CodePudding user response:

The approach you describe sounds workable, and from my point of view you had almost reached your goal. Because the expected output is not clear from the question, this just points in a direction:

from bs4 import BeautifulSoup
import csv

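soup = BeautifulSoup(html, "html.parser")  # html is the markup from the question
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('myfile.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')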

    for b in soup.select('b'):
        d = [b.text]
        for t in b.next_siblings:
            if t.name == 'b':
                break
            if not t.name and t.strip() != '':
                d.append(t.strip())
        writer.writerow(d)

Output

Thing 1,Text About Thing 1,More Text About Thing 1,Even More Text About Thing 1,Even MORE Text About Thing 1
Thing 2,Text About Thing 2,More Text About Thing 2,Even More Text About Thing 2,Even MORE Text About Thing 2
Thing 3,Text About Thing 3,More Text About Thing 3,Even More Text About Thing 3,Even MORE Text About Thing 3
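The loop above stops at the next <b>. The plan in the question was to stop at <hr> instead, and it also ran into next_sibling appearing to return nothing. A minimal sketch of that variant follows, assuming a shortened copy of the question's HTML. The helper name collect_group is made up for this example. Note that the node right after a <b> here is a whitespace-only string (the newline and indentation before <br>), which may be why the sibling looked empty in those experiments.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Shortened stand-in for the HTML from the question (an assumption for this sketch).
html_doc = """
<div class="maincontent">
    <b>Thing 1</b>
    <br>
    Text About Thing 1
    <br>
    More Text About Thing 1
    <br>
    <hr>
    <b>Thing 2</b>
    <br>
    Text About Thing 2
    <br>
    <hr>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
first_b = soup.find("b")

# The immediate sibling is a text node holding only whitespace.
print(repr(first_b.next_sibling))


def collect_group(b_tag):
    """Collect the non-blank text nodes after b_tag, up to the next <hr>."""
    lines = []
    for node in b_tag.next_siblings:
        if isinstance(node, Tag) and node.name == "hr":
            break  # end of this group
        if isinstance(node, NavigableString) and node.strip():
            lines.append(node.strip())
    return lines


groups = {b.get_text(strip=True): collect_group(b) for b in soup.find_all("b")}
for name, lines in groups.items():
    print(name, lines)
```

Iterating over next_siblings rather than calling next_sibling repeatedly keeps the loop short, and filtering on NavigableString skips the <br> tags without naming them.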