Scrape values inside span class webpage with beautifulsoup python

Time:03-13

Hello everyone. I have a webpage I'm trying to scrape, and the page has a huge number of span elements, most of which hold useless information. Below is the section of span data that I need, but I can't simply call find_all on span because there are hundreds of other spans I don't need.

            <div >
                <p>
                  <span >File Number</span><br>
                  A-21-897274
                </p>
            </div>
            <div >
              <p>
                <span >Location</span><br>
                Ohio
              </p>
            </div>
              <div >
                <p>
                  <span >Date</span><br>
                  07/01/2022
                </p>
              </div>
          </div>

I need the span titles:
File Number, Location, Date

and then the values that match:
"A-21-897274", "Ohio", "07/01/2022"

I need this printed out so I can build a pandas DataFrame, but I can't seem to get only the specific spans printed with their values.

What I've tried:

import bs4
from bs4 import BeautifulSoup

soup = BeautifulSoup(..., 'lxml')
for title_tag in soup.find_all('span', class_='text-muted'):
    # get the last sibling
    *_, value_tag = title_tag.next_siblings

    title = title_tag.text.strip()

    if isinstance(value_tag, bs4.element.Tag):
        value = value_tag.text.strip()
    else:  # it's a navigable string element
        value = value_tag.strip()

    print(title, value)

output:

File Number "A-21-897274"
Location "Ohio"
Operations_Manager "Joanna"
Date "07/01/2022"
Type "Transfer"
Status "Open"
ETC "ETC"
ETC "ETC"

This prints everything I need, but it also prints hundreds of other values I don't want or need.

CodePudding user response:

You can pass a function to soup.find_all to select only the wanted elements, then use .find_next_sibling() to get the value that follows each one. For example:

from bs4 import BeautifulSoup


html_doc = """
<div >
    <p>
      <span >File Number</span><br>
      A-21-897274
    </p>
</div>
<div >
  <p>
    <span >Location</span><br>
    Ohio
  </p>
</div>
  <div >
    <p>
      <span >Date</span><br>
      07/01/2022
    </p>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def correct_tag(tag):
    return tag.name == "span" and tag.get_text(strip=True) in {
        "File Number",
        "Location",
        "Date",
    }


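for t in soup.find_all(correct_tag):
    # string=True matches the next text node, skipping the <br> tag
    # (string= is the current name for the older text= argument)
    print(f"{t.text}: {t.find_next_sibling(string=True).strip()}")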

Prints:

File Number: A-21-897274
Location: Ohio
Date: 07/01/2022
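
Since the goal in the question is a pandas DataFrame, the same filtering idea can collect the pairs into a dict instead of printing them. The following is only a minimal sketch: the helper name extract_record, the WANTED set, and the trimmed sample markup (including the class="text-muted" attribute, which is inferred from the find_all call in the question) are assumptions made for illustration, not part of the original page.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Sample markup; class="text-muted" is assumed from the question's find_all call.
html_doc = """
<div>
  <p><span class="text-muted">File Number</span><br>
  A-21-897274</p>
</div>
<div>
  <p><span class="text-muted">Operations_Manager</span><br>
  Joanna</p>
</div>
<div>
  <p><span class="text-muted">Location</span><br>
  Ohio</p>
</div>
<div>
  <p><span class="text-muted">Date</span><br>
  07/01/2022</p>
</div>
"""

# Hypothetical set of span titles to keep; everything else is skipped.
WANTED = {"File Number", "Location", "Date"}


def extract_record(html, wanted=WANTED):
    """Return {title: value} for the wanted span titles found in html."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for span in soup.find_all("span"):
        title = span.get_text(strip=True)
        if title not in wanted:
            continue
        # The value is the first text node after the span (past the <br>).
        value = span.find_next_sibling(string=True)
        record[title] = value.strip() if value else None
    return record


record = extract_record(html_doc)
df = pd.DataFrame([record])  # one row, one column per title
print(df)
```

If the page holds several such records, calling a helper like this once per record container and passing the resulting list of dicts to pd.DataFrame should give one row per record.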