How to select the previous tag when re finds the str

I have an HTML file like this (more than 100 records):

<div >
    <h3 >John Smith</h3>
        <span >Center - VAR - Employee I</span>
</div>

<div >
    <h3 >Jenna Smith</h3>
        <span >West - VAR - Employee I</span>
</div>

<div >
    <h3 >Jordan Smith</h3>
        <span >East - VAR - Employee II</span>
</div>

I need to extract the names only if they are Employee I, which makes it challenging. How can I select the tags that have Employee I in the next tag? Or should I use a different method? Is it even possible to use a condition in this case?

import re

with open("file.html", 'r') as f:
    html = f.read()
    print(re.search(r'\bEmployee I\b', html).group(0))

How can I tell it to go back and read the previous tag?

CodePudding user response：

import re
from bs4 import BeautifulSoup

with open('inputfile.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')

names = [span.parent.find('h3').string 
         for span in 
         soup.find_all('span', 
                       class_='light-text', 
                       string=re.compile('Employee I$'))
        ]
print(names)

gives

['John Smith', 'Jenna Smith']

I've formatted the list comprehension over several lines for clarity, so that it is easier to see where to adjust things for other use cases. Of course, a normal for-loop that appends to a list also works fine; I just like list comprehensions.
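
For comparison, here is a sketch of the same lookup written as a plain for-loop. It assumes the three sample records from the question, embedded as a string so the snippet runs on its own. The class_ filter is left out because the sample markup shows no class attributes, and the None check is an extra precaution that the list comprehension does not have.

```python
import re
from bs4 import BeautifulSoup

# Sample records copied from the question (attributes omitted, as in the post).
html = '''
<div>
    <h3>John Smith</h3>
    <span>Center - VAR - Employee I</span>
</div>
<div>
    <h3>Jenna Smith</h3>
    <span>West - VAR - Employee I</span>
</div>
<div>
    <h3>Jordan Smith</h3>
    <span>East - VAR - Employee II</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

names = []
# Visit every span whose text ends in "Employee I".
for span in soup.find_all('span', string=re.compile(r'Employee I$')):
    # Step up to the enclosing div, then down to its h3.
    heading = span.parent.find('h3')
    if heading is not None:  # skip records that have no h3
        names.append(str(heading.string))

print(names)
```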

The re.compile('Employee I$') is necessary to avoid matching on 'Employee II'. The class_ argument is an extra, and may not be needed.
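
To see why the anchor matters, the patterns can be tried on the three span texts from the question without any HTML parsing. This is only an illustrative sketch; the variable names are made up for the example. It also shows that the \b word-boundary pattern from the question rejects 'Employee II' as well.

```python
import re

# The three span texts from the sample HTML.
labels = [
    'Center - VAR - Employee I',
    'West - VAR - Employee I',
    'East - VAR - Employee II',
]

unanchored = re.compile(r'Employee I')     # plain substring match
anchored = re.compile(r'Employee I$')      # must be at the end of the text
bounded = re.compile(r'\bEmployee I\b')    # must be followed by a word boundary

unanchored_hits = [bool(unanchored.search(s)) for s in labels]
anchored_hits = [bool(anchored.search(s)) for s in labels]
bounded_hits = [bool(bounded.search(s)) for s in labels]

print(unanchored_hits)  # the unanchored pattern also matches 'Employee II'
print(anchored_hits)
print(bounded_hits)
```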

The rest is nearly self-explanatory, especially with the BeautifulSoup documentation next to it.

Note that the string argument used to be called text, in case you're using an older version of BeautifulSoup (before 4.4.0).

CodePudding user response：

from bs4 import BeautifulSoup

test = '''<div >
        <h3 >John Smith</h3>
                <span >Center - VAR - Employee I</span>
        </div>

        <div >
            <h3 >Jenna Smith</h3>
                <span >West - VAR - Employee I</span>
        </div>

        <div >
            <h3 >Jordan Smith</h3>
                <span >East - VAR - Employee II</span>
        </div>'''

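# Name the parser explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup(test, 'html.parser')
for person in soup.find_all('div'):
    name = person.find('h3').text
    # The span text looks like "Center - VAR - Employee I"; take the third field.
    employee_nb = person.find('span').text.split('-')[2].strip()
    if employee_nb == "Employee I":
        print(name)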

CodePudding user response：

You could also use CSS selectors to select your elements more specifically.

As of version 4.7.0, Beautiful Soup supports most CSS4 selectors via the SoupSieve project. If you installed Beautiful Soup through pip, SoupSieve was installed at the same time, so you don’t have to do anything extra.

Example

from bs4 import BeautifulSoup

html = '''
<div >
    <h3 >John Smith</h3>
        <span >Center - VAR - Employee I</span>
</div>

<div >
    <h3 >Jenna Smith</h3>
        <span >West - VAR - Employee I</span>
</div>

<div >
    <h3 >Jordan Smith</h3>
        <span >East - VAR - Employee II</span>
</div>
'''
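soup = BeautifulSoup(html, 'html.parser')

# "+" is the adjacent-sibling combinator: keep each h3 whose next sibling contains "Employee".
print([e.text for e in soup.select('h3:has(+ :-soup-contains("Employee"))')])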

Output

['John Smith', 'Jenna Smith', 'Jordan Smith']
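
The output above still includes Jordan Smith, because :-soup-contains does a substring match and so cannot tell 'Employee I' from 'Employee II'. One possible way to narrow the result, sketched here under the assumption that the span directly follows the h3 as in the sample, is to let the selector pre-filter and then check the sibling span with an anchored regular expression. The name employee_one is just illustrative.

```python
import re
from bs4 import BeautifulSoup

# Sample records copied from the question (attributes omitted, as in the post).
html = '''
<div>
    <h3>John Smith</h3>
    <span>Center - VAR - Employee I</span>
</div>
<div>
    <h3>Jenna Smith</h3>
    <span>West - VAR - Employee I</span>
</div>
<div>
    <h3>Jordan Smith</h3>
    <span>East - VAR - Employee II</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Anchored pattern: "Employee I" must be the last thing in the span text.
employee_one = re.compile(r'Employee I$')

names = [
    h3.get_text(strip=True)
    # The selector keeps every h3 directly followed by a span mentioning "Employee I"
    # (this still lets "Employee II" through, since it is a substring match).
    for h3 in soup.select('h3:has(+ span:-soup-contains("Employee I"))')
    # The regex check then drops the "Employee II" records.
    if employee_one.search(h3.find_next_sibling('span').get_text(strip=True))
]

print(names)
```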