I have an HTML file like this (with more than 100 records):
<div >
<h3 >John Smith</h3>
<span >Center - VAR - Employee I</span>
</div>
<div >
<h3 >Jenna Smith</h3>
<span >West - VAR - Employee I</span>
</div>
<div >
<h3 >Jordan Smith</h3>
<span >East - VAR - Employee II</span>
</div>
I need to extract the names only if they are Employee I, which makes it challenging. How can I select the tags that have Employee I in the next tag? Or should I use a different method? Is it even possible to use a condition in this case?
import re

with open("file.html", 'r') as fp:
    html = fp.read()
print(re.search(r'\bEmployee I\b', html).group(0))
In other words, how can I tell it to go back and read the previous tag?
CodePudding user response:
import re
from bs4 import BeautifulSoup

with open('inputfile.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')

names = [span.parent.find('h3').string
         for span in
         soup.find_all('span',
                       class_='light-text',
                       string=re.compile('Employee I$'))
         ]
print(names)
gives
['John Smith', 'Jenna Smith']
I've formatted the list comprehension over several lines for clarity, so that it is easier to see where to adjust things for other use cases. Of course, a normal for-loop that appends to a list also works fine; I just like list comprehensions.
The re.compile('Employee I$') is necessary to avoid matching on 'Employee II'. The class_ argument is an extra, and may not be needed.
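To see why the `$` anchor matters, here is a minimal sketch that uses only the `re` module on the three role strings from the question. The variable names are made up for illustration, and it assumes the pattern is applied with `search`, which as far as I know is also how BeautifulSoup applies a compiled pattern passed as `string=`.

```python
import re

# The role strings from the question's sample HTML.
roles = [
    "Center - VAR - Employee I",
    "West - VAR - Employee I",
    "East - VAR - Employee II",
]

unanchored = re.compile("Employee I")   # also matches inside "Employee II"
anchored = re.compile("Employee I$")    # the text must end right after the single "I"

matches_unanchored = [r for r in roles if unanchored.search(r)]
matches_anchored = [r for r in roles if anchored.search(r)]

print(matches_unanchored)  # all three roles
print(matches_anchored)    # only the two "Employee I" roles
```

If the real span text might carry trailing whitespace, a pattern such as `'Employee I\s*$'` would be a safer variant.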
The rest is near self-explanatory, especially with the BeautifulSoup documentation next to it.
Note that the string argument of find_all was called text before BeautifulSoup 4.4.0, in case you're using an older version of BeautifulSoup.
CodePudding user response:
from bs4 import BeautifulSoup
test = '''<div >
<h3 >John Smith</h3>
<span >Center - VAR - Employee I</span>
</div>
<div >
<h3 >Jenna Smith</h3>
<span >West - VAR - Employee I</span>
</div>
<div >
<h3 >Jordan Smith</h3>
<span >East - VAR - Employee II</span>
</div>'''
soup = BeautifulSoup(test, 'html.parser')
for person in soup.find_all('div'):
    name = person.find('h3').text
    # The role is the third "-"-separated field of the span text.
    employee_nb = person.find('span').text.split('-')[2].strip()
    if employee_nb == "Employee I":
        print(name)
CodePudding user response:
You could also use CSS selectors to select your elements more specifically.
As of version 4.7.0, Beautiful Soup supports most CSS4 selectors via the SoupSieve project. If you installed Beautiful Soup through pip, SoupSieve was installed at the same time, so you don’t have to do anything extra.
Example
from bs4 import BeautifulSoup
html = '''
<div >
<h3 >John Smith</h3>
<span >Center - VAR - Employee I</span>
</div>
<div >
<h3 >Jenna Smith</h3>
<span >West - VAR - Employee I</span>
</div>
<div >
<h3 >Jordan Smith</h3>
<span >East - VAR - Employee II</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# "+" restricts the match to an h3 whose next sibling contains the text
print([e.text for e in soup.select('h3:has(+ :-soup-contains("Employee"))')])
Output
['John Smith', 'Jenna Smith', 'Jordan Smith']
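Because `:-soup-contains()` does substring matching, searching for "Employee I" would still return Jordan Smith ("Employee II" contains "Employee I"). A possible sketch for getting only the Employee I names is to let the selector narrow things down and then do an exact comparison in Python. It assumes the role is always the last " - "-separated field, as in the sample, and that the installed SoupSieve is recent enough to know `:-soup-contains` (2.1 or later, I believe).

```python
from bs4 import BeautifulSoup

html = '''
<div>
  <h3>John Smith</h3>
  <span>Center - VAR - Employee I</span>
</div>
<div>
  <h3>Jenna Smith</h3>
  <span>West - VAR - Employee I</span>
</div>
<div>
  <h3>Jordan Smith</h3>
  <span>East - VAR - Employee II</span>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# h3 elements whose next sibling is a span containing "Employee I"
# (this still includes "Employee II", since the match is a substring match).
candidates = soup.select('h3:has(+ span:-soup-contains("Employee I"))')

names = [
    h3.get_text(strip=True)
    for h3 in candidates
    # Exact comparison on the last field rules out "Employee II".
    if h3.find_next_sibling("span").get_text(strip=True).split(" - ")[-1] == "Employee I"
]
print(names)  # ['John Smith', 'Jenna Smith']
```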