Using BeautifulSoup to search HTML for a section and its content

Time:09-07

I have an HTML page whose content looks like this. I am copying only the relevant part, because the full page is too large to paste here.

...........
<a  href="/faculty/search.cfm?JobCat=117">
                Electrical Engineering
            </a>
<br/>
</div>
</div>
<div id="jobDesc">
<p><strong>Tenure Status:</strong> Tenure Track</p>The Pott College of Science, Engineering, and Education at the University of Southern
Indiana invites applications for a tenure-track Assistant or Associate Professor of Engineering with a start date of August 2023. Doctorate
in Electrical Engineering or closely related field required. ABD candidates with a Masters in Electrical Engineering or closely related
field, will be considered at the Instructor level. Conferral of a doctoral degree is required no later than December 31, 2023. This position
includes classroom teaching assignments in electrical engineering and technology courses, as well as freshman engineering. The selected
candidate is also expected to direct student design projects. It is expected that this faculty member will have a research agenda that leads
to scholarly works and the engagement of undergraduate students. The position will teach classes in the areas of electrical power and
machines and/or signal processing, but all areas of electrical engineering are welcome.<br><br>The USI Department of Engineering offers
degree programs in Mechanical Engineering, Manufacturing Engineering, Electrical Engineering, Manufacturing Engineering Technology,
Industrial Supervision, and General Engineering with emphasis areas of Electrical, Mechanical, Industrial, Civil, and Mechatronics. The
University of Southern Indiana is committed to excellence in teaching, scholarship and professional activity, and service to the university
and the community. USI is located in the beautiful rolling hills of the southwestern Ohio River valley.<br><br>Applications:<br/>Click "Apply
for this job" near the top right of this page to complete an application and upload application materials to the attention of Dr. Paul
Kuban, Chair of the Search Committee. Application materials should include:<br/>(1) letter of application<br/>(2) current curriculum
vitae<br/>(3) names and full contact information including e-mail addresses for three professional references<br/>(4) unofficial
transcripts<br/>(5) brief statement of teaching philosophy.<br/><br/>Materials should be provided electronically within this web-based
application system. Official transcripts will be required at a later stage. Review of candidates will begin immediately and continue until
the position is filled.<p><strong>Job Description Summary</strong></p><span>The Pott College of Science, Engineering, and Education at the
University of Southern Indiana invites applications for a tenure-track <strong>Assistant or Associate Professor of Engineering</strong> with
a start date of August 2023. Doctorate in Electrical Engineering or closely related field required. ABD candidates with a Masters in
Electrical Engineering or closely related field, will be considered at the Instructor level. Conferral of a doctoral degree is required no
later than December 31, 2023. This position includes classroom teaching assignments in electrical engineering and technology courses, as
well as freshman engineering. The selected candidate is also expected to direct student design projects. It is expected that this faculty
member will have a research agenda that leads to scholarly works and the engagement of undergraduate students. The position will teach
classes in the areas of electrical power and machines and/or signal processing, but all areas of electrical engineering are
welcome.<br/><br/>The USI Department of Engineering offers degree programs in Mechanical Engineering, Manufacturing Engineering, Electrical
Engineering, Manufacturing Engineering Technology, Industrial Supervision, and General Engineering with emphasis areas of Electrical,
Mechanical, Industrial, Civil, and Mechatronics. The University of Southern Indiana is committed to excellence in teaching, scholarship and
professional activity, and service to the university and the community. USI is located in the beautiful rolling hills of the southwestern
Ohio River valley.<br/><br/><strong>Applications:</strong><br/>Click "Apply for this job" near the top right of this page to complete an
application and upload application materials to the attention of Dr. Paul Kuban, Chair of the Search Committee. Application materials should
include:<br/>(1) letter of application<br/>(2) current curriculum vitae<br/>(3) names and full contact information including e-mail addresses
for three professional references<br/>(4) unofficial transcripts<br/>(5) brief statement of teaching philosophy. <br/><br/>Materials should be
provided electronically within this web-based application system. Official transcripts will be required at a later stage. Review of
candidates will begin immediately and continue until the position is filled.</span><p><strong>Interview
Accommodations</strong></p><span>Persons with disabilities requiring accommodations in the application and interview process please contact
the manager of Employment at <a href="[email protected]">[email protected]</a> or (812) 464-1840. Contacting the manager of
Employment is intended for use in seeking disability-related accommodations only. For general applicant inquiries, contact Human Resources
at <a href="[email protected]">[email protected]</a> or (812) 464-1815.</span><p><strong>EEO Statement</strong></p><span>The University
of Southern Indiana is an EEO/AA employer. All individuals including minorities, women, individuals with disabilities and veterans are
encouraged to apply.</span>
</br></br></br></br></div>
<div id="jobStatement">
.............

Expected answer: Extract the following information

'Application material':'Click "Apply for this job" near the top right of this page.  Application materials should include:(1) current curriculum vitae (2) transcrip'

My code:

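from bs4 import BeautifulSoup

# html is assumed to be the HTTP response for the page; naming a parser avoids the "no parser specified" warning
soup = BeautifulSoup(html.text, 'html.parser')

d = {'Application material':soup.find_all('div',id='jobDesc')}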

Current output:

[<div id="jobDesc">
 <p><strong>Tenure Status:</strong> Tenure Track</p>The Pott College of Science, Engineering, and Education at the University of Southern
......USI is located in the beautiful rolling hills of the southwestern
 Ohio River valley.<br/><br/><strong>Applications:</strong><br/>Click "Apply for this job" near the top right of this page to complete an
 application and upload application materials to the attention of Dr. Paul Kuban, Chair of the Search Committee. Application materials should
 include:<br/>(1) letter of application<br/>(2) current curriculum vitae<br/>(3) names and full contact information including e-mail addresses
 for three professional references<br/>(4) unofficial transcripts<br/>(5) brief statement of teaching philosophy. <br/><br/>Materials should be
 provided electronically within this web-based application system. Official .... of Southern Indiana is an EEO/AA employer. All individuals including minorities, women, individuals with disabilities and veterans are
 encouraged to apply.</span>
 </div>]
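
The output above is a list holding the whole tag because find_all returns a list-like ResultSet of Tag objects, not text. A minimal sketch of the difference, assuming a shortened, made-up stand-in for the page (the `sample` string is hypothetical, not the real HTML):

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the page in the question.
sample = """
<div id="jobDesc">
<p><strong>Tenure Status:</strong> Tenure Track</p>Intro text.<br/><br/>
<strong>Applications:</strong><br/>Click "Apply for this job".<br/>(1) letter of application<br/>(2) current curriculum vitae
</div>
"""

soup = BeautifulSoup(sample, "html.parser")

matches = soup.find_all("div", id="jobDesc")  # list-like ResultSet of Tag objects
div = soup.find("div", id="jobDesc")          # a single Tag, or None if nothing matches

# get_text() flattens the tag to a plain string; the separator keeps
# pieces that were divided by <br/> from running together.
text = div.get_text(separator=" ", strip=True) if div is not None else ""

print(type(matches).__name__, len(matches))
print(text)
```

This still returns the text of the whole div, so it only removes the markup; narrowing it down to the "Applications:" part needs a further step.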

CodePudding user response:

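You can use find_all(string=True) to get all the text nodes (findAll and text= are older spellings of the same call).

Then drop the first node with [1:] to exclude the "The company gives wide benefits" part. This relies on that sentence being the first text node in the snippet.

Then use join() to combine the rest into one string: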

from bs4 import BeautifulSoup

html = """
The company gives wide benefits.<br/><br/><strong>Applications:</strong><br/>Click "Apply for this job" near the top right of this page. Application materials should
include:<br/>(1) current curriculum vitae<br/>(2) unofficial transcripts<br/><br/> Submit these materials electronically on the website.
"""

soup = BeautifulSoup(html, 'html.parser')
# find_all(string=True) is the current spelling of findAll(text=True); [1:] skips the first text node
childs = soup.find_all(string=True)[1:]

text = ' '.join(childs)

print(text)

Result:

Applications: Click "Apply for this job" near the top right of this page. Application materials should
include: (1) current curriculum vitae (2) unofficial transcripts  Submit these materials electronically on the website.
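
Skipping a fixed number of text nodes only works when the position of the heading is known in advance. A possible alternative, sketched here under the assumption that the heading is a `<strong>` tag and that the next `<strong>` or `<p>` sibling starts a new section, is to locate the heading and walk its following siblings. The helper name `section_after` and the trailing "EEO Statement" paragraph in the sample are made up for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
The company gives wide benefits.<br/><br/><strong>Applications:</strong><br/>Click "Apply for this job" near the top right of this page. Application materials should
include:<br/>(1) current curriculum vitae<br/>(2) unofficial transcripts<br/><br/> Submit these materials electronically on the website.
<p><strong>EEO Statement</strong></p>
"""


def section_after(soup, heading):
    """Collect the text that follows <strong>heading</strong> up to the next section."""
    marker = soup.find("strong", string=lambda s: s and s.strip() == heading)
    if marker is None:
        return ""
    parts = []
    for node in marker.next_siblings:
        if isinstance(node, Tag):
            if node.name == "br":
                continue                      # line breaks carry no text
            if node.name in ("strong", "p"):
                break                         # assumed start of the next section
            parts.append(node.get_text(" ", strip=True))
        else:
            piece = " ".join(str(node).split())  # collapse newlines and runs of spaces
            if piece:
                parts.append(piece)
    return " ".join(parts)


soup = BeautifulSoup(html, "html.parser")
section = section_after(soup, "Applications:")
print(section)
```

On the real page the heading sits inside a `<span>`, so the walk would stop at the end of that span; whether that boundary is the right one depends on the page.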

CodePudding user response:

You can try the following example. It selects the div by its id and keeps the text that comes after "Applications:":

from bs4 import BeautifulSoup

html = '''
<html>
 <body>
  <div id="jobDesc">
   <p>
    <strong>
     Tenure Status:
    </strong>
    Tenure Track
   </p>
   The Pott College of Science, Engineering, and Education at the University of Southern
......USI is located in the beautiful rolling hills of the southwestern
 Ohio River valley.
   <br/>
   <br/>
   <strong>
    Applications:
   </strong>
   <br/>
   Click "Apply for this job" near the top right of this page to complete an
 application and upload application materials to the attention of Dr. Paul Kuban, Chair of the Search Committee. Application materials should
 include:
   <br/>
   (1) letter of application
   <br/>
   (2) current curriculum vitae
   <br/>
   (3) names and full contact information including e-mail addresses
 for three professional references
   <br/>
   (4) unofficial transcripts
   <br/>
   (5) brief statement of teaching philosophy.
   <br/>
   <br/>
   Materials should be
 provided electronically within this web-based application system. Official .... of Southern Indiana is an EEO/AA employer. All individuals including minorities, women, individuals with disabilities and veterans are
 encouraged to apply.
  </div>
 </body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
#print(soup.prettify())

# Take each matching div's text, keep what follows the last "Applications:", and drop newlines
d = {'Application material': ''.join(
    x.get_text(strip=True).split('Applications:')[-1].replace('\n', '')
    for x in soup.select('div#jobDesc:-soup-contains("Applications:")')
)}
print(d)

Output:

{'Application material': 'Click "Apply for this job" near the top right of this page to complete an  application and upload application materials to the attention of Dr. Paul Kuban, Chair of the Search Committee. Application materials should  include:(1) letter of application(2) current curriculum vitae(3) names and full contact information including e-mail addresses  for three professional references(4) unofficial transcripts(5) brief statement of teaching philosophy.Materials should be  provided electronically within this web-based application system. Official .... of Southern Indiana is an EEO/AA employer. All individuals including minorities, women, individuals with disabilities and 
veterans are  encouraged to apply.'}
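
The output above runs on to the end of the div because split('Applications:')[-1] keeps everything after the heading. If a phrase that reliably follows the wanted text is known, the string can be cut at both ends. A minimal sketch, assuming "Materials should be" is such a phrase (that end marker, the helper `between`, and the shortened sample are illustrative choices, not part of the original answer):

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the div in the question.
html = """
<div id="jobDesc">Intro.<br/><br/><strong>Applications:</strong><br/>Click "Apply for this job" near the top right of this page. Application materials should
 include:<br/>(1) current curriculum vitae<br/>(2) unofficial transcripts<br/><br/>Materials should be provided electronically.</div>
"""


def between(text, start, end):
    """Return the part of text after the first start marker and before the next end marker."""
    _, found, rest = text.partition(start)
    if not found:
        return ""
    body, _, _ = rest.partition(end)
    return body.strip()


soup = BeautifulSoup(html, "html.parser")
div = soup.select_one("div#jobDesc")

# Flatten to text with spaces where the <br/> tags were, then collapse whitespace.
flat = " ".join(div.get_text(" ", strip=True).split()) if div is not None else ""

d = {"Application material": between(flat, "Applications:", "Materials should be")}
print(d)
```

Because partition uses the first occurrence of the start marker, a page that mentions "Applications:" more than once would need a more specific marker.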