Python: Scrape the same item from all subpages using BeautifulSoup

Time:09-11

I am trying to scrape the "salary" field from each subpage. For one of the subpages, I took the output of soup = BeautifulSoup(requests.get('url_of_job').text), copied it into a Word file, and cut out the part surrounding the salary, which is pasted below. The full text exceeds the post length limit here.

soup =

```
<script type="application/ld json">
                    {
                        "@context"              :   "http://schema.org/",
                        "@type"                 :   "JobPosting",
                        "industry"              :   ["Academic/Education","Education"],
                        "title"                 :   "Assistant Professor of Engineering F99507",
                        "description"           :   "&lt;p&gt;&lt;strong&gt;&lt;u&gt;MCNEESE STATE UNIVERSITY&lt;/u&gt;&lt;/strong&gt;&lt;strong&gt; invites applicants for the position of Assistant Professor of Engineering.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DEPARTMENT: &lt;/strong&gt;Engineering and Computer Science&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;POSITION INFORMATION: &lt;/strong&gt;This is a full-time, 9-month, unclassified, tenure-track position. The appointment begins in January 2023.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;POSITION NUMBER: &lt;/strong&gt;F99507&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SALARY: &lt;/strong&gt;$76,000&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;REPORTING AUTHORITY: &lt;/strong&gt;Department Head of Engineering and Computer Science&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;POSITION DUTIES/RESPONSIBILITIES: &lt;/strong&gt;The Assistant Professor of Engineering is responsible for functions related to teaching courses, advising students, and conducting research. The assistant professor will work closely with faculty and staff in the Department of Engineering and Computer Science.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;REQUIRED/PREFERRED QUALIFICATIONS: &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;u&gt;Required&lt;/u&gt;:&amp;nbsp;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Master of Engineering with 18 hours in Electrical Engineering&lt;/li&gt;
&lt;li&gt;Experience with Power Engineering&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;u&gt;Preferred&lt;/u&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PhD in Electrical Engineering&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;DEADLINE: &lt;/strong&gt;Position will remain open until filled.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;APPLICATION PROCESS AND MATERIALS: &lt;/strong&gt;For the application process, applicants are required and &lt;strong&gt;&lt;span style=&quot;text-decoration: underline;&quot;&gt;MUST&lt;/span&gt;&lt;/strong&gt; complete an electronic application and upload required documents listed below:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Letter of Application (Cover Letter)&lt;/li&gt;
&lt;li&gt;Resume/Vitae&lt;/li&gt;
&lt;li&gt;Three Professional References (include: name, phone number, and e-mail address)&lt;/li&gt;
&lt;li&gt;Unofficial Transcripts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;TO APPLY FOR THIS VACANCY, &lt;/strong&gt;click on the &lt;strong&gt;&amp;ldquo;APPLY&amp;rdquo;&lt;/strong&gt; button at the top of advertisement to complete the electronic application, which can be used for this vacancy as well as future job opportunities. Applicants are responsible for checking the status of their application to determine where they are in the recruitment process. Further status message information is located under the Information section of the Current Job Opportunities page.&lt;/p&gt;
&lt;p&gt;ALL JOB OFFERS ARE CONTINGENT UPON THE SUCCESSFUL RESULT OF A CRIMINAL BACKGROUND CHECK AND RECEIPT OF TRANSCRIPT(S) IF APPLICABLE.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If you have questions regarding this recruitment, please contact the HR Liaison:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Kim Dronett at &lt;a rel=&quot;noopener&quot; href=&quot;[email protected]&quot; target=&quot;_blank&quot;&gt;[email protected]&lt;/a&gt; or (337) 475-5413.&lt;/p&gt;",
                        "datePosted"            :   "2022-09-02 09:48:23.527",

                    
                    
                        "validThrough"          :   "2023-03-01 23:59:59.9",
                    
                        "baseSalary"            :   {
                                                    "@type"         :   "MonetaryAmount",
                                                    "currency"      :   "USD",
                                                    "value"         :   {
                                                                                "@type"     :   "QuantitativeValue",
                                                                            
                                                                                "value" :   "76,000.00",
                                                                                                                                                        
                                                                                "unitText"  :   "YEAR"
                                                                            }
                                                    },
                                
                        "hiringOrganization"    :   {
                                                        "@type"     :   "Organization",
                                                        
                                                            "id"        :   "http://www.mcneese.edu",
                                                            "url"       :   "http://www.mcneese.edu",
                                                                                                            
                                                        "name"      :   "McNeese State University"
                                                    },
                        "jobLocation"           :   {
                                                        "@type"     :   "Place",
                                                        "address"   :   {
                                                                            "@type"             :   "PostalAddress",
                                                                            "addressLocality"   :   "Lake Charles",
                                                                            "addressRegion"     :   "LA",
                                                                            "addressCountry"    :   "US"
                                                                        }
                                                    },
                    

                        "employmentType"        :   "FULL_TIME",
                        "image"                 :   "https://www.higheredjobs.com/assets/hej/img/HEJ_Logo_2c.png"
                    }
                </script>
<noscript>
<style type="text/css">
                @media (min-width: 768px) { /*menus work different at XS*/
                    HEADER ul.nav li.dropdown:hover > .arrow,
                    HEADER ul.nav li.dropdown:hover > ul.dropdown-menu {
                        display: block;
                    }
                }
            </style>
</noscript>

<div >
<h5>Resources</h5>
<ul>
<li><a href="/career/">Career Resources</a></li>
<li><a href="/salary/">Salary Data</a></li>
<li><a href="/career/resumes.cfm">Job Search Tips</a></li>
<li><a href="/career/ResumeService.cfm">Resume/CV Writing<br/> Service</a></li>
<li><a href="/articles/DiversityResources.cfm">Diversity Resources</a></li>
<li><a href="/career/SiteListings.cfm">Search Firms</a></li>
</ul>
</div>
</div>
</li>
</ul>
</li>
<li >
<a  data-toggle="dropdown" href="/employers/">


<div  id="jobAttrib">
<div >
<strong>Type:</strong>
        Full-Time
        <br/>
<strong>Salary:</strong>
            $76,000.00 USD Per Year
            <br/>
<strong>Posted:</strong>
            09/02/2022 
            <br/>
<strong>Application Due:</strong>
            Open Until Filled
            <br/>
<strong>Category:</strong>
<a  href="/faculty/search.cfm?JobCat=117">
                Electrical Engineering
            </a>
<br/>
</div>
</div>

</div>
</div>
<div id="jobDesc">
<p><strong><u>MCNEESE STATE UNIVERSITY</u></strong><strong> invites applicants for the position of Assistant Professor of
Engineering.</strong></p> <p><strong>DEPARTMENT: </strong>Engineering and Computer Science </p> <p><strong>POSITION INFORMATION:
</strong>This is a full-time, 9-month, unclassified, tenure-track position. The appointment begins in January 2023.</p> <p><strong>POSITION
NUMBER: </strong>F99507</p> <p><strong>SALARY: </strong>$76,000</p> <p><strong>REPORTING AUTHORITY: </strong>Department Head of Engineering
and Computer Science</p> <p><strong>POSITION DUTIES/RESPONSIBILITIES: </strong>The Assistant Professor of Engineering is responsible for
functions related to teaching courses, advising students, and conducting research. The assistant professor will work closely with faculty
and staff in the Department of Engineering and Computer Science.</p> <p><strong>REQUIRED/PREFERRED QUALIFICATIONS: </strong></p>
<p><u>Required</u>: </p> <ul>
<li>Master of Engineering with 18 hours in Electrical Engineering</li>
<li>Experience with Power
Engineering</li>
</ul> <p><u>Preferred</u>:</p> <ul>
<li>PhD in Electrical Engineering</li>
</ul> <p><strong>DEADLINE: </strong>Position
will remain open until filled.   
``` 

My code:

'salary': soup.select_one('strong:-soup-contains("Salary:")').get_text(strip=True)

Current result:

NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.

Expected output:

'salary':$76,000.00 USD Per Year

CodePudding user response:

Here is a solution which allows you to programmatically access that job's info:

from bs4 import BeautifulSoup
import json
html = '''
<script type="application/ld json">
                    {
                        "@context"              :   "http://schema.org/",
                        "@type"                 :   "JobPosting",
                        "industry"              :   ["Academic/Education","Education"],
                        "title"                 :   "Assistant Professor of Engineering F99507",
                        "description"           :   "&lt;p&gt;&lt;strong&gt;&lt;u&gt;MCNEESE STATE UNIVERSITY&lt;/u&gt;&lt;/strong&gt;&lt;strong&gt; invites applicants for the position of Assistant Professor of Engineering.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DEPARTMENT: &lt;/strong&gt;Engineering and Computer Science&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;POSITION INFORMATION: &lt;/strong&gt;This is a full-time, 9-month, unclassified, tenure-track position. The appointment begins in January 2023.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;POSITION NUMBER: &lt;/strong&gt;F99507&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SALARY: &lt;/strong&gt;$76,000&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;REPORTING AUTHORITY: &lt;/strong&gt;Department Head of Engineering and Computer Science&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;POSITION DUTIES/RESPONSIBILITIES: &lt;/strong&gt;The Assistant Professor of Engineering is responsible for functions related to teaching courses, advising students, and conducting research. The assistant professor will work closely with faculty and staff in the Department of Engineering and Computer Science.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;REQUIRED/PREFERRED QUALIFICATIONS: &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;u&gt;Required&lt;/u&gt;:&amp;nbsp;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Master of Engineering with 18 hours in Electrical Engineering&lt;/li&gt;
&lt;li&gt;Experience with Power Engineering&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;u&gt;Preferred&lt;/u&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PhD in Electrical Engineering&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;DEADLINE: &lt;/strong&gt;Position will remain open until filled.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;APPLICATION PROCESS AND MATERIALS: &lt;/strong&gt;For the application process, applicants are required and &lt;strong&gt;&lt;span style=&quot;text-decoration: underline;&quot;&gt;MUST&lt;/span&gt;&lt;/strong&gt; complete an electronic application and upload required documents listed below:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Letter of Application (Cover Letter)&lt;/li&gt;
&lt;li&gt;Resume/Vitae&lt;/li&gt;
&lt;li&gt;Three Professional References (include: name, phone number, and e-mail address)&lt;/li&gt;
&lt;li&gt;Unofficial Transcripts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;TO APPLY FOR THIS VACANCY, &lt;/strong&gt;click on the &lt;strong&gt;&amp;ldquo;APPLY&amp;rdquo;&lt;/strong&gt; button at the top of advertisement to complete the electronic application, which can be used for this vacancy as well as future job opportunities. Applicants are responsible for checking the status of their application to determine where they are in the recruitment process. Further status message information is located under the Information section of the Current Job Opportunities page.&lt;/p&gt;
&lt;p&gt;ALL JOB OFFERS ARE CONTINGENT UPON THE SUCCESSFUL RESULT OF A CRIMINAL BACKGROUND CHECK AND RECEIPT OF TRANSCRIPT(S) IF APPLICABLE.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If you have questions regarding this recruitment, please contact the HR Liaison:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Kim Dronett at &lt;a rel=&quot;noopener&quot; href=&quot;[email protected]&quot; target=&quot;_blank&quot;&gt;[email protected]&lt;/a&gt; or (337) 475-5413.&lt;/p&gt;",
                        "datePosted"            :   "2022-09-02 09:48:23.527",

                    
                    
                        "validThrough"          :   "2023-03-01 23:59:59.9",
                    
                        "baseSalary"            :   {
                                                    "@type"         :   "MonetaryAmount",
                                                    "currency"      :   "USD",
                                                    "value"         :   {
                                                                                "@type"     :   "QuantitativeValue",
                                                                            
                                                                                "value" :   "76,000.00",
                                                                                                                                                        
                                                                                "unitText"  :   "YEAR"
                                                                            }
                                                    },
                                
                        "hiringOrganization"    :   {
                                                        "@type"     :   "Organization",
                                                        
                                                            "id"        :   "http://www.mcneese.edu",
                                                            "url"       :   "http://www.mcneese.edu",
                                                                                                            
                                                        "name"      :   "McNeese State University"
                                                    },
                        "jobLocation"           :   {
                                                        "@type"     :   "Place",
                                                        "address"   :   {
                                                                            "@type"             :   "PostalAddress",
                                                                            "addressLocality"   :   "Lake Charles",
                                                                            "addressRegion"     :   "LA",
                                                                            "addressCountry"    :   "US"
                                                                        }
                                                    },
                    

                        "employmentType"        :   "FULL_TIME",
                        "image"                 :   "https://www.higheredjobs.com/assets/hej/img/HEJ_Logo_2c.png"
                    }
                </script>
<noscript>
<style type="text/css">
                @media (min-width: 768px) { /*menus work different at XS*/
                    HEADER ul.nav li.dropdown:hover > .arrow,
                    HEADER ul.nav li.dropdown:hover > ul.dropdown-menu {
                        display: block;
                    }
                }
            </style>
</noscript>

<div >
<h5>Resources</h5>
<ul>
<li><a href="/career/">Career Resources</a></li>
<li><a href="/salary/">Salary Data</a></li>
<li><a href="/career/resumes.cfm">Job Search Tips</a></li>
<li><a href="/career/ResumeService.cfm">Resume/CV Writing<br/> Service</a></li>
<li><a href="/articles/DiversityResources.cfm">Diversity Resources</a></li>
<li><a href="/career/SiteListings.cfm">Search Firms</a></li>
</ul>
</div>
</div>
</li>
</ul>
</li>
<li >
<a  data-toggle="dropdown" href="/employers/">


<div  id="jobAttrib">
<div >
<strong>Type:</strong>
        Full-Time
        <br/>
<strong>Salary:</strong>
            $76,000.00 USD Per Year
            <br/>
<strong>Posted:</strong>
            09/02/2022 
            <br/>
<strong>Application Due:</strong>
            Open Until Filled
            <br/>
<strong>Category:</strong>
<a  href="/faculty/search.cfm?JobCat=117">
                Electrical Engineering
            </a>
<br/>
</div>
</div>

</div>
</div>
<div id="jobDesc">
<p><strong><u>MCNEESE STATE UNIVERSITY</u></strong><strong> invites applicants for the position of Assistant Professor of
Engineering.</strong></p> <p><strong>DEPARTMENT: </strong>Engineering and Computer Science </p> <p><strong>POSITION INFORMATION:
</strong>This is a full-time, 9-month, unclassified, tenure-track position. The appointment begins in January 2023.</p> <p><strong>POSITION
NUMBER: </strong>F99507</p> <p><strong>SALARY: </strong>$76,000</p> <p><strong>REPORTING AUTHORITY: </strong>Department Head of Engineering
and Computer Science</p> <p><strong>POSITION DUTIES/RESPONSIBILITIES: </strong>The Assistant Professor of Engineering is responsible for
functions related to teaching courses, advising students, and conducting research. The assistant professor will work closely with faculty
and staff in the Department of Engineering and Computer Science.</p> <p><strong>REQUIRED/PREFERRED QUALIFICATIONS: </strong></p>
<p><u>Required</u>: </p> <ul>
<li>Master of Engineering with 18 hours in Electrical Engineering</li>
<li>Experience with Power
Engineering</li>
</ul> <p><u>Preferred</u>:</p> <ul>
<li>PhD in Electrical Engineering</li>
</ul> <p><strong>DEADLINE: </strong>Position
will remain open until filled.   
'''
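soup = BeautifulSoup(html, 'html.parser')
# Option 1: read the salary from the JSON metadata inside the <script> tag
script = soup.select_one('script[type="application/ld json"]')
# print(script.contents[0])
# strict=False lets json.loads accept the raw line breaks inside "description"
json_obj = json.loads(script.contents[0], strict=False)
print(json_obj['industry'])
print(json_obj['baseSalary']['value']['value'])
# Option 2: read the salary from the visible HTML, not from the script tag.
# The excerpt above shows this div with id="jobAttrib", so select it by id.
salary_parent_div = soup.select_one('div#jobAttrib')
# The salary text is the node that directly follows <strong>Salary:</strong>
salary = salary_parent_div.find('strong', string='Salary:').next_sibling
print(salary.strip())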

This prints the following in the terminal:

['Academic/Education', 'Education']
76,000.00
$76,000.00 USD Per Year
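
Since the question asks for the same item from every subpage, the two lookups above could be wrapped in one helper that is called once per page. The following is only a sketch under assumptions: `extract_salary` is a hypothetical name, the `jobAttrib` id is taken from the HTML excerpt in the question, and the three sample pages are cut-down stand-ins rather than real pages from the site.

```python
import json
from bs4 import BeautifulSoup


def extract_salary(html):
    """Return the salary text from one job page, or None if none is found."""
    soup = BeautifulSoup(html, "html.parser")

    # First choice: the visible "Salary:" label inside the job attributes box.
    box = soup.find(id="jobAttrib")
    if box is not None:
        label = box.find("strong", string="Salary:")
        sibling = label.next_sibling if label is not None else None
        # Only accept a text node; a tag such as <br/> is not a salary.
        if isinstance(sibling, str) and sibling.strip():
            return sibling.strip()

    # Fallback: the structured job metadata stored in a <script> tag.
    for script in soup.find_all("script"):
        if not script.get("type", "").startswith("application/ld"):
            continue
        if not script.string:
            continue
        try:
            data = json.loads(script.string, strict=False)
        except json.JSONDecodeError:
            continue
        salary = data.get("baseSalary") if isinstance(data, dict) else None
        value = salary.get("value") if isinstance(salary, dict) else None
        amount = value.get("value") if isinstance(value, dict) else None
        if amount is not None:
            return str(amount)

    return None


# Hypothetical stand-ins for the HTML of three subpages.
pages = {
    "job-1": '<div id="jobAttrib"><div><strong>Salary:</strong>\n'
             '  $76,000.00 USD Per Year\n<br/></div></div>',
    "job-2": '<script type="application/ld+json">'
             '{"baseSalary": {"value": {"value": "76,000.00"}}}</script>',
    "job-3": "<p>No salary listed</p>",
}

salaries = {name: extract_salary(html) for name, html in pages.items()}
print(salaries)
```

In a real run, each value in `pages` would be the `.text` of a `requests.get(...)` response for one subpage URL. Returning `None` for pages without a salary keeps one missing field from stopping the whole loop.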

You can also drill further into that JSON object to pull out more details about the job. BeautifulSoup docs: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
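
As an illustration of drilling into the parsed object, here is a minimal sketch that collects a few more fields. The JSON text is a shortened copy of the block from the question (most keys are left out), and the `details` dictionary and its key names are assumptions made for this example.

```python
import json

# Shortened copy of the job metadata from the question; only a few keys kept.
raw = '''
{
    "title": "Assistant Professor of Engineering F99507",
    "baseSalary": {
        "currency": "USD",
        "value": {"value": "76,000.00", "unitText": "YEAR"}
    },
    "hiringOrganization": {"name": "McNeese State University"},
    "jobLocation": {
        "address": {"addressLocality": "Lake Charles", "addressRegion": "LA"}
    },
    "employmentType": "FULL_TIME"
}
'''
job = json.loads(raw)

address = job["jobLocation"]["address"]
salary = job["baseSalary"]

details = {
    "title": job["title"],
    "employer": job["hiringOrganization"]["name"],
    "location": f'{address["addressLocality"]}, {address["addressRegion"]}',
    # The amount is a string with a thousands separator; drop the comma
    # before converting it to a number.
    "salary": float(salary["value"]["value"].replace(",", "")),
    "currency": salary["currency"],
    "per": salary["value"]["unitText"],
    "type": job["employmentType"],
}
print(details)
```

Converting the amount to a float makes it possible to sort or compare salaries across subpages, which the formatted string does not allow.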
