How to parse <script> tag using beautifulsoup

Time:01-03

I am trying to read the window.appCache from a glassdoor reviews site.

url = "https://www.glassdoor.com/Reviews/Alteryx-Reviews-E351220.htm"
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}) 
soup = BeautifulSoup(html.content,'html.parser') 
text = soup.findAll("script")[0].text

This isolates the dict I need. However, when I call json.loads() on it, I get the following error:

raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) 

I checked the type of text and it is str.

When I print text to a file, it looks something like this (just a snippet as the output is about 5000 lines):

window.appCache={"appName":"reviews","appVersion":"7.14.12","initialState"
{"surveyEndpoint":"https:\u002F\u002Femployee-pulse-survey-b2c.us-east-1.prod.jagundi.com",
"i18nStrings":{"_":"JSON MESSAGE BUNDLE - do not remove",
"eiHeader.seeAllPhotos":"
See All Photos","eiHeader.viewJobs":"View Jobs",
"eiHeader.bptw.description":"This employer is a winner of the [year] Best Places to Work award. 
Winners were determined by the people who know these companies best...

I am only concerned with the "reviews":[ field that is buried about halfway through the data, but I can't seem to parse the string into json and retrieve what I need.

CodePudding user response:

One solution is to extract the required data with the re and json modules:

import json
import pprint
import re

import requests

url = "https://www.glassdoor.com/Reviews/Alteryx-Reviews-E351220.htm"

html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

reviews = re.search(r'"reviews":(\[.*?}\])}', html, flags=re.S).group(1)
reviews = json.loads(reviews)

pprint.pprint(reviews)

Prints:

[{'__typename': 'EmployerReview',
  'advice': "Don't rush too finish a project",
  'adviceOriginal': None,
  'cons': 'Typical like other companies where newbies get higher salary and '
          'you have to work your way up for promotions nothing really bad',
  'consOriginal': None,
  'countHelpful': 0,
  'countNotHelpful': 0,
  'divisionLink': None,
  'divisionName': None,
  'employer': {'__ref': 'Employer:351220'},
  'employerResponses': [],
  'employmentStatus': None,
  'isCovid19': False,
  'isCurrentJob': True,
  'isLanguageMismatch': False,
  'isLegal': True,
  'jobEndingYear': None,
  'jobTitle': None,
  'languageId': 'eng',
  'lengthOfEmployment': 6,
  'location': None,

...and so on.
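
To try the same pattern without hitting the site, here is a minimal offline sketch. The sample_html string and the extract_reviews helper are made up for illustration; the real page is much larger and its layout may change, in which case the pattern would need adjusting.

```python
import json
import re

# Hypothetical stand-in for the downloaded page; the real HTML is far larger.
sample_html = (
    '<script>window.appCache={"apolloState":{"reviews":['
    '{"reviewId":1,"pros":"Remote work","advice":null},'
    '{"reviewId":2,"pros":"Great culture","advice":"Keep it simple"}'
    ']},"flag":undefined};</script>'
)


def extract_reviews(html):
    """Return the parsed "reviews" array, or None if the pattern is absent."""
    # Same pattern as above: lazily capture from "[" up to the first "}]}".
    match = re.search(r'"reviews":(\[.*?}\])}', html, flags=re.S)
    if match is None:
        return None
    return json.loads(match.group(1))


reviews = extract_reviews(sample_html)
print([review["reviewId"] for review in reviews])
```

The None check avoids an AttributeError from calling .group() when the page does not contain the expected text. Note that the lazy match stops at the first "}]}", so it relies on that sequence not appearing earlier inside the array.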

CodePudding user response:

Well, json.loads() expects a string that contains a JSON document. However, the value of text is not valid JSON because of the window.appCache= at the beginning.

And it's not just that. I tried slicing text to exclude the window.appCache= part:

text = text[len("window.appCache="):]

and it gave me this error:

raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 68113 (char 68112)

So I checked the value of text[68110:], and it turns out the parser was complaining because text is indeed not a valid JSON document:

undefined,"useErrorPages":true},"parsedRequest":{"urlData":{"url":"\u002FReviews\u002FAlteryx-Reviews-E351220.htm","params":{"employerId":351220,"page":null},"pagePrefix":"P","origin":"http:\u002F\u002Fwww.glassdoor.com"}},"seoConfig":{"appName":"reviews","staticPaths":[],"seoABTest":{},"pageType":"EMPLOYER_INFO","pageContentType":"REVIEWS","urlRegexMatchers":[function genericEiReviewsUrlMatcher(originalUrl) {
  var url = decodeURIComponent(originalUrl);
  var result = {
    params: {},
    helpers: {
      dos2ExperimentHelpers: _dos2ExperimentHelpers["default"]
    }
  };

  var getHumanReadableText = function getHumanReadableText(data) {
    return data.replace(/[ -]/g, ' ');
  };

This is the result of text[68110:]. It is a JavaScript object, but not a valid JSON object.
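
Rather than reading the offset out of the error message by hand, you can take it from the exception itself. A minimal sketch, where show_json_error_context and the sample string are hypothetical names used only for illustration:

```python
import json


def show_json_error_context(text, width=20):
    """Try to parse text; on failure return the characters around the error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # err.pos is the index in text where the decoder gave up.
        start = max(err.pos - width, 0)
        return text[start:err.pos + width]
    return None  # text parsed cleanly, nothing to show


# Tiny stand-in for the real 68,000+ character string.
sample = '{"a":1,"b":undefined,"c":true}'
print(show_json_error_context(sample, width=10))
```

This prints a short window of text around the failure point, which is usually enough to spot a stray undefined or function.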

JSON values cannot be any of the following data types:

  • a function
  • a date
  • undefined

As you can see, text has undefined and a function as values for some fields.

If you want the value of a specific field ("reviews", for example), I recommend extracting it from the string manually, for instance with regular expressions.
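
As one possible way to do that manual extraction without a hand-written regex, the standard library's json.JSONDecoder.raw_decode can decode a single JSON value starting at a given index and ignore whatever follows it. This is only a sketch: extract_json_value and the sample string are hypothetical, and it assumes the value for the key is itself valid JSON and follows the colon directly, with no whitespace.

```python
import json


def extract_json_value(text, key):
    """Decode the JSON value that follows the first occurrence of "key": in text."""
    marker = json.dumps(key) + ":"  # e.g. '"reviews":'
    start = text.find(marker)
    if start == -1:
        return None
    # raw_decode parses one JSON value beginning at the given index and
    # returns it together with the index where that value ends.
    value, _end = json.JSONDecoder().raw_decode(text, start + len(marker))
    return value


# Hypothetical snippet that, like the real data, is JavaScript but not JSON.
sample = (
    'window.appCache={"flag":undefined,'
    '"reviews":[{"reviewId":1,"pros":"a]b"}],'
    '"fn":function f(x) { return x; }};'
)
print(extract_json_value(sample, "reviews"))
```

Because a real JSON decoder does the work, brackets or quotes inside string values do not confuse it. It only looks at the first occurrence of the key, so if the key appears more than once you would need to search further along the string.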

CodePudding user response:

bs4 only parses HTML, not JavaScript (or CSS). As some of the comments have mentioned, a common approach is to split text at = and use json.loads to parse window.appCache. In this case, though, that will still raise a JSONDecodeError, because window.appCache contains JS functions and JS primitive values (like undefined) that JSON does not allow.
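
For reference, the split-at-= approach might look like the sketch below. parse_js_assignment is a hypothetical helper, and it only works when the right-hand side of the assignment is pure JSON, which is not the case for this page.

```python
import json


def parse_js_assignment(script_text):
    """Parse `name = <JSON>;` and return the value, or None if it is not pure JSON."""
    # partition splits on the first "=" only, so "=" inside the data is safe.
    _name, sep, payload = script_text.partition("=")
    if not sep:
        return None
    try:
        return json.loads(payload.strip().rstrip(";"))
    except json.JSONDecodeError:
        # Typical causes: undefined, functions, or other JS-only syntax.
        return None


print(parse_js_assignment('window.appCache={"appName":"reviews"};'))
print(parse_js_assignment('window.appCache={"appName":undefined};'))
```

The first call returns a dict; the second returns None because undefined is not a JSON value.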


I have a function findObj_inJS which uses slimit to parse a string containing JavaScript code and extract an object/variable from it. For example, findObj_inJS(text, '"reviews"') will return

{'name': 'Native_infosite_reviews_fluid_en-US', 'id': 'div-AdSlot-native-infosite-reviews', 'fluid': True}

and findObj_inJS(text, '"reviews"', findAll=True) will return

[
 {'name': 'Native_infosite_reviews_fluid_en-US', 'id': 'div-AdSlot-native-infosite-reviews', 'fluid': True},
 [
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 72183587, 'reviewDateTime': '2022-12-29T01:05:43.253', 'ratingOverall': 5, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'POSITIVE', 'ratingWorkLifeBalance': 5, 'ratingCultureAndValues': 5, 'ratingDiversityAndInclusion': 5, 'ratingSeniorLeadership': 5, 'ratingRecommendToFriend': 'POSITIVE', 'ratingCareerOpportunities': 5, 'ratingCompensationAndBenefits': 5, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 6, 'employmentStatus': None, 'jobEndingYear': None, 'jobTitle': None, 'location': None, 'originalLanguageId': None, 'pros': '-Great managers/leaders -Great benefits -Remote work', 'prosOriginal': None, 'cons': 'Typical like other companies where newbies get higher salary and you have to work your way up for promotions nothing really bad', 'consOriginal': None, 'summary': 'Great place to work, been here 4  years', 'summaryOriginal': None, 'advice': "Don't rush too finish a project", 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 0, 'countNotHelpful': 0, 'employerResponses': [], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 71986771, 'reviewDateTime': '2022-12-19T12:56:24.837', 'ratingOverall': 5, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'POSITIVE', 'ratingWorkLifeBalance': 5, 'ratingCultureAndValues': 5, 'ratingDiversityAndInclusion': 5, 'ratingSeniorLeadership': 5, 'ratingRecommendToFriend': 'POSITIVE', 'ratingCareerOpportunities': 5, 'ratingCompensationAndBenefits': 5, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 0, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:46094'}, 'location': None, 'originalLanguageId': None, 'pros': 'Great people, great culture, and exciting times ahead', 'prosOriginal': None, 'cons': 'Nothing to complain about for internal issues', 'consOriginal': None, 'summary': 'Best culture ever!', 'summaryOriginal': None, 'advice': None, 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 0, 'countNotHelpful': 0, 'employerResponses': [], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 71858088, 'reviewDateTime': '2022-12-14T08:39:44.030', 'ratingOverall': 4, 'ratingCeo': None, 'ratingBusinessOutlook': None, 'ratingWorkLifeBalance': 0, 'ratingCultureAndValues': 0, 'ratingDiversityAndInclusion': 0, 'ratingSeniorLeadership': 0, 'ratingRecommendToFriend': None, 'ratingCareerOpportunities': 0, 'ratingCompensationAndBenefits': 0, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 0, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:15169'}, 'location': None, 'originalLanguageId': None, 'pros': 'Alteryx has a good comp plan and if you’re a high performer they let you maximize your earnings. The product is amazing and we have a very fanatic customer base that love what the platform does.', 'prosOriginal': None, 'cons': 'A boys club in sales leadership. We have a female CRO and the diversity ends there. 90% of sales leaders are men and are buddies of current leaders that are brought over from their past jobs. There are leaders who have HR complaints against them but still hold jobs because they’re friends with SVP. The only female segment leader isn’t even given the same title as her male peers for holding the same job, she is an RVP while her 3 male counterparts are VPs. The sexism in leadership and in the sales org is pretty blatant and has not improved. They hired a DEI leader who does not seem to want to investigate the issues in sales even though they’ve been raised by lots of reps.', 'consOriginal': None, 'summary': 'Amazing product, great benefits, sexist sales culture', 'summaryOriginal': None, 'advice': 'You need to listen to individual contributors and lower level folks, and not just rely on your SVPs and VPs to get a pulse on the org. 
Younger reps care about diversity and inclusion and real equity, not just lip service and you will struggle to get any talent under 50 years old (like you have for years) to join the company since they will prefer organizations with better policies like salesforce. The fact that we cannot recruit female sellers because our maternity policy is still not on par with our tech peers should be concerning but nobody seems to discuss that other than 1st line manager who end up giving up and settling for having 1 woman per team. At some point you will fall so behind you won’t be able to catch up with the industry and become a company with a modern culture.', 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 6, 'countNotHelpful': 0, 'employerResponses': [{'__ref': 'EmployerResponse:4414519'}], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 72218335, 'reviewDateTime': '2022-12-30T17:50:21.263', 'ratingOverall': 2, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'NEGATIVE', 'ratingWorkLifeBalance': 2, 'ratingCultureAndValues': 3, 'ratingDiversityAndInclusion': 5, 'ratingSeniorLeadership': 4, 'ratingRecommendToFriend': 'NEGATIVE', 'ratingCareerOpportunities': 2, 'ratingCompensationAndBenefits': 4, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 2, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:34553'}, 'location': None, 'originalLanguageId': None, 'pros': 'Genuinely nice people are working at Alteryx. Great vision and hands-on c-suite leaders.', 'prosOriginal': None, 'cons': "Not many nice people aren't highly-skilled people. Many of PMs did not have prior PM experience from a tech company. Middle-managers are inexperienced except a few superstar PM Directors. Coming from tech industry with many years of experience, Alteryx is an extremely frustrating workplace. The best people from the tech industry who joined the company is leaving quickly because of that. The recent acquisitions made many of us in the states to work in early morning and evening because they came with off-shore offices, and WLB declined significantly this year.", 'consOriginal': None, 'summary': 'Sales driven senior leaders with below average Product and Engineering', 'summaryOriginal': None, 'advice': 'Please hire silicon valley top talents for the middle-manager roles instead of keep hiring their former colleagues from some mediocre companies. Otherwise, you will keep losing more talented professionals. Let our CPO lead the product engineering innovation. 
Too many teams outside of PE have so much to say and influence.', 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 0, 'countNotHelpful': 0, 'employerResponses': [], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 71891292, 'reviewDateTime': '2022-12-15T09:29:44.833', 'ratingOverall': 5, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'POSITIVE', 'ratingWorkLifeBalance': 5, 'ratingCultureAndValues': 5, 'ratingDiversityAndInclusion': 3, 'ratingSeniorLeadership': 5, 'ratingRecommendToFriend': 'POSITIVE', 'ratingCareerOpportunities': 5, 'ratingCompensationAndBenefits': 5, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 2, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:15169'}, 'location': None, 'originalLanguageId': None, 'pros': 'Supportive Executives Supporting teams Great compensation', 'prosOriginal': None, 'cons': 'Mid-level management are overkill Enterprise Team is confused on their objective', 'consOriginal': None, 'summary': 'Great place to work', 'summaryOriginal': None, 'advice': 'Keep it simple for sales', 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 0, 'countNotHelpful': 0, 'employerResponses': [], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 71907512, 'reviewDateTime': '2022-12-15T22:28:47.670', 'ratingOverall': 3, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'POSITIVE', 'ratingWorkLifeBalance': 3, 'ratingCultureAndValues': 5, 'ratingDiversityAndInclusion': 4, 'ratingSeniorLeadership': 3, 'ratingRecommendToFriend': 'NEGATIVE', 'ratingCareerOpportunities': 3, 'ratingCompensationAndBenefits': 5, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 4, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:29284'}, 'location': {'__ref': 'City:1148161'}, 'originalLanguageId': None, 'pros': '- Great culture - Diverse team - Good base salary compared to companies in the same field - Opportunities for networking - Competitive benefits package - Product adoption has been increasing over the years - Teammates always willing to help - Opportunity to learn from the best in the field', 'prosOriginal': None, 'cons': '- Lack of transparency in the workplace - Poor employee promotion/retention plan - Meritocracy is not used for promotions - Some professionals are underappreciated and undervalued, while low performers are highly recognized - Difficulties in finding meaning in the work', 'consOriginal': None, 'summary': 'Good company, but poor leadership', 'summaryOriginal': None, 'advice': None, 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 2, 'countNotHelpful': 0, 'employerResponses': [], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 71631140, 'reviewDateTime': '2022-12-05T11:39:02.247', 'ratingOverall': 5, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'POSITIVE', 'ratingWorkLifeBalance': 5, 'ratingCultureAndValues': 5, 'ratingDiversityAndInclusion': 5, 'ratingSeniorLeadership': 5, 'ratingRecommendToFriend': 'POSITIVE', 'ratingCareerOpportunities': 5, 'ratingCompensationAndBenefits': 5, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 1, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:15169'}, 'location': None, 'originalLanguageId': None, 'pros': "Alteryx is the most employee focused organization that I've worked for. In addition to excellent compensation and benefits, examples of how Alteryx goes above and beyond include: providing time to focus on mental health (two days off a year), encouraging employees to take time off to volunteer through Alteryx for Good, and providing opportunities for employees to grow their career through an emerging leaders program. The culture of the organization, from the leadership team on down, is very transparent, passionate, and positive which is why I think the company continues to hire and maintain some of the best and brightest in the tech space. If you have a chance to work at Alteryx, I would encourage you to make the move!", 'prosOriginal': None, 'cons': "None - I'm looking forward to 2023!", 'consOriginal': None, 'summary': 'The Best of the Best!', 'summaryOriginal': None, 'advice': None, 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 0, 'countNotHelpful': 0, 'employerResponses': [{'__ref': 'EmployerResponse:4414520'}], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 71567507, 'reviewDateTime': '2022-12-02T08:53:58.140', 'ratingOverall': 5, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'POSITIVE', 'ratingWorkLifeBalance': 5, 'ratingCultureAndValues': 5, 'ratingDiversityAndInclusion': 5, 'ratingSeniorLeadership': 5, 'ratingRecommendToFriend': 'POSITIVE', 'ratingCareerOpportunities': 5, 'ratingCompensationAndBenefits': 5, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 1, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:15169'}, 'location': None, 'originalLanguageId': None, 'pros': 'Compensation is in line with top tech firms - I came from a fortune 100 technology company and got a raise and additional equity here. The tools provided to you are world class, and they consistently invest in helping their people be more productive. Leadership is incredible, best I have ever seen across my 12 year career in sales. Marketing and sales talk to each other, so events are clearly communicated and the sales team has input on creating events sponsored by marketing/getting marketing dollars to cover events that the company should have exposure at. On-premises product is outstanding.', 'prosOriginal': None, 'cons': 'Account penetration in certain segments can be difficult because we are unknown to them. Once your foot is in the door, the ability to solve business problems and integrate into the current technology stack is unmatched. Cloud platform is still developing, but currently does not have the capabilities to be an option for enterprise level companies without the on-premises technology supporting it. 
Still 12-24 months away.', 'consOriginal': None, 'summary': 'They get it', 'summaryOriginal': None, 'advice': "Keep doing what you're doing.", 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 0, 'countNotHelpful': 0, 'employerResponses': [{'__ref': 'EmployerResponse:4414562'}], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 71924396, 'reviewDateTime': '2022-12-16T12:34:18.837', 'ratingOverall': 4, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'POSITIVE', 'ratingWorkLifeBalance': 4, 'ratingCultureAndValues': 4, 'ratingDiversityAndInclusion': 4, 'ratingSeniorLeadership': 5, 'ratingRecommendToFriend': 'POSITIVE', 'ratingCareerOpportunities': 5, 'ratingCompensationAndBenefits': 4, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 2, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:274217'}, 'location': None, 'originalLanguageId': None, 'pros': 'Passionate employees, growing and scaling in the Data Analytics space.', 'prosOriginal': None, 'cons': 'none come to mind worth mentioning at this time', 'consOriginal': None, 'summary': 'Great company', 'summaryOriginal': None, 'advice': None, 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 0, 'countNotHelpful': 0, 'employerResponses': [], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None},
  {'__typename': 'EmployerReview', 'isLegal': True, 'reviewId': 71542996, 'reviewDateTime': '2022-12-01T11:24:00.407', 'ratingOverall': 5, 'ratingCeo': 'APPROVE', 'ratingBusinessOutlook': 'POSITIVE', 'ratingWorkLifeBalance': 5, 'ratingCultureAndValues': 5, 'ratingDiversityAndInclusion': 5, 'ratingSeniorLeadership': 5, 'ratingRecommendToFriend': 'POSITIVE', 'ratingCareerOpportunities': 5, 'ratingCompensationAndBenefits': 5, 'employer': {'__ref': 'Employer:351220'}, 'isCurrentJob': True, 'lengthOfEmployment': 1, 'employmentStatus': 'REGULAR', 'jobEndingYear': None, 'jobTitle': {'__ref': 'JobTitle:2766820'}, 'location': {'__ref': 'City:1146798'}, 'originalLanguageId': None, 'pros': "Alteryx has truly exceeded all my expectations! From the culture it's created for employees, to the amazing peers and leaders I work with. Sr. Leaders have a clear vision and roadmap for the Org. and I'm so excited to be able to be part of it. Best decision I've ever made! Proud to be an Alteryx employee", 'prosOriginal': None, 'cons': 'Absolutely NONE! This organization Rocks!', 'consOriginal': None, 'summary': 'Amazing Company, Amazing People!', 'summaryOriginal': None, 'advice': None, 'adviceOriginal': None, 'isLanguageMismatch': False, 'countHelpful': 0, 'countNotHelpful': 0, 'employerResponses': [{'__ref': 'EmployerResponse:4414563'}], 'isCovid19': False, 'divisionName': None, 'divisionLink': None, 'topLevelDomainId': 1, 'languageId': 'eng', 'translationMethod': None}
 ]
]

I think you probably want findObj_inJS(text, '"reviews"', findAll=True)[1], which is the second match: the list of reviews.
