I recently finished a web scraping/automation program for Zillow for my boot camp. My instructor encouraged me to Google the problem, as I was only able to get the first couple of listings.
I stumbled upon this answer: Zillow web scraping using Selenium & BeautifulSoup
This worked well: instead of using bs4's find_all method, I was able to get all of my listings neatly placed in a JSON file, which was much easier to go through to complete the project. I only recently learned about regex and the re module in Python, and I was wondering if someone could explain how this code retrieves the nicely structured JSON from the GET response, and whether this would work for other websites?
Code was:
self.data = json.loads(re.search(r'!--(\{"queryState".*?)-->', self.response.text).group(1))
- What arguments were taken into account in json.loads?
- How did the oddly written !--({"queryState".*?)--> work?
- What is the purpose of .group(1)?
I hate just copying and pasting, but somehow this worked like magic and I'd like to know how to replicate it in future projects. Sorry if this is a loaded question, but the re.search documentation wasn't as helpful as I had hoped.
CodePudding user response:
json.loads() can work with a single argument: a string that is parsed as JSON. The return value is typically a dictionary or a list (depending on the JSON). Here, that single string is the return value of the call to .group(1).
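As a small illustration of that single-argument call, here is a sketch using two made-up JSON strings (they are not taken from Zillow's page):

```python
import json

# A JSON object in text form becomes a Python dict.
parsed_obj = json.loads('{"queryState": 123}')

# A JSON array in text form becomes a Python list.
parsed_list = json.loads('[1, 2, 3]')

print(type(parsed_obj).__name__, parsed_obj)
print(type(parsed_list).__name__, parsed_list)
```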
- How is r'!--(\{"queryState".*?)-->' oddly written? It is a regular expression that is being applied to self.response.text using re.search(). It looks for the literal !--, followed by something starting with {"queryState", up to the next literal -->. The \ is there to indicate that the { is to be matched literally as well. The .*? means "any character, zero or more times, not greedily" (so the match stops at the first --> instead of running on to a later one).
- .group(1) returns the first matched group in the regex, which is the first part in parentheses. In this case, that is anything in between !-- and -->, if it starts with {"queryState".
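To see what the non-greedy .*? and the group numbers do, here is a minimal sketch on a made-up one-line string (the text variable and its contents are assumptions for illustration only). The second pattern is the same regex with a greedy .* swapped in for comparison:

```python
import re

# Made-up text with two comment-like blocks on one line.
text = '!--{"queryState": 1}--> filler !--{"other": 2}-->'

lazy = re.search(r'!--(\{"queryState".*?)-->', text)
greedy = re.search(r'!--(\{"queryState".*)-->', text)

print(lazy.group(0))    # whole match, including !-- and -->
print(lazy.group(1))    # only the part inside the parentheses
print(greedy.group(1))  # greedy version runs on to the last -->
```

Only the lazy group(1) is valid JSON on its own; the greedy capture swallows the first --> and everything up to the final one.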
So, if self.response.text were this:
something
!--{"not queryState": 123}-->
something else
!--{"queryState": 123}-->
something else
Then running this:
self.data = json.loads(re.search(r'!--(\{"queryState".*?)-->', self.response.text).group(1))
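Would set self.data to the dictionary {'queryState': 123}, because json.loads parses the captured string '{"queryState": 123}' into a Python object.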
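The walk-through above can be tried as a standalone script. This is only a sketch: response_text is a stand-in for self.response.text, copied from the sample above, and the None check is an addition that the original one-liner does not have.

```python
import json
import re

# Stand-in for self.response.text, copied from the sample above.
response_text = """something
!--{"not queryState": 123}-->
something else
!--{"queryState": 123}-->
something else"""

match = re.search(r'!--(\{"queryState".*?)-->', response_text)

# re.search returns None when nothing matches, so check before calling .group().
if match is None:
    data = None
else:
    data = json.loads(match.group(1))

print(data)
```

Without that check, a page that lacks the expected block would make the one-liner fail with an AttributeError, since None has no .group() method.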