I'm using the arsenic library to scrape a webpage, and then BeautifulSoup to parse the page source. The soup contains a large HTML document with lots of scripts. I need the ninth script from the end.
page_source = await session.get_page_source()
soup = bs(page_source, 'html.parser')
scripts = soup.find_all('script')
script9 = scripts[-9].next
Here is script9:
sometext;
var thumbdata = {
thumbs: [{avatar: "/i/nophoto.jpg", username: "IslandGirlSearching",la:"0 second ",chatid: "0",userid: "2088789", age:"21",city:"Cebu"},{avatar: "/p/2021-08/Cristina266/ava-1629535964.jpg", username: "Cristina266",la:"0 second ",chatid: "0",userid: "2095868", age:"26",city:"Pasig City"} ] };
var source = sometext;
Then I followed the example you shared:
pattern = re.compile(r"var thumbdata = {\n"
r"(.*?);")
m = pattern.match(script9.string)
thumbs = json.loads(m.groups()[0])
for thumb in thumbs:
    print(thumb)
I checked my regex and it is correct. But when I run this code, I get an AttributeError:
AttributeError: 'NoneType' object has no attribute 'groups'
CodePudding user response:
You have several issues with your approach still:
1. To pass a string to json.loads(), it needs to be valid JSON; otherwise, you'll get exceptions. For what you're attempting to capture, you need to include the leading { token as part of your capture group. Consolidate your two separate patterns as such: var thumbdata = ({\n.*?);
2. You'll notice that even with that change to grab the leading curly brace, the string you've extracted still isn't valid JSON. Unlike plain JavaScript object literals, JSON requires every key name to be enclosed in double quotes; the text you're extracting does not do this. As such, you'll need to swap out the built-in JSON parser (which is strictly spec-compliant and will not parse this data as-is) for something like hjson, which doesn't implement a spec with this restriction.
3. re.match() doesn't behave as you seem to think it does. The documentation for this method is illuminating in this specific circumstance (emphasis mine): "Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line." This is important because the string data in script9 does not begin with text that your pattern matches. Swap the invocation of re.match() for re.search() instead.
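To illustrate the difference between re.match() and re.search(), here is a minimal sketch using only the standard library. The script_text value is a shortened, hypothetical stand-in for the real script contents, and the brace in the pattern is escaped for clarity, which is a small deviation from the pattern shown above:

```python
import re

# Shortened stand-in for the script text in the question (hypothetical data).
script_text = 'sometext;\nvar thumbdata = {\nthumbs: [] };\nvar source = sometext;\n'

pattern = re.compile(r"var thumbdata = (\{\n.*?);")

# match() only tries position 0, where the text is "sometext;", so it returns None.
match_result = pattern.match(script_text)

# search() scans the whole string and finds the assignment on the second line.
search_result = pattern.search(script_text)

print(match_result)
print(repr(search_result.group(1)))
```

Calling .groups() on the None returned by match() is what raises the AttributeError from the question.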
Making a few more adjustments for the changes described above, your code would look something more like the following:
import re
import hjson
script9 = ''' sometext;
var thumbdata = {
thumbs: [{avatar: "/i/nophoto.jpg", username: "IslandGirlSearching",la:"0 second ",chatid: "0",userid: "2088789", age:"21",city:"Cebu"},{avatar: "/p/2021-08/Cristina266/ava-1629535964.jpg", username: "Cristina266",la:"0 second ",chatid: "0",userid: "2095868", age:"26",city:"Pasig City"} ] };
var source = sometext;
'''
pattern = re.compile(r"var thumbdata = ({\n.*?);")
m = pattern.search(script9)
thumbs = list(hjson.loads(m.groups()[0]).items())
print(thumbs)
outputs:
[('thumbs', [OrderedDict([('avatar', '/i/nophoto.jpg'), ('username', 'IslandGirlSearching'), ('la', '0 second '), ('chatid', '0'), ('userid', '2088789'), ('age', '21'), ('city', 'Cebu')]), OrderedDict([('avatar', '/p/2021-08/Cristina266/ava-1629535964.jpg'), ('username', 'Cristina266'), ('la', '0 second '), ('chatid', '0'), ('userid', '2095868'), ('age', '26'), ('city', 'Pasig City')])])]
('thumbs', [OrderedDict([('avatar', '/i/nophoto.jpg'), ('username', 'IslandGirlSearching'), ('la', '0 second '), ('chatid', '0'), ('userid', '2088789'), ('age', '21'), ('city', 'Cebu')]), OrderedDict([('avatar', '/p/2021-08/Cristina266/ava-1629535964.jpg'), ('username', 'Cristina266'), ('la', '0 second '), ('chatid', '0'), ('userid', '2095868'), ('age', '26'), ('city', 'Pasig City')])])
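If installing hjson is not an option, a rough alternative is to quote the bare keys with a regex and then use the built-in json module. This is only a sketch: extract_thumbs is a hypothetical helper name, and the key-quoting regex assumes that no string value contains a { or , followed by a bare word and a colon, which holds for the sample data but may not hold for every page. It also checks for a failed search instead of calling a method on None:

```python
import json
import re

script_text = '''sometext;
var thumbdata = {
thumbs: [{avatar: "/i/nophoto.jpg", username: "IslandGirlSearching",la:"0 second ",chatid: "0",userid: "2088789", age:"21",city:"Cebu"},{avatar: "/p/2021-08/Cristina266/ava-1629535964.jpg", username: "Cristina266",la:"0 second ",chatid: "0",userid: "2095868", age:"26",city:"Pasig City"} ] };
var source = sometext;
'''


def extract_thumbs(text):
    """Return the list stored under 'thumbs', or [] if the pattern is not found."""
    m = re.search(r"var thumbdata = (\{\n.*?);", text)
    if m is None:
        # No match: return an empty list rather than failing on None.
        return []
    # Wrap bare keys in double quotes so the text becomes valid JSON.
    # Assumption: values never contain '{' or ',' followed by word characters and ':'.
    quoted = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', m.group(1))
    return json.loads(quoted)["thumbs"]


thumbs = extract_thumbs(script_text)
for thumb in thumbs:
    print(thumb["username"], thumb["city"])
```

A tolerant parser such as hjson is likely the more robust choice when the values may contain arbitrary text; the regex approach trades that robustness for having no extra dependency.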