How to get data from a <script> var using BeautifulSoup?

I'm using the arsenic library to scrape a webpage, and then BeautifulSoup to parse the page source. The soup contains a large HTML document with lots of scripts. I need the ninth script from the end.

    page_source = await session.get_page_source()
    soup = bs(page_source, 'html.parser')
    scripts = soup.find_all('script')
    script9 = scripts[-9].next

Here is script9:

    sometext;
var thumbdata = {
  thumbs: [{avatar: "/i/nophoto.jpg", username: "IslandGirlSearching",la:"0 second ",chatid: "0",userid: "2088789", age:"21",city:"Cebu"},{avatar: "/p/2021-08/Cristina266/ava-1629535964.jpg", username: "Cristina266",la:"0 second ",chatid: "0",userid: "2095868", age:"26",city:"Pasig City"}  ] }; 
  var source = sometext;

Then I followed the example you shared:

    pattern = re.compile(r"var thumbdata = {\n"
                         r"(.*?);")

    m = pattern.match(script9.string)
    thumbs = json.loads(m.groups()[0])

    for thumb in thumbs:
        print(thumb)

I checked my regex and it looks correct, but when I run this code I get an AttributeError:

AttributeError: 'NoneType' object has no attribute 'groups'

CodePudding user response:

There are still several issues with your approach:

To pass a string to json.loads(), it needs to be valid JSON; otherwise, you'll get exceptions. For what you're attempting to capture, you need to include the leading { token as part of your capture group. Consolidate your two separate patterns as such:
```
var thumbdata = ({\n.*?);
```
You'll notice that even with the leading curly brace captured, the string you've extracted still isn't valid JSON. Unlike plain JavaScript object literals, JSON requires every key name to be enclosed in double quotes, and the text you're extracting doesn't do that. As such, you'll need to swap out the built-in JSON parser (which is strictly spec-compliant and will not parse this data as-is) for something like hjson, whose format does not have this restriction.
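
To see the quoting restriction in isolation, here is a minimal sketch using only the standard library. The sample string is a shortened, hypothetical stand-in for your data, and the key-quoting regex is an assumption of mine, not part of any library: it treats any identifier directly after `{` or `,` as a key, so it would corrupt a value that happens to contain text like `, word:`. For anything beyond a quick check, a tolerant parser such as hjson is probably the safer route.

```python
import json
import re

# Hypothetical, shortened sample shaped like the page's JavaScript object literal.
js_object = '{thumbs: [{avatar: "/i/nophoto.jpg", username: "IslandGirlSearching",la:"0 second "}]}'


def try_json(text):
    """Return the parsed value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# The strict parser rejects unquoted key names.
strict_result = try_json(js_object)

# Assumption: a key is an identifier that directly follows "{" or ",".
# This is a naive rewrite and is not safe for arbitrary string values.
quoted = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', js_object)
data = try_json(quoted)

print(strict_result)
print(data["thumbs"][0]["username"])
```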

re.match() doesn't behave as you seem to think it does. A dive into the documentation for this method is illuminating in this specific circumstance (emphasis mine):

Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

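This is important because the string in script9 does not begin with text that matches your pattern: it starts with `    sometext;`, so re.match() returns None, and calling .groups() on None raises the AttributeError you're seeing. Swap the invocation of re.match() for re.search(), which scans the whole string for the first match.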
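
The difference is easy to check on a small input. The sketch below uses a shortened, hypothetical stand-in for the script text (an empty `thumbs` list) and the consolidated pattern from above; it also guards against a missing match instead of calling `.groups()` on `None`.

```python
import re

# Shortened, hypothetical stand-in for the script text.
script_text = (
    "    sometext;\n"
    "var thumbdata = {\n"
    "  thumbs: [] };\n"
    "  var source = sometext;\n"
)

pattern = re.compile(r"var thumbdata = ({\n.*?);")

# match() only tries position 0, and the text starts with "    sometext;".
anchored = pattern.match(script_text)

# MULTILINE changes what ^ and $ mean, not where match() starts looking.
anchored_multiline = re.compile(pattern.pattern, re.MULTILINE).match(script_text)

# search() scans forward and returns the first match anywhere in the string.
found = pattern.search(script_text)

# Guard against None before touching the capture group.
captured = found.group(1) if found else None

print(anchored, anchored_multiline)
print(repr(captured))
```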

Making a few more adjustments for the changes described above, your code would look something more like the following:

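```python
import re
import hjson

script9 = '''    sometext;
var thumbdata = {
  thumbs: [{avatar: "/i/nophoto.jpg", username: "IslandGirlSearching",la:"0 second ",chatid: "0",userid: "2088789", age:"21",city:"Cebu"},{avatar: "/p/2021-08/Cristina266/ava-1629535964.jpg", username: "Cristina266",la:"0 second ",chatid: "0",userid: "2095868", age:"26",city:"Pasig City"}  ] }; 
  var source = sometext;
'''

# Capture the whole object literal, including the leading "{".
pattern = re.compile(r"var thumbdata = ({\n.*?);")

# search() scans the whole string; match() would only try position 0.
m = pattern.search(script9)

# hjson accepts unquoted key names, which json.loads() rejects.
thumbs = list(hjson.loads(m.groups()[0]).items())
print(thumbs)

# Print each (key, value) pair on its own line, as in your original loop.
for thumb in thumbs:
    print(thumb)
```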


Output:

[('thumbs', [OrderedDict([('avatar', '/i/nophoto.jpg'), ('username', 'IslandGirlSearching'), ('la', '0 second '), ('chatid', '0'), ('userid', '2088789'), ('age', '21'), ('city', 'Cebu')]), OrderedDict([('avatar', '/p/2021-08/Cristina266/ava-1629535964.jpg'), ('username', 'Cristina266'), ('la', '0 second '), ('chatid', '0'), ('userid', '2095868'), ('age', '26'), ('city', 'Pasig City')])])]
('thumbs', [OrderedDict([('avatar', '/i/nophoto.jpg'), ('username', 'IslandGirlSearching'), ('la', '0 second '), ('chatid', '0'), ('userid', '2088789'), ('age', '21'), ('city', 'Cebu')]), OrderedDict([('avatar', '/p/2021-08/Cristina266/ava-1629535964.jpg'), ('username', 'Cristina266'), ('la', '0 second '), ('chatid', '0'), ('userid', '2095868'), ('age', '26'), ('city', 'Pasig City')])])
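
If you only need the individual entries, indexing the parsed mapping by its `thumbs` key may be more convenient than converting it to a list of items. A small sketch, where `parsed` is a hand-written stand-in for the value returned by `hjson.loads(...)`, trimmed to three fields per entry:

```python
from collections import OrderedDict

# Stand-in for the mapping returned by hjson.loads(...), trimmed to a few fields.
parsed = OrderedDict([
    ("thumbs", [
        OrderedDict([("username", "IslandGirlSearching"), ("age", "21"), ("city", "Cebu")]),
        OrderedDict([("username", "Cristina266"), ("age", "26"), ("city", "Pasig City")]),
    ]),
])

# Index by key instead of converting the mapping to (key, value) pairs.
usernames = [thumb["username"] for thumb in parsed["thumbs"]]

for thumb in parsed["thumbs"]:
    print(thumb["username"], thumb["age"], thumb["city"])
```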