Regex to access JSON data inside a script HTML tag with Scrapy

Time:10-11

I'm new to Scrapy and still learning. I'm trying to access JSON data embedded in a page's HTML, load it into a Python dict, and work with the data later. I tried several things and all of them failed, so I would appreciate any help.

I found a response.css selector for the desired tag; in the Scrapy shell the result looks like this:

response.css('div.rich-snippet script').get()

'<script type="application/ld+json">{\n    some json data with newline chars \n  }\n    ]\n}</script>'

I need everything between the outer {}, so I tried a regex like this:

response.css('div.rich-snippet script').re(r'\{[^}]*\}')

This regex should pick up everything between the braces, but the JSON contains more of these symbols and the response has other content before the JSON data, so it returns an empty list. I tried other patterns too, always with the same result, an empty list:

.re(r'<script>\{[^}]*\}</script>')
.re(r'<script>(.|\n)*?<\/script>')
...
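A minimal sketch of why patterns like these struggle, using the standard-library re module. The sample string is a made-up stand-in (the real page content is not shown here), so treat it as an illustration rather than a reproduction of the actual response:

```python
import re

# Hypothetical stand-in for the script tag; not the real page content.
html = '<script type="application/ld+json">{"a": {"b": 1}, "c": 2}</script>'

# A literal "<script>" cannot match a tag that carries a type attribute.
no_match = re.findall(r'<script>\{[^}]*\}</script>', html)

# [^}]* stops at the first closing brace, so nested JSON is cut short.
truncated = re.findall(r'\{[^}]*\}', html)

print(no_match)
print(truncated)
```

On nested data the second pattern returns only a fragment, which json.loads would then reject.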

So I tried something else: inside the spider I passed the response directly to json.loads and saved the results to a file from the CLI. That doesn't work either; perhaps I'm selecting the tag wrong, or it's not even possible.

    import scrapy
    import json

    class SomeSpider(scrapy.Spider):
        name = 'test'
        start_urls = [
            'url'
        ]

        def parse(self, response, **kwargs):
            json_file = response.css('div.rich-snippet script').get()

            yield json.loads(json_file)

Yet again, an empty result.
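One likely reason for the empty result, sketched with a made-up tag string (an assumption, not the real page): without ::text, .get() returns the whole element including the markup, and json.loads cannot decode a string that starts with "<".

```python
import json

# Made-up stand-in for what .get() returns without ::text: the full tag.
tag = '<script type="application/ld+json">{"name": "example"}</script>'

try:
    json.loads(tag)
    parsed_ok = True
except json.JSONDecodeError:
    # The leading "<" is not valid JSON, so decoding fails immediately.
    parsed_ok = False

print(parsed_ok)
```

Inside a spider the exception would be raised in the callback, so no item is yielded.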

Please help me understand. Thanks.

CodePudding user response:

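Your CSS selector should specify that you only want the content inside the tag, that is, it should end with ::text. Call .get() on the result so that json.loads receives a string rather than a selector list. Your code becomes:

    def parse(self, response, **kwargs):
        # ::text selects only the text inside the script tag; .get() returns it as a str
        json_file = response.css('div.rich-snippet script::text').get()

        yield json.loads(json_file)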
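Since .get() returns None when the selector matches nothing, a small defensive wrapper can keep the callback from raising. This is only a sketch; the helper name load_ld_json is hypothetical and not part of Scrapy:

```python
import json
from typing import Any, Optional


def load_ld_json(raw: Optional[str]) -> Optional[Any]:
    """Parse the text of a JSON-LD script tag; return None if missing or invalid."""
    if raw is None:
        # The selector matched nothing.
        return None
    try:
        # Surrounding whitespace and newlines are harmless, but strip them anyway.
        return json.loads(raw.strip())
    except json.JSONDecodeError:
        return None
```

In the spider this could be called as load_ld_json(response.css('div.rich-snippet script::text').get()), yielding the result only when it is not None.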

You might also want to have a look at: https://github.com/scrapinghub/extruct

It may be a better fit for parsing JSON-LD (application/ld+json) data.

CodePudding user response:

You could take the response as a string and apply a recursive regex to it. Recursion is not supported by the standard-library re module, but it is supported by the third-party regex module.
That said, a possible approach could be:

import regex

# code before 

some_json_string = response.css('div.rich-snippet script').get()
match = regex.search(r'\{(?:[^{}]*|(?R))+\}', some_json_string)

if match:
    relevant_json = match.group(0)
    # process it further here

See a demo on regex101.com for the expression.
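If installing the regex module is not an option, a different technique (not the recursive regex described above) is to let the standard library's json.JSONDecoder.raw_decode find where the object ends. A minimal sketch, assuming the first "{" in the string is the start of the JSON; the helper name first_json_object is hypothetical:

```python
import json
from typing import Any, Optional


def first_json_object(text: str) -> Optional[Any]:
    """Decode the JSON object starting at the first '{' in text."""
    start = text.find('{')
    if start == -1:
        return None
    # raw_decode parses one JSON value beginning at index `start` and
    # returns (value, end_index); anything after the value is ignored.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj
```

Because the decoder tracks nesting itself, nested braces are not a problem. It still raises json.JSONDecodeError if the text at that position is not valid JSON.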


Edit:

It seems that ::text is supported, so it is better to use the other answer (the ::text selector) instead.
