I'm new to Scrapy and still learning. I'm trying to extract JSON data embedded in a page's HTML and load it into a Python dict so I can work with it later. I have tried several things and all of them failed, so I would appreciate any help.
I found a CSS selector for the desired tag; in the Scrapy shell the result looks like this:
response.css('div.rich-snippet script').get()
'<script type="application/ld+json">{\n some json data with newline chars \n }\n ]\n}</script>'
I need everything between the outermost { and }, so I tried a regex like this:
response.css('div.rich-snippet script').re(r'\{[^}]*\}')
This regex should pick up everything between the braces, but the JSON contains more of these symbols and the response has other content before the JSON data, so it just returns an empty list. I tried other patterns, but the result was always the same, an empty list:
.re(r'<script>\{[^}]*\}</script>')
.re(r'<script>(.|\n)*?<\/script>')
...
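The two failure modes described above can be reproduced outside Scrapy with the standard re module. This is only a minimal sketch: the html string is a made-up stand-in for what .get() returns, not the real page.

```python
import json
import re

# Hypothetical stand-in for the string returned by .get(); not the real page.
html = '<script type="application/ld+json">{"name": "x", "offers": {"price": 1}}</script>'

# A literal "<script>" never matches, because the real tag has a type attribute.
no_match = re.findall(r'<script>\{[^}]*\}</script>', html)

# [^}]* stops at the first closing brace, so a nested object is cut short.
truncated = re.search(r'\{[^}]*\}', html).group(0)

# The truncated fragment is not valid JSON.
try:
    json.loads(truncated)
    is_valid = True
except json.JSONDecodeError:
    is_valid = False

print(no_match)
print(truncated)
print(is_valid)
```

So a pattern anchored on a bare <script> finds nothing, and a pattern based on [^}]* returns at most a fragment that json.loads cannot parse.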
So I tried something else: inside the spider I passed the selected tag directly to json.loads and saved the results to a file from the CLI. That doesn't work either; perhaps I'm selecting the tag incorrectly, or it's not even possible.
import json

import scrapy


class SomeSpider(scrapy.Spider):
    name = 'test'
    start_urls = [
        'url'
    ]

    def parse(self, response, **kwargs):
        json_file = response.css('div.rich-snippet script').get()
        yield json.loads(json_file)

Yet again, an empty result.
Please help me understand. Thanks.
CodePudding user response:
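Your CSS selector should specify that you only want the content inside the tag, that is, it should end with ::text. You also need .get() so that json.loads receives a string rather than a selector list. Your code becomes:

    def parse(self, response, **kwargs):
        # ::text selects only the script body, without the <script> tags
        json_file = response.css('div.rich-snippet script::text').get()
        yield json.loads(json_file)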
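As a rough sketch of the parsing step on its own, the helper below (parse_ld_json is a hypothetical name, not part of Scrapy) takes the string that the ::text selector would return and guards against a missing tag or malformed JSON. The sample string is invented for illustration.

```python
import json


def parse_ld_json(script_text):
    """Parse the text content of a JSON-LD <script> tag; return None on failure."""
    if script_text is None:  # .get() returns None when the selector matches nothing
        return None
    try:
        return json.loads(script_text.strip())
    except json.JSONDecodeError:
        return None


# In a spider, script_text would come from something like:
#   response.css('div.rich-snippet script::text').get()
sample = '{\n "@type": "Product",\n "name": "Example"\n}\n'
data = parse_ld_json(sample)
print(data)
```

Returning None instead of raising is just one design choice; inside a spider you may prefer to log the error and skip the item.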
You might also want to have a look at extruct: https://github.com/scrapinghub/extruct
It might be a better fit for parsing JSON-LD.
CodePudding user response:
You could take the response as a string and run a recursive regex on it. Recursion is not supported by the built-in re module, but it is supported by the newer third-party regex module.
That said, a possible approach could be:
import regex

# code before
some_json_string = response.css('div.rich-snippet script').get()
# (?R) recurses into the whole pattern, so nested braces stay balanced
match = regex.search(r'\{(?:[^{}]*|(?R))+\}', some_json_string)
if match:
    relevant_json = match.group(0)
    # process it further here
See a demo on regex101.com for the expression.
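If installing the third-party regex module is not an option, a similar result can be sketched with the standard library alone. Note that this swaps in a different technique from the recursive regex above: json.JSONDecoder.raw_decode parses one JSON value starting at a given index and ignores whatever follows it. The function name and sample string are hypothetical.

```python
import json


def extract_first_json_object(raw):
    """Return the first JSON object found in raw, or None if there is none."""
    start = raw.find('{')
    if start == -1:
        return None
    try:
        # raw_decode stops at the end of the object and ignores trailing text
        obj, _end = json.JSONDecoder().raw_decode(raw, start)
    except json.JSONDecodeError:
        return None
    return obj


# Made-up stand-in for the full tag string returned by .get()
raw_tag = '<script type="application/ld+json">{\n "name": "Example",\n "offers": {"price": "9.99"}\n}</script>'
result = extract_first_json_object(raw_tag)
print(result)
```

This assumes the first { in the string really starts the JSON payload, which holds for the tag shown in the question but not for arbitrary HTML.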
Edit:
It seems that ::text is supported, so it is better to use the other answer instead.