According to this Regex code for getting the next line of a match, the commented line of my code should work after I added "([^\r\n]+)" to it. But for some reason it isn't working. I'm new to regex, so any tip is very welcome.
import scrapy
import json

class MlSpider(scrapy.Spider):
    url1 = 'https://produto.mercadolivre.com.br/MLB-1304118411-sandalia-feminina-anabela-confortavel-pingente-mac-cod-133-_JM?attributes=COLOR_SECONDARY_COLOR:UHJldGE=,SIZE:MzU=&quantity=1'
    url2 = 'https://www.mercadolivre.com.br/chinelo-kenner-rakka-pretolaranja-36-br-para-adulto-homem/p/MLB19132834?product_trigger_id=MLB19130858&attributes=COLOR:Preto/Azul,SIZE:36 BR&pdp_filters=category:MLB273770|shipping_cost:free&applied_product_filters=MLB19132871&quantity=1'
    name = 'detalhador'
    start_urls = [url2]

    def parse(self, response, **kwargs):
        # This only gets url1, because the text following the string is on the same line as the string
        d = response.xpath("//script[contains(., 'window.__PRELOADED_STATE__')]/text()").re_first(r'(?s)window.__PRELOADED_STATE__ = (.+?\});')
        if not d:  # so this was made to get url2 as well
            # This should get the line below the matching string, but it doesn't
            d = response.xpath("//script[contains(., 'window.__PRELOADED_STATE__')]/text()").re_first(r'(?s)window.__PRELOADED_STATE__ = ([^\r\n]+)')
CodePudding user response:
The issue is with your regular expression. You are not escaping certain symbols that regex treats as metacharacters. Also, you are using a literal ' ' space character where there is actually a newline character immediately after the = sign. Using \s is usually better because it matches any whitespace character, including newlines.
Try using this instead. I already tested and got the results you desire.
d = response.xpath("//script[contains(., 'window.__PRELOADED_STATE__')]/text()"
).re_first(r'window\.__PRELOADED_STATE__\s?\=\s*?(\{.*?\});')
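To check the pattern without running the spider, here is a minimal sketch using only the standard library. The two sample strings are made-up stand-ins for the two page layouts (JSON on the same line as the assignment, and JSON on the following line); they are not the real page contents, and extract_state is a hypothetical helper name.

```python
import json
import re

# Same pattern as in the answer above.
PATTERN = re.compile(r'window\.__PRELOADED_STATE__\s?\=\s*?(\{.*?\});')

# Hypothetical script bodies standing in for the two page layouts.
same_line = 'window.__PRELOADED_STATE__ = {"id": "MLB1"};'
next_line = 'window.__PRELOADED_STATE__ =\n        {"id": "MLB2"};'


def extract_state(script_text):
    """Return the parsed JSON object, or None if the pattern does not match."""
    match = PATTERN.search(script_text)
    if match is None:
        return None
    return json.loads(match.group(1))


state_same = extract_state(same_line)
state_next = extract_state(next_line)
print(state_same, state_next)
```

Note that the lazy .*? stops at the first "});" it finds, so this assumes that sequence does not occur inside the JSON itself.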
The . { and } characters are used as parsing instructions (metacharacters) by regex, so they should be escaped with a \ when you want to match the literal character in your expression. The = sign is not special on its own, so escaping it is optional but harmless.
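A short standard-library sketch of what escaping changes in practice. The sample strings are invented for illustration only.

```python
import re

# An unescaped dot matches any character except a newline,
# so the loose pattern also accepts "windowX__PRELOADED_STATE__".
loose = re.compile(r'window.__PRELOADED_STATE__')
strict = re.compile(r'window\.__PRELOADED_STATE__')

impostor = 'windowX__PRELOADED_STATE__'
loose_hit = loose.search(impostor) is not None
strict_hit = strict.search(impostor) is not None

# re.escape() adds the backslashes for you when a literal string
# has to be embedded in a pattern.
auto = re.compile(re.escape('window.__PRELOADED_STATE__'))
auto_hit = auto.search('window.__PRELOADED_STATE__ = {}') is not None

print(loose_hit, strict_hit, auto_hit)
```

An unescaped dot would still match the real text here, so escaping mainly makes the pattern stricter rather than being the reason the original failed.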
I also removed the (?s) from the beginning of your expression. That flag makes . match newline characters as well, which is only needed if the JSON object itself spans several lines.
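A small sketch of when (?s), the inline equivalent of passing re.DOTALL, changes the result. The multi-line string is a hypothetical example, not taken from the real page.

```python
import re

# Hypothetical script body where the JSON object spans several lines.
multi_line = 'window.__PRELOADED_STATE__ = {\n  "id": "MLB3"\n};'

without_flag = re.search(
    r'window\.__PRELOADED_STATE__\s*=\s*(\{.*?\});', multi_line)
with_flag = re.search(
    r'(?s)window\.__PRELOADED_STATE__\s*=\s*(\{.*?\});', multi_line)

# Without the flag, "." cannot cross the newline after "{", so nothing matches.
print(without_flag)
# With the flag, the capture group holds the whole object, newlines included.
print(with_flag.group(1))
```

If the first attempt without the flag returns nothing on a given page, adding (?s) back is a reasonable thing to try.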