Regex to get next line of match returns empty

Time:08-19

According to this regex code for getting the next line of a match, the commented line of my code should work after I added "([^\r\n]+)" to it. But for some reason it isn't working. I'm new to regex, so any tip is very welcome.

import scrapy
import json


class MlSpider(scrapy.Spider):
    url1 = 'https://produto.mercadolivre.com.br/MLB-1304118411-sandalia-feminina-anabela-confortavel-pingente-mac-cod-133-_JM?attributes=COLOR_SECONDARY_COLOR:UHJldGE=,SIZE:MzU=&quantity=1'
    url2 = 'https://www.mercadolivre.com.br/chinelo-kenner-rakka-pretolaranja-36-br-para-adulto-homem/p/MLB19132834?product_trigger_id=MLB19130858&attributes=COLOR:Preto/Azul,SIZE:36 BR&pdp_filters=category:MLB273770|shipping_cost:free&applied_product_filters=MLB19132871&quantity=1'
    name = 'detalhador'
    start_urls = [url2]

    def parse(self, response, **kwargs):
        # This only works for url1, because there the text that follows the string is on the same line as the string
        d = response.xpath("//script[contains(., 'window.__PRELOADED_STATE__')]/text()").re_first(r'(?s)window.__PRELOADED_STATE__ = (.+?\});')

        if not d:  # so this was added to handle url2 as well
            # This should get the line below the matching string, but it doesn't
            d = response.xpath("//script[contains(., 'window.__PRELOADED_STATE__')]/text()").re_first(r'(?s)window.__PRELOADED_STATE__ =  ([^\r\n]+)')

Answer:

The issue is with your regex. Some characters that regex treats as metacharacters are not escaped. More importantly, the pattern uses a literal ' ' space character where the page actually has a newline character immediately after the = sign. Using \s is usually better because it matches any whitespace character, including newlines.

Try using this instead. I already tested and got the results you desire.

d = response.xpath("//script[contains(., 'window.__PRELOADED_STATE__')]/text()"
                   ).re_first(r'window\.__PRELOADED_STATE__\s?\=\s*?(\{.*?\});')
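
To check the pattern without running the spider, here is a minimal sketch using only the standard re module. The two script bodies are made-up stand-ins for the two page layouts (not real page content), and extract_state is a hypothetical helper. It assumes re_first returns the first capture group of the first match, which is how it behaves for a pattern with a single group.

```python
import json
import re

# Pattern from the answer above, unchanged.
FIXED = re.compile(r'window\.__PRELOADED_STATE__\s?\=\s*?(\{.*?\});')
# First pattern from the question, with its literal spaces around "=".
ORIGINAL = re.compile(r'(?s)window.__PRELOADED_STATE__ = (.+?\});')

# Hypothetical script bodies (assumptions, not copied from the site).
same_line = 'window.__PRELOADED_STATE__ = {"id": "MLB1", "price": 10};'
next_line = 'window.__PRELOADED_STATE__ =\n{"id": "MLB2", "price": 20};'


def extract_state(pattern, script_text):
    """Return the parsed JSON object, or None if the pattern does not match."""
    match = pattern.search(script_text)
    return json.loads(match.group(1)) if match else None


for label, text in (("same line", same_line), ("next line", next_line)):
    print(label, "| original:", extract_state(ORIGINAL, text),
          "| fixed:", extract_state(FIXED, text))
```

The original pattern finds nothing in the second layout because it requires a space after the = sign, while the fixed pattern accepts any whitespace there.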

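The . { and } characters have special meaning in regex (. matches any character, and braces are used for repetition counts), so they should be escaped with a \ when you want the literal character in your expression. The = character is not special; escaping it is harmless but not required.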
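
A small sketch of what escaping changes, using a made-up look-alike string (an assumption for illustration): an unescaped . also matches characters other than a dot, and re.escape can build the escaped form for you.

```python
import re

# Hypothetical look-alike: "X" where the real text has ".".
look_alike = 'windowX__PRELOADED_STATE__ = {}'
real = 'window.__PRELOADED_STATE__ = {}'

# Unescaped "." matches any character, so the look-alike matches too.
unescaped = re.search(r'window.__PRELOADED_STATE__', look_alike)
# Escaped "\." only matches a literal dot.
escaped = re.search(r'window\.__PRELOADED_STATE__', look_alike)

# re.escape escapes every regex metacharacter in a literal string.
literal = re.escape('window.__PRELOADED_STATE__')

print(bool(unescaped))                      # True
print(bool(escaped))                        # False
print(bool(re.search(literal, real)))       # True
print(bool(re.search(literal, look_alike))) # False
```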

I also removed the (?s) from the beginning of your expression. It makes . match newline characters as well, which this pattern does not need as long as the JSON object sits on a single line.
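
A minimal sketch of the case where the flag would matter, using a made-up script body whose JSON spans several lines (an assumption; the real pages may keep it on one line). The (?s) prefix is the inline form of re.DOTALL.

```python
import re

# Hypothetical script body with the JSON object spread over three lines.
multi_line = 'window.__PRELOADED_STATE__ = {\n  "id": "MLB3"\n};'

# Without (?s), "." stops at the first newline, so ".*?" cannot reach "};".
without_flag = re.search(r'__PRELOADED_STATE__\s*=\s*(\{.*?\});', multi_line)
# With (?s), "." also matches "\n" and the whole object is captured.
with_flag = re.search(r'(?s)__PRELOADED_STATE__\s*=\s*(\{.*?\});', multi_line)

print(without_flag)                 # None
print(repr(with_flag.group(1)))
```

If the fixed pattern ever returns nothing for a page, putting (?s) back is worth trying first.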
