How to loop over a list of URLs in Scrapy and output only the response body to an XML file

I have tried the Pipeline method, but I am not sure I am doing it right, since most tutorials pick out portions of response.body using selectors.

I can, however, parse the response in a separate script that gives me all the data I need, even though the data is jumbled up with other variables. Therefore I only need Scrapy to dump response.body into either an .xml or a .txt file.

I can do it with a single URL, but the moment I introduce multiple URLs, each parse overwrites the previous output. I believe there might be a simpler workaround that avoids pipelines/items.py, given that I only need response.body.
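What I have in mind is roughly this (an untested sketch): derive the filename from response.url inside the callback and write response.body straight to disk, so each URL gets its own file:

def parse_api(self, response):
    # Untested sketch: one file per URL, named after the pid query parameter
    eventid = response.url.split("pid=")[1].split("&")[0]
    with open(eventid + "_" + today + ".xml", "wb") as f:
        f.write(response.body)  # raw bytes, no selectors needed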

Forgive the indentation; it got mangled when I copied the code over.

import random
from datetime import datetime

import scrapy
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

from items import TicketsItem  # wherever TicketsItem is defined in your project

linkarr = df['URLOUT'].tolist()  # df is a DataFrame of URLs loaded elsewhere
today = datetime.today().strftime('%Y%m%d')

class MpvticketSpider(scrapy.Spider):

    name = 'mpvticket'
    # start_urls is unnecessary here: start_requests() builds the requests from linkarr
    handle_httpstatus_list = [403, 502, 503, 404]

    def start_requests(self):

        for url in linkarr:

            eventid = str(url).strip().split("pid=")[1].split("&")[0]
            filename_xml = str(eventid) + "_" + str(today) + ".xml"
            filename_txt = str(eventid) + "_" + str(today) + ".txt"
            
            print("\n FIRST  URL BEING RUN: ",url)
            pid = str(url).split("pid=")[1].split('&')[0]
            username = 'XXXX'
            password = 'XXXX'
            port = 22225
            session_id = random.random()
            super_proxy_url = ('http://%s-country-us-session-%s:%s@zproxy.lum-superproxy.io:%d' %
                (username, session_id, password, port))

            headers = {
                'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
                'accept-language': 'en-US,en;q=0.9',
                'cache-control': 'max-age=0',
                'referer': 'https://www.mlb.com/',
                'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
                'sec-ch-ua-mobile': '?0',
                'sec-ch-ua-platform': '"Windows"',
                'sec-fetch-dest': 'document',
                'sec-fetch-mode': 'navigate',
                'sec-fetch-site': 'same-origin',
                'sec-fetch-user': '?1',
                'upgrade-insecure-requests': '1',
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
            }
            yield scrapy.Request(url, callback=self.parse_api, meta={'proxy': super_proxy_url}, headers=headers)

    def parse_api(self, response):
        item = TicketsItem()
        raw_data = response.body
        soup = BeautifulSoup(raw_data, 'lxml')
        item['data'] = soup
        yield item
        # The commented portion below was my original method, but it overwrote Output.xml:
        # filename_xml and filename_txt came from the loop in start_requests, so every
        # response ended up writing to the same (last) filename.
        #try:
        #    with open(filename_xml, "w") as f:
        #        f.write(str(soup))
        #except:
        #    with open(filename_txt, 'w') as f:
        #        f.write(str(soup))

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(MpvticketSpider)
    process.start()

CodePudding user response:

You should move the logic that determines the filename/output path into your parse method, and then add it as a field on your yielded item. Then, in your item pipeline, you can save the body to the output path and drop the item, since no further processing is needed at that point.

So change your parse method to something like this:

def parse_api(self, response):
    url = response.url
    eventid = str(url).strip().split("pid=")[1].split("&")[0]
    filename_xml = str(eventid) + "_" + str(today) + ".xml"
    filename_txt = str(eventid) + "_" + str(today) + ".txt"
    data = response.xpath("//body").get()
    item = TicketsItem()
    item['data'] = data                  # scrapy.Item fields use dict-style access
    item['filename_xml'] = filename_xml
    item['filename_txt'] = filename_txt
    yield item
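Because the filenames are now derived from response.url inside the callback, every response gets its own pair of filenames, so nothing gets overwritten.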

You would need to change your item to something like this:

class TicketsItem(scrapy.Item):
    filename_xml = scrapy.Field()
    filename_txt = scrapy.Field()
    data = scrapy.Field()

Then your items pipeline could look like this:

from scrapy.exceptions import DropItem

class SpiderPipeline:

    def process_item(self, item, spider):
        for filename in [item['filename_txt'], item['filename_xml']]:
            with open(filename, "wt") as fd:
                fd.write(item['data'])
        raise DropItem("body saved to disk; no further processing needed")
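One detail the snippet above leaves implicit: Scrapy only calls a pipeline that has been enabled. Since everything here runs from a single script via CrawlerProcess, one way to enable it (assuming SpiderPipeline is defined in that same script) is to pass it through the settings:

if __name__ == '__main__':
    # "__main__.SpiderPipeline" assumes the pipeline class lives in this script;
    # in a full Scrapy project you would put its dotted path in ITEM_PIPELINES instead.
    process = CrawlerProcess(settings={
        "ITEM_PIPELINES": {"__main__.SpiderPipeline": 300},
    })
    process.crawl(MpvticketSpider)
    process.start()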