Help needed: using Scrapy to crawl stock information into a TXT file, but the file is empty

Time:11-03

Could someone please take a look at my problem? I have recently been learning the Scrapy crawler framework. Following the example code from instructor Song Tian (Beijing Institute of Technology), I picked my own target site and ran the spider with the scrapy crawl command. The CMD command line shows no errors, but the TXT file has no content. My code is below; please help me debug it.

Target site: Gucheng (gucheng.com)
Stock list link: https://hq.gucheng.com/gpdmylb.html
Stock detail link: 'https://hq.gucheng.com/' + stock
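
As a small illustration of how the detail link is formed, the sketch below pulls a stock code out of an href with the same pattern the spider uses and appends it to the base URL. The sample hrefs and the helper name `build_stock_url` are made up for illustration; the real link format on the site may differ.

```python
import re

BASE_URL = 'https://hq.gucheng.com/'
STOCK_CODE = re.compile(r'[S][HZ]\d{6}')  # e.g. SH600000 or SZ000001


def build_stock_url(href):
    """Return the detail-page URL for the first stock code in href, or None."""
    match = STOCK_CODE.search(href)
    if match is None:
        return None  # not a stock link, so the caller can skip it
    return BASE_URL + match.group(0)


# Hypothetical hrefs, for illustration only
print(build_stock_url('https://hq.gucheng.com/SH600000/'))
print(build_stock_url('https://hq.gucheng.com/gpdmylb.html'))
```

Returning None for non-matching links avoids relying on a bare try/except around an index lookup.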

1. The spider file (stocks.py)

import scrapy
import re


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    # allowed_domains = ['hq.gecheng.com']
    start_urls = ['https://hq.gucheng.com/gpdmylb.html']

    def parse(self, response):
        # Extract the href of every <a> tag
        kv = {'user-agent': 'Mozilla/5.0'}  # simulate a browser request
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.findall(r"[S][HZ]\d{6}", href)[0]  # use a regular expression to get a valid stock code
                url = 'https://hq.gucheng.com/' + stock
                yield scrapy.Request(url, callback=self.parse_stock, headers=kv)
                # the callback argument names the method that handles this URL: parse_stock
                # return item
            except:
                continue

    def parse_stock(self, response):
        infoDict = {}  # create an empty dictionary for each page
        stockInfo = response.css('stock_top clearfix')
        name = stockInfo.css('stock_title').extract()[0]
        keyList = stockInfo.css('dt').extract()
        valueList = stockInfo.css('dd').extract()
        for i in range(len(keyList)):
            key = re.findall(r'<dt>.*</dt>', keyList[i])[0][1:-5]
            # key = key.replace('\u2003', '')
            # key = key.replace('\xa0', '')
            try:
                val = re.findall(r'<dd>\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except:
                val = '-'
            infoDict[key] = val

        infoDict.update(
            {'stock name': re.findall('\s.*\(', name)[0].split()[0] + re.findall('\>.*\<', name)[0][1:-1]})
        yield infoDict
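
The slicing in parse_stock depends on exact character offsets, which is easy to get wrong. Below is a minimal sketch of an alternative, assuming the detail page uses plain `<dt>`/`<dd>` pairs: capture groups return only the text between the tags. The sample HTML and the helper name `extract_pairs` are invented for illustration and are not taken from the site.

```python
import re

# Invented sample markup; the real page structure is an assumption
SAMPLE_HTML = """
<dl><dt>Open</dt><dd>10.52</dd></dl>
<dl><dt>Volume</dt><dd>--</dd></dl>
"""


def extract_pairs(html):
    """Pair each <dt> label with the <dd> value at the same position."""
    keys = re.findall(r'<dt[^>]*>(.*?)</dt>', html, flags=re.S)
    values = re.findall(r'<dd[^>]*>(.*?)</dd>', html, flags=re.S)
    info = {}
    for key, value in zip(keys, values):
        value = value.strip()
        # Keep only values that start with a digit; use '-' otherwise
        info[key.strip()] = value if re.match(r'\d', value) else '-'
    return info


print(extract_pairs(SAMPLE_HTML))
```

Separately, it may be worth confirming that the selectors match anything at all: in CSS a class is written with a leading dot, so `'stock_top clearfix'` is read as a `clearfix` element nested inside a `stock_top` element rather than as two class names.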

2. pipelines.py

class GuchengstocksInfoPipeline(object):
    # open_spider is the pipeline method called when a spider starts
    def open_spider(self, spider):
        self.f = open('GuchengStockInfo.txt', 'w')

    # close_spider is the pipeline method called when a spider closes
    def close_spider(self, spider):
        self.f.close()

    # process_item handles each item; it is the main body of the pipeline
    def process_item(self, item, spider):
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except:
            pass
        return item
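
Because process_item catches every exception and discards it, a failed write would leave the file empty without any message. Below is a hedged sketch of the same pipeline that logs failures instead. The class name `LoggingStockPipeline` and its `path` parameter are hypothetical, and the UTF-8 encoding is an assumption that suits non-ASCII stock names.

```python
import logging

logger = logging.getLogger(__name__)


class LoggingStockPipeline:
    """Hypothetical variant of the pipeline that reports write errors."""

    def __init__(self, path='GuchengStockInfo.txt'):
        self.path = path
        self.f = None

    def open_spider(self, spider):
        # An explicit encoding avoids platform-dependent defaults
        self.f = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        try:
            self.f.write(str(dict(item)) + '\n')
        except (OSError, UnicodeError, TypeError, ValueError):
            # Log the failure with its traceback instead of passing silently
            logger.exception('Could not write item: %r', item)
        return item
```

If nothing is logged and the file is still empty, the callback is probably not yielding any items; the `item_scraped_count` entry in the stats that Scrapy prints at the end of a crawl would confirm this.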

3. The configuration file (settings.py)

ITEM_PIPELINES = {
    'GuchengStocks.pipelines.GuchengstocksInfoPipeline': 300,
}