Home > other >  Scrapy an post browser can access data scrapy is empty
Scrapy an post browser can access data scrapy is empty

Time:09-25


Destination address: http://www.ccgp-gansu.gov.cn/web/article/128/0/index.htm
This web site to post submission, return HTML text, detailed code can see my
Want to crawl: the content of the project in the list of
Question: scrapy obtain the body there is no list of data ul li
Tried to solve, use cookiejar: True, there is still no data
Hope to have the ability to friend, can give a little hint, appreciate
Spiders source file
 
# - * - coding: utf-8 - * -
The import re

The import scrapy
The import scrapy_splash
The from demo. The items import DemoItem
The from datetime import datetime


The class GgzyfwSpider (scrapy. Spiders) :
Name='GSCCGP'
Allowed_domains=[' www.ccgp-gansu.gov.cn ']
Start_urls=[' http://www.ccgp-gansu.gov.cn/web/doSearchmxarticle.action ']
Url='http://www.ccgp-gansu.gov.cn/web/doSearchmxarticle.action'

Def get_form_data (self, page) :
Content={' articleSearchInfoVo. Releasestarttime ':',
'articleSearchInfoVo. Releaseendtime' : ' ',
'articleSearchInfoVo. Tflag' : '1',
'articleSearchInfoVo. Classname', '128',
'articleSearchInfoVo. Dtype' : '0',
'articleSearchInfoVo. Days' :' ',
'articleSearchInfoVo. Releasestarttimeold' : ' ',
'articleSearchInfoVo. Releaseendtimeold' : ' ',
'articleSearchInfoVo. The title' : ' ',
'articleSearchInfoVo. Agentname' : ' ',
'articleSearchInfoVo. Bidcode' : ' ',
'articleSearchInfoVo. Proj_name' : ' ',
'articleSearchInfoVo. Buyername' : ' ',
'total', '5402',
'limit' : '20',
'current' : STR (page),
'SJM' : '7466'}
Return content

Def start_requests (self) :
='post' yield scrapy_splash. SplashFormRequest (method, formdata=https://bbs.csdn.net/topics/self.get_form_data (1),
Url=self. Url, the callback=self. Parse)

Def parse (self, response) :
Tr_list=response. Xpath ("//ul/@ class='Expand_SearchSLisi']/li ")
If not tr_list:
Return
The else:
Pass

The current=self. Settings. The get (' CURRENT_DATA)
Domain='http://www.ccgp-gansu.gov.cn'
# header is the first tr
For li in tr_list:
Date_str=li. Xpath (" string (.//span [1]//text ()) "). The get (). The strip ()

# time of bid opening: | release date: 2020-03-12 20:41:01 | purchasing: kongdong district town people's government of pingliang city | agency: gansu Haitian construction engineering cost consulting co., LTD.
Date_arr=date_str. Split (' | ')
The date=date_arr [1]. The split (' : ') [1]. The strip ()
Buy_person=date_arr [2]. The split (' : ') [1]. The strip ()
Middle_name=date_arr [3]. The split (' : ') [1]. The strip ()
If the date:
# project_time
Date_time=datetime. Strptime (date, "% % Y - m - H: % d % % m: % S")

Now_time=datetime. Now ()

Diff_day=(now_time - date_time.) days

If diff_day & gt; Current:
# pictures so I can't stop the crawler to deal with because stops the image is also does not handle the
# the self. The crawler. Engine. Close_spider (self, 'date')
# print (' & gt;>> Date after ')
Return
The else:
# print (' & gt;>> You can continue to ')
Pass
The item=DemoItem ()
The item [' publish_date]=date
The item [' source_url]=self. Start_urls [0]
Item [' project_name]=li. Xpath (".//a//text () "). The get ()
A href=https://bbs.csdn.net/topics/li.xpath (.//a/@ href '). The get ()
Item [' url ']=domain + href

# waste standard/termination notice | kongdong, pingliang area | agricultural town people's government, forest, animal husbandry, fishery
Other_str=li. Xpath (" string (.//span/strong//text ()) "). The get (). The strip ()
Other_arr=other_str. Split (' | ')
The item [' status']=other_arr [0]. Strip ()
The item [' buy_area]=other_arr [1]. The strip ()
The item [' project_type_name]=other_arr [2]. The strip ()
The item [' buy_person]=buy_person
The item [' middle_name]=middle_name
Print (item)
# yield item

The depth=response. Meta. Get (' the depth, 0)
Page=the depth + 1
Url=domain + '/web/doSearchmxarticle. The action? Limit=20 & amp; Start='+ STR (page * 20)
Yield scrapy. Request (url=url, the callback=self. Parse)

CodePudding user response:

Capture part of the code, you run a requests only after parsing code, data is successful
 
# - * - coding: utf-8 - * -
The import re

The from datetime import datetime
The class GgzyfwSpider () :
Name='GSCCGP'
Allowed_domains=[' www.ccgp-gansu.gov.cn ']
Start_urls=[' http://www.ccgp-gansu.gov.cn/web/doSearchmxarticle.action ']
Url='http://www.ccgp-gansu.gov.cn/web/doSearchmxarticle.action'

Def get_form_data (self, page) :
Content={' articleSearchInfoVo. Releasestarttime ':',
'articleSearchInfoVo. Releaseendtime' : ' ',
'articleSearchInfoVo. Tflag' : '1',
'articleSearchInfoVo. Classname', '128',
'articleSearchInfoVo. Dtype' : '0',
'articleSearchInfoVo. Days' :' ',
'articleSearchInfoVo. Releasestarttimeold' : ' ',
'articleSearchInfoVo. Releaseendtimeold' : ' ',
'articleSearchInfoVo. The title' : ' ',
'articleSearchInfoVo. Agentname' : ' ',
'articleSearchInfoVo. Bidcode' : ' ',
'articleSearchInfoVo. Proj_name' : ' ',
'articleSearchInfoVo. Buyername' : ' ',
'total', '5402',
'limit' : '20',
'current' : STR (page),
'SJM' : '7466'}
Return content

Def start_requests (self) :
The import requests
Resp=requests. Post (url=self. Url, data=https://bbs.csdn.net/topics/self.get_form_data (1))
The self. The parse (resp)

Def parse (self, response) :
The import LXML
nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
  • Related