Crawl is incomplete

Time:09-18

The site: https://www.dytt8.net/html/gndy/dyzz/list_23_1.html

Today I wrote a crawler for the Dianying Tiantang ("Movie Paradise") site that collects each film's name and cover-photo URL.
Crawling it single-threaded was slow, so I tried threads; it is much faster now, but the crawl is incomplete.
I only crawl page 1 (25 films in total), yet it actually fetches fewer than 25 of them.
Sometimes it simply hangs, with no further output from the main process.
I thought my code was too fragile to parse certain film pages, so I kept count:
a film that was missed on the first run
was crawled fine on the second run.

I'm completely confused. Any help is appreciated, thanks.


 
import requests
import threading
from lxml import etree

BASE_DOMAIN = 'https://www.dytt8.net/'

HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'
}


def get_detail_urls(url):
    # fetch one list page and return the absolute URL of every film's detail page
    response = requests.get(url, headers=HEADERS).text
    html = etree.HTML(response)
    detail_urls = html.xpath("//table[@class='tbspan']//a/@href")
    detail_urls = map(lambda u: BASE_DOMAIN + u, detail_urls)

    return detail_urls
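To sanity-check that XPath without hitting the network, here is a small offline sketch; the miniature sample page and the helper name `extract_detail_urls` are my own, not from the site:

```python
from lxml import etree

# Hypothetical miniature of one dytt8 list page, just enough for the XPath.
SAMPLE_PAGE = """
<table class="tbspan">
  <tr><td><a href="/html/gndy/dyzz/20200505/60000.html">Movie A</a></td></tr>
  <tr><td><a href="/html/gndy/dyzz/20200505/60001.html">Movie B</a></td></tr>
</table>
"""

def extract_detail_urls(page_html, base="https://www.dytt8.net"):
    # Same XPath as get_detail_urls: every <a href> inside <table class="tbspan">.
    html = etree.HTML(page_html)
    return [base + href for href in html.xpath("//table[@class='tbspan']//a/@href")]
```

If this returns the expected two absolute URLs, the list-page parsing is not the part that loses films.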


def parse_detail_page(url):
    movie = {}  # collect this film's information in a dict
    response = requests.get(url, headers=HEADERS)
    text = response.content.decode('gbk', 'ignore')  # the site is GBK-encoded
    html = etree.HTML(text)
    title = html.xpath("//div[@class='title_all']//font[@color='#07519a']/text()")[0]
    movie['title'] = title
    zoomE = html.xpath("//div[@id='Zoom']")[0]
    imgs = zoomE.xpath(".//img/@src")
    cover = imgs[0]
    movie['cover'] = cover
    infos = zoomE.xpath(".//text()")

    def parse_info(info, rule):
        return info.replace(rule, "").strip()

    movie['url'] = url

    print(movie)
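One likely cause of the hang is that `requests.get()` is called without a `timeout`, so a single stalled connection blocks its thread forever and the main process never prints anything more. A minimal sketch of a fail-fast fetch helper; the name `fetch`, the injectable `getter`, and the retry/timeout values are my own choices, not part of the original code:

```python
def fetch(url, headers=None, retries=3, timeout=10, getter=None):
    """GET `url`, but give up instead of hanging forever.

    `getter` is injectable so the retry logic can be exercised offline;
    by default it is requests.get with an explicit timeout.
    """
    if getter is None:
        import requests  # imported lazily; only the default getter needs it
        getter = lambda u: requests.get(u, headers=headers, timeout=timeout)
    last_error = None
    for _ in range(retries):
        try:
            return getter(url)
        except Exception as exc:  # with requests, catch requests.RequestException
            last_error = exc
    raise last_error
```

Replacing the bare `requests.get(url, headers=HEADERS)` calls with something like this makes a dead connection raise after a bounded time instead of freezing the crawl.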


def spiders():
    threads = list()
    base_url = "https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html"

    for i in range(1, 2):  # only crawl the first list page for now
        url = base_url.format(i)
        movie = get_detail_urls(url)

        for url_1 in list(movie):
            threads.append(threading.Thread(target=parse_detail_page, args=(url_1,)))
            threads[-1].start()  # start the thread just appended, not threads[1]

    for t in threads:
        t.join()


if __name__ == "__main__":
    spiders()
    print('crawl complete')
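A common cause of a partial crawl like this is starting or joining the threads inconsistently (e.g. indexing a fixed element of `threads` instead of the one just appended). A sketch that starts and joins every thread exactly once, and collects results under a lock so the main thread can count them and detect an incomplete crawl; the helper name `crawl_all` and the return-a-result convention are my own assumptions:

```python
import threading

def crawl_all(urls, worker):
    """Run worker(url) in one thread per URL; return every result.

    Results are appended under a lock so the main thread can count
    them afterwards instead of guessing whether the crawl finished.
    """
    results = []
    lock = threading.Lock()

    def run(u):
        r = worker(u)
        with lock:
            results.append(r)

    threads = [threading.Thread(target=run, args=(u,)) for u in urls]
    for t in threads:
        t.start()  # start every thread exactly once
    for t in threads:
        t.join()
    return results
```

With `parse_detail_page` changed to return the movie dict instead of printing it, `len(crawl_all(detail_urls, parse_detail_page))` should come out to 25 for a fully crawled first page.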



CodePudding user response:

By the way, without threads the crawl is merely slow, yet it is still incomplete.