Crawler beginner: can an expert explain why my result contains only the header? (crawling Douban Top250)

Time:10-23

import requests
import bs4
import re

def open_url(url):
    # Send a browser-like User-Agent so the site does not reject the request.
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'}
    res = requests.get(url, headers=headers)
    return res

def find_movies(res):
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    # movie titles
    movies = []
    targets = soup.find_all("div", class_="hd")
    for each in targets:
        movies.append(each.a.span.text)

    # scores
    ranks = []
    targets = soup.find_all('span', class_='rating_num')
    for each in targets:
        ranks.append(' score: %s ' % each.text)
    # information
    messages = []
    targets = soup.find_all('div', class_='bd')
    for each in targets:
        try:
            messages.append(each.p.text.split('\n')[1].strip() + each.p.text.split('\n')[2].strip())
        except (AttributeError, IndexError):
            # Skip "bd" blocks that have no <p> or too few lines.
            continue

    result = []
    length = len(movies)
    for i in range(length):
        result.append(movies[i] + ranks[i] + messages[i] + '\n')
    return result

def main():
    host = 'https://movie.douban.com/top250'
    res = open_url(host)
    depth = 10
    result = []
    # Each page lists 25 movies, so page i starts at offset 25 * i.
    for i in range(depth):
        url = host + '?start=' + str(25 * i) + '&filter='
        res = open_url(url)
        result.extend(find_movies(res))

    # The with block closes the file automatically.
    with open('doubaner.txt', 'w', encoding='utf-8') as f:
        for each in result:
            f.write(each)

if __name__ == "__main__":
    main()

CodePudding user response:

Isn't that all you asked for? The code only collects the film title, score, and information, so the output is normal.

CodePudding user response:

Then why does my result contain only the header?

CodePudding user response:

I ran your code, and the resulting douban.txt does contain the data.

CodePudding user response:

I get no output at all after running it.

CodePudding user response:

Save your res.text to a file and check whether the content is what you expect.
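One way to follow this suggestion, as a minimal sketch: dump the response body to a file and count the title blocks before any parsing happens. The helper name `inspect_response` and the file name `page_dump.html` are hypothetical and not part of the original code; only the `class="hd"` marker comes from the question. The demonstration runs offline on a tiny stand-in string, not on real Douban markup.

```python
import re

def inspect_response(status_code, html, dump_path="page_dump.html"):
    """Save the raw HTML and report basic facts about it."""
    # Write the body to disk so it can be opened in a browser or editor.
    with open(dump_path, "w", encoding="utf-8") as f:
        f.write(html)
    # Count <div class="hd"> blocks, the containers find_movies() looks for.
    title_blocks = len(re.findall(r'<div\s+class="hd"', html))
    return {"status": status_code, "length": len(html), "title_blocks": title_blocks}

# Offline demonstration with a stand-in page (made-up sample data).
sample = '<div class="hd"><a href="#"><span>Movie A</span></a></div>'
report = inspect_response(200, sample)
print(report)
```

With the question's code, this could be called as inspect_response(res.status_code, res.text) right after open_url(url). A status other than 200, or zero title blocks, would suggest the site did not return the expected page, in which case find_movies() has nothing to extract.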

CodePudding user response:

Try the simplified-scrapy library; here is an example. You need to install it first: pip install simplified-scrapy

from simplified_scrapy.simplified_doc import SimplifiedDoc

def test(html):
    doc = SimplifiedDoc(html)
    # Each movie entry sits in a div with class "info".
    lst = doc.getElements('div', value='info')
    movies = []
    for l in lst:
        line = l.innerHtml
        title = doc.getElementByTag('a', line)
        obj = {}
        if (title):
            obj['href'] = title.href
            obj['title'] = title.text
        star = doc.getElementByClass('rating_num', line)
        if (star):
            obj['star'] = star.text
        info = doc.getElementsByTag('p', line)
        if (info):
            obj['info'] = ''
            for i in info:
                obj['info'] += i.text
        movies.append(obj)
    return movies

CodePudding user response:

I didn't run it, but it looks like you write the file in 'w' mode inside the for loop, so some of the data must be getting overwritten.
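The overwrite behaviour this reply describes can be reproduced offline. In this sketch the file names and sample rows are made up for illustration; it contrasts reopening a file with 'w' on every iteration against opening it once around the loop.

```python
import os
import tempfile

rows = ["movie 1\n", "movie 2\n", "movie 3\n"]  # made-up sample data
tmp_dir = tempfile.mkdtemp()

# Reopening with 'w' inside the loop truncates the file each time,
# so only the last row survives.
in_loop = os.path.join(tmp_dir, "in_loop.txt")
for row in rows:
    with open(in_loop, "w", encoding="utf-8") as f:
        f.write(row)

# Opening once outside the loop keeps every row.
once = os.path.join(tmp_dir, "once.txt")
with open(once, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(row)

with open(in_loop, encoding="utf-8") as f:
    in_loop_text = f.read()
with open(once, encoding="utf-8") as f:
    once_text = f.read()

print(repr(in_loop_text))  # 'movie 3\n'
print(repr(once_text))     # 'movie 1\nmovie 2\nmovie 3\n'
```

If the file really were reopened inside the loop, switching the mode to 'a' (append) or moving the open() call outside the loop would avoid the truncation.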
