Douban top250 movie name crawl

Time:10-15


I'm a complete beginner and just started reading an introductory crawler tutorial.
This is code I rewrote while following a tutorial on crawling the Douban top250 movie names. The code has a problem: starting with the third run it throws an error. Why do I say the third time? Because my first two runs succeeded and actually generated a 1.txt file containing the scraped movie names, but there were only about 25 names.


import requests
from bs4 import BeautifulSoup


def getData(url):
    res = requests.get(url)
    html = res.text
    return html

def parseData(html):
    soup = BeautifulSoup(html, 'lxml')
    res2 = soup.find('ol')
    movieList = res2.find_all('li')
    mList = []
    for movie in movieList:
        title = movie.find('span', class_='title').get_text()
        # print(title)
        mList.append(title)

    with open('1.txt', 'w') as f:
        for name in mList:
            f.write(name + '\n')

str1 = 'https://movie.douban.com/top250?start='
str2 = '&filter='
for num in range(20):
    num += 25
    url = str1 + str(num) + str2
    html = getData(url)
    parseData(html)






The error after running:
Traceback (most recent call last):
  File "D:/PY/PythonStudy/crawler learning/top250 full version.py", line 29, in <module>
    parseData(html)
  File "D:/PY/PythonStudy/crawler learning/top250 full version.py", line 12, in parseData
    movieList = res2.find_all('li')
AttributeError: 'NoneType' object has no attribute 'find_all'


Questions:
1. In the final for loop, if I write 1 inside range() instead of 20, it runs fine. Why? And what number should I write to get all 250 movie names?
2. Why did the first two runs succeed, but later runs start throwing errors?

CodePudding user response:

The last line of the error log indicates that the data you are parsing is empty, so the subsequent BeautifulSoup find_all operation raises "'NoneType' object has no attribute 'find_all'".

Possible causes of the empty result:
- 1. Your BeautifulSoup matching statement didn't find the right data; check whether the selector is correct and matches the page structure.
- 2. Douban's server detected that your request comes from a crawler and returned empty data. It is suggested to add headers and other parameters to the requests call, and to set a sleep interval between requests.
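A minimal sketch of the suggested fix, wrapping the asker's getData function with a User-Agent header and a sleep. The header value and the one-second interval are illustrative assumptions, not values from the original post:

```python
import time
import requests

# Illustrative browser-like header; any common User-Agent string works.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

def getData(url):
    # Send the header so the request looks like an ordinary browser visit.
    res = requests.get(url, headers=HEADERS)
    # Pause between requests so the server is less likely to block you.
    time.sleep(1)
    return res.text
```

With this in place, repeated requests in the final for loop are spaced out instead of firing as fast as possible.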

Another small issue:
with open('1.txt', 'w') as f:
- Suggested change: with open('1.txt', 'w', encoding='utf-8') as f: to specify the encoding; otherwise the saved 1.txt will very likely contain garbled characters.

Some further advice:
- Configure more parameters on the requests call, at least a User-Agent header, so the request at least pretends not to be a crawler; search for the keywords "requests headers" for details.
- Later on, the code can be refactored into a class, which is tidier and easier to manage (an improvement to make as your learning progresses).
- BeautifulSoup's parsing performance is relatively low; consider switching to lxml and matching with XPath instead (an improvement to make as your learning progresses).
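To illustrate the lxml/XPath suggestion, here is a sketch of the same title extraction done with lxml instead of BeautifulSoup. The HTML fragment is a made-up stand-in for a Douban page, not real scraped data:

```python
from lxml import etree

# Tiny illustrative fragment mimicking the Douban list structure.
html = '''
<ol class="grid_view">
  <li><span class="title">The Shawshank Redemption</span></li>
  <li><span class="title">Farewell My Concubine</span></li>
</ol>
'''

tree = etree.HTML(html)
# XPath equivalent of soup.find('ol'), res2.find_all('li'),
# and movie.find('span', class_='title').get_text() in one expression.
titles = tree.xpath('//ol/li/span[@class="title"]/text()')
print(titles)  # ['The Shawshank Redemption', 'Farewell My Concubine']
```

One XPath expression replaces the three chained BeautifulSoup calls, and lxml's C-based parser is noticeably faster on large pages.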