Beginner question: scraping data

Time:09-23

Why is the data not being scraped?
The program runs without raising an error, so I can't tell which part is wrong, or whether the data simply isn't being captured.
# coding: utf-8
import re
import random
import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook

wb = Workbook()
dest_filename = 'movie.xlsx'
ws1 = wb.active
ws1.title = 'movie top250'
DOWNLOAD_URL = 'https://movie.douban.com/top250'
def download_page(url):
    # Send a browser-like User-Agent so the request is not rejected outright.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'
    }

    data = requests.get(url, headers=headers).content
    return data

def get_data(doc):
    soup = BeautifulSoup(doc, 'lxml')
    ol = soup.find('ol', class_='grid_view')
    name = []
    star_con = []
    score = []
    info_list = []

    for i in ol.find_all('li'):
        detail = i.find('div', class_='hd')
        movie_name = detail.find('span', class_='title').get_text()
        level_star = i.find('span', class_='rating_num').get_text()
        star = i.find('div', class_='star')
        # Match the rating-count text by its content. The keyword must be
        # `string=` (or the older `text=`), not `test=`. The pattern '评价'
        # is assumed to be the original literal behind the translated 'evaluation'.
        star_num = star.find(string=re.compile('评价'))
        info = i.find('span', class_='inq')
        if info:
            info_list.append(info.get_text())  # was get_test(), which is not a bs4 method
        else:
            info_list.append('no')
        score.append(level_star)
        name.append(movie_name)
        star_con.append(star_num)
    # Look up the "next page" link once per page, after the loop.
    page = soup.find('span', class_='next').find('a')
    print(i)  # debug output: the last <li> processed on this page
    if page:
        return name, star_con, score, info_list, DOWNLOAD_URL + page['href']

    return name, star_con, score, info_list, None


def main():
    url = DOWNLOAD_URL
    name = []
    star_con = []
    score = []
    info = []
    while url:
        doc = download_page(url)
        movie, star, level_num, info_list, url = get_data(doc)
        name = name + movie
        star_con = star_con + star
        score = score + level_num
        # Append in page order like the other lists; `info_list + info`
        # would prepend each page and misalign the quotes with the titles.
        info = info + info_list

    # enumerate gives the sheet row directly; name.index(i) returns the
    # first match, so a repeated title would overwrite an earlier row.
    for row, (i, m, o, p) in enumerate(zip(name, star_con, score, info), start=1):
        print(i, m, o, p)
        col_A = 'A%s' % row
        col_B = 'B%s' % row
        col_C = 'C%s' % row
        col_D = 'D%s' % row

        ws1[col_A] = i
        ws1[col_B] = m
        ws1[col_C] = o
        ws1[col_D] = p
    wb.save(filename=dest_filename)

if __name__ == "__main__":
    main()
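
One thing that may be worth checking in main() is how the per-page lists are merged and how the sheet row is computed. The sketch below is not the scraper itself: it uses made-up page data (the titles and quotes are placeholders, not real results) and a hypothetical helper, rows_for_sheet, to show that extending every list in the same page order keeps the columns aligned, and that enumerate yields a distinct row number even when a title repeats, whereas list.index would return the first match.

```python
# Hypothetical per-page results standing in for what get_data() returns.
pages = [
    (["Movie A", "Movie B"], ["quote A", "quote B"]),
    (["Movie C", "Movie A"], ["quote C", "second quote A"]),
]


def rows_for_sheet(pages):
    """Merge pages in order and return (cell, title, quote) tuples."""
    names, quotes = [], []
    for page_names, page_quotes in pages:
        names.extend(page_names)    # always append, in page order
        quotes.extend(page_quotes)  # same order, so the columns stay aligned
    rows = []
    # Row numbers come from the position in the merged list, starting at 1.
    for row, (title, quote) in enumerate(zip(names, quotes), start=1):
        rows.append(("A%d" % row, title, quote))
    return rows


rows = rows_for_sheet(pages)
for cell, title, quote in rows:
    print(cell, title, quote)
```

With name.index(title) + 1 the second "Movie A" would be written to row 1 again; with enumerate it gets its own row.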