Python crawler beginner question, asking the experts for help

Time:03-22

Following an expert's example, I am crawling a site whose first page is */6444.html, second page is */6444_2.html, and third page is */6444_3.html.
The crawl rule from the second page onward is easy to derive, but how do I add the first page? How should the code below be modified?
 

"""
@description: learning python3
@author: CHZ
@datetime: 2021-03-21 15:19:27
"""
# import the requests module
import requests
# import the BeautifulSoup module
from bs4 import BeautifulSoup
# import the urllib.request module (provides urlretrieve)
import urllib.request


# running count of downloaded images, shared across pages
x = 0
def getKunvImg(page=2):
    global x
    # address of the page that holds the pictures
    urlKunv = 'https://*/2019/6444_{}.html'.format(page)
    # add headers so the crawler presents itself as a browser
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'}
    # send the network request and get the returned HTML
    res = requests.get(urlKunv, headers=headers)
    # set the encoding to utf-8
    res.encoding = "utf-8"
    # parse the HTML
    soup = BeautifulSoup(res.text, 'html.parser')
    # select the tags by their class
    for new in soup.select('.content'):
        # only handle blocks that contain at least one <a> tag
        if len(new.select('a')) > 0:
            # take the src path of the first img tag in the block
            imgsrc = new.select("img")[0]['src']
            # print the image address
            # print(new.select('img')[0]['src'])
            # save the image into the images folder under the project directory
            urllib.request.urlretrieve(imgsrc, './images/%s.jpg' % x)
            x += 1
            # report which image is being downloaded
            print('Downloading image %d' % x)


for i in range(2, 48):
    # report which page is being downloaded
    print('Downloading page {}'.format(i))
    getKunvImg(i)
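
One possible way to bring the first page into the same loop is to build the address in a small helper that leaves out the `_N` suffix when the page number is 1. This is only a sketch: `page_url` and `BASE` are hypothetical names that are not in the code above, and `example.com` stands in for the host that is elided in the question.

```python
def page_url(base, page):
    """Return the address of one page of the gallery.

    Page 1 has no suffix (base + '.html'); later pages use '_<n>'.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page == 1:
        return '{}.html'.format(base)
    return '{}_{}.html'.format(base, page)


# Hypothetical base address; the real host is elided in the question.
BASE = 'https://example.com/2019/6444'

# Starting the range at 1 covers the first page as well.
urls = [page_url(BASE, p) for p in range(1, 4)]
for u in urls:
    print(u)
```

With a helper like this, the `urlKunv` line in `getKunvImg` could call `page_url(...)` with the real base address, and the loop at the bottom could start at `range(1, 48)` instead of `range(2, 48)`.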



