For help! Python crawl web structure found after the head

Time:10-19

I'm a rookie, two months into self-study... Yesterday I tried using urllib to batch-download a site's images and got HTTPError: Forbidden, so I wanted to add request headers to fix it. But after constructing the headers I get "'module' object is not callable". Please help!

First attempt:

import re
import os
from urllib import request

# 5. Page-request function
def get_html(url):
    page = request.urlopen(url)
    html = page.read.decode("utf-8")
    return html

# 6. Function to extract the url list with a regex
def get_url_list(html):
    pattern = r'"auto" src="https://bbs.csdn.net/topics/(.+?\.jpeg)"'
    img_list = re.findall(pattern, html)
    return img_list

# 7. Image-download function
def downloadings(url_list):
    # a. create the save path
    if not os.path.exsits("python_pictures"):
        os.mkdir("python_pictures")
    # b. download the images one by one
    x = 1  # serial number of the image, also used as its file name
    print("Starting download of {count} images".format(count=len(url_list)))

    for img_url in url_list:
        print("Image {count}".format(count=x))
        request.urlretrieve(img_url, os.path.join("python_pictures", "{num}.jpg".format(num=x)))
        x += 1
    print("Download complete")


if __name__ == "__main__":
    # 1. Set the URL
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'}
    url = "http://www.360doc.com/content/18/0415/17/1127866_745873284.shtml"

    # 2. Request the page
    html = get_html(url)
    # 3. Extract the image urls with the regex to build the download list
    url_list = get_url_list(html)
    # 4. Download the images from the download list
    downloadings(url_list)
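The Forbidden error from this first attempt usually means the server rejects urllib's default User-Agent. A minimal sketch of how a browser-like header can be attached with `urllib.request.Request` (the URL and header values are taken from the post; no request is actually sent here):

```python
from urllib import request

# Sketch: wrap the URL in a Request object; passing it to urlopen() would
# send the supplied User-Agent instead of the default "Python-urllib/x.y".
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/78.0.3904.87 Safari/537.36'}
req = request.Request(
    "http://www.360doc.com/content/18/0415/17/1127866_745873284.shtml",
    headers=headers,
)

# The header is stored on the Request object (urllib capitalizes key names)
print(req.get_header("User-agent"))
```

Calling `request.urlopen(req)` then sends the header; a plain string URL sends only the defaults.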

Second attempt:

import re
import os
from urllib import request

# 5. Page-request function
def get_html(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'}
    page1 = request(url, headers=headers)
    page = request.urlopen(page1)
    html = page.read.decode("utf-8")
    return html

# 6. Function to extract the url list with a regex
def get_url_list(html):
    pattern = r'"auto" src="https://bbs.csdn.net/topics/(.+?\.jpeg)"'
    img_list = re.findall(pattern, html)
    return img_list

# 7. Image-download function
def downloadings(url_list):
    # a. create the save path
    if not os.path.exsits("python_pictures"):
        os.mkdir("python_pictures")
    # b. download the images one by one
    x = 1  # serial number of the image, also used as its file name
    print("Starting download of {count} images".format(count=len(url_list)))

    for img_url in url_list:
        print("Image {count}".format(count=x))
        request.urlretrieve(img_url, os.path.join("python_pictures", "{num}.jpg".format(num=x)))
        x += 1
    print("Download complete")


if __name__ == "__main__":
    # 1. Set the URL
    url = "http://www.360doc.com/content/18/0415/17/1127866_745873284.shtml"
    # 2. Request the page
    html = get_html(url)
    # 3. Extract the image urls with the regex to build the download list
    url_list = get_url_list(html)
    # 4. Download the images from the download list
    downloadings(url_list)

CodePudding user response:

I noticed that you call a function without parentheses. Without the parentheses it is just a function object, not a call — shouldn't you add them?
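Building on that, a corrected sketch of `get_html` fixing both problems: `request` is a module, so `request(url, headers=headers)` raises "'module' object is not callable" and should be `request.Request(...)`; and `read` is a method, so it needs calling parentheses before `.decode()`:

```python
from urllib import request

def get_html(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/78.0.3904.87 Safari/537.36'}
    # request is the module; the class that carries headers is request.Request
    req = request.Request(url, headers=headers)
    page = request.urlopen(req)
    # read is a method: call it with () to get bytes, then decode them
    html = page.read().decode("utf-8")
    return html
```

(Also note `os.path.exsits` in the download function should be `os.path.exists`.)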