The basic flow of a crawler
What the Request and the Response are
What the Request contains
What the Response contains
How to save the data
1. What is a crawler?
(1) Simply put: a program that automatically requests websites and extracts data
(2) In practice: right-click -> Inspect Element. To extract text and link information from the HTML code, use a parsing library to parse it and store the result as structured data.
2. The basic flow of the crawler
(1) robots.txt: append /robots.txt to the site's root URL to see which parts of the site should not be crawled
(2) Send the request: use an HTTP library to send a Request to the target site. The request may carry additional information such as headers. Then wait for the server to respond.
(3) Get the response content: if the server responds normally, you get a Response whose content is the page you asked for. The content may be HTML, a JSON string, binary data, and so on.
(4) Parse the content: if it is HTML, parse the page with regular expressions or a parsing library; if it is JSON, convert it directly into a JSON object; if it is binary data, save it or process it further.
(5) Save the data: it can be saved in many forms, such as plain text, a database, or a file in a specific format
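Steps (1) and (4) above can be sketched without touching the network. This is only a minimal illustration using the standard library: the helper names `allowed_by_robots` and `parse_body`, the sample robots.txt text, and the sample HTML are all made up for the example, and a real crawler would take them from actual responses.

```python
import json
import re
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Step (1): return True if the robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def parse_body(content_type, body):
    """Step (4): choose a parsing strategy from the Content-Type header."""
    if "json" in content_type:
        # JSON text converts directly into Python objects.
        return json.loads(body.decode("utf-8"))
    if "html" in content_type:
        # A regular expression is enough for a sketch; a parsing
        # library is more robust on real pages.
        return re.findall(r'href="([^"]+)"', body.decode("utf-8"))
    # Anything else is treated as binary and passed through to be saved.
    return body


# Hypothetical sample data standing in for real server responses.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/"
SAMPLE_HTML = b'<a href="/a.html">A</a> <a href="/b.html">B</a>'

if __name__ == "__main__":
    print(allowed_by_robots(SAMPLE_ROBOTS, "http://example.com/private/x"))
    print(parse_body("text/html; charset=utf-8", SAMPLE_HTML))
```

Dispatching on the Content-Type header is one simple choice here; checking the URL suffix or sniffing the first bytes of the body would be alternatives.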
3. What are the Request and the Response?
(1) Request method: mainly GET and POST, plus HEAD, PUT, DELETE, OPTIONS, and so on
(2) Request URL: URL stands for Uniform Resource Locator; a web page, an image, or a video can each be uniquely identified by its URL
(3) Request headers: the header information sent with the request, such as User-Agent, Host, and Cookies
(4) Request body: additional data carried by the request, such as the form data sent when a form is submitted (e.g. Form Data)
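The four parts of a request can be inspected without sending anything. The sketch below uses the standard library's `urllib.request.Request` rather than the `requests` package so that it runs offline; the URL, the header values, and the form field `wd` are placeholders chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Request body: form fields encoded the way a browser submits a form.
form_data = urlencode({"wd": "python"}).encode("ascii")

# Build (but do not send) a POST request to a placeholder URL.
request = Request(
    "http://www.example.com/search",
    data=form_data,
    headers={"User-Agent": "Mozilla/5.0", "Cookie": "session=demo"},
    method="POST",
)

print(request.get_method())    # request method
print(request.full_url)        # request URL
print(request.header_items())  # request headers
print(request.data)            # request body
```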
4. What does the Response contain?
(1) Response status: there are many status codes, for example 200 means success, 301 is a redirect, 404 means the page was not found, and 502 is a server error
(2) Response headers: such as the content type, content length, server information, and Set-Cookie
(3) Response body: the main part, containing the content of the requested resource, such as the HTML of a web page or the binary data of an image
import requests
response = requests.get('http://www.baidu.com')
print(response.text)  # response body
print(response.headers)  # response headers
print(response.status_code)  # response status code
headers = {}  # define the header information
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}  # this header describes the local computer's browser
response1 = requests.get('http://www.baidu.com', headers=headers)  # request Baidu with the new headers
print(response1.status_code)  # check the response status code
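To make sense of a status code once you have it, the code can be mapped to a category and its standard reason phrase. This is a small sketch using `http.HTTPStatus` from the standard library; the helper name `describe_status` and the category labels are made up for the example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (category, reason phrase) for a standard HTTP status code."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif code >= 500:
        category = "server error"
    else:
        category = "informational"
    return category, phrase


if __name__ == "__main__":
    for code in (200, 301, 404, 502):
        print(code, describe_status(code))
```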
5. What data can be crawled?
(1) Web text (such as HTML documents, JSON-formatted text, etc.)
(2) Images (fetched as binary data and saved in an image format)
response2 = requests.get('https://www.baidu.com/img/bd_logo1.png')  # fetch the image
print(response2.content)  # printed as binary data
with open('F:/dd.png', 'wb') as f:  # create a file in image format
    f.write(response2.content)  # write the binary data to the file
(3) Video (also binary data; save it in an audio/video format, fetched the same way as images)
6. Ways to parse the data
(1) Direct processing
(2) JSON parsing
(3) Regular expressions
(4) BeautifulSoup
(5) PyQuery
(6) XPath
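Three of these approaches can be tried with the standard library alone. The sketch below is only an illustration: the sample JSON and HTML strings are made up, and `xml.etree.ElementTree` stands in for a real XPath engine. It supports only a subset of XPath and needs well-formed markup, so real-world HTML usually calls for BeautifulSoup, PyQuery, or lxml instead.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical sample inputs standing in for downloaded content.
JSON_TEXT = '{"title": "demo", "tags": ["a", "b"]}'
HTML_TEXT = '<html><body><a href="/one">One</a><a href="/two">Two</a></body></html>'

# JSON parsing: the text becomes ordinary dicts and lists.
data = json.loads(JSON_TEXT)

# Regular expression: capture the href value of each link.
links_by_regex = re.findall(r'<a href="([^"]+)">', HTML_TEXT)

# XPath-style query: select every <a> element anywhere in the tree.
root = ET.fromstring(HTML_TEXT)
link_texts = [a.text for a in root.findall(".//a")]

print(data["tags"])
print(links_by_regex)
print(link_texts)
```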
7. How to save the data?
(1) Text: plain text, JSON, XML, etc.
(2) Relational databases, such as MySQL, Oracle, and SQL Server, which store data in structured tables
(3) Non-relational databases, such as MongoDB and Redis, which store data as key-value pairs
(4) Binary files: images, video, audio, and so on can be saved directly in their specific formats
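A minimal sketch of options (1), (2), and (4) follows. It assumes nothing beyond the standard library, so `sqlite3` stands in for a relational database such as MySQL; the helper names, the `pages` table, and the sample record are invented for the example.

```python
import json
import os
import sqlite3
import tempfile

# Hypothetical crawl result: one record per page.
records = [{"title": "Baidu", "url": "http://www.baidu.com"}]


def save_json(path, rows):
    """(1) Text: write the records as a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False)


def save_sqlite(conn, rows):
    """(2) Relational storage: one table row per record."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (title TEXT, url TEXT)")
    conn.executemany("INSERT INTO pages (title, url) VALUES (:title, :url)", rows)
    conn.commit()


def save_binary(path, data):
    """(4) Binary: write the raw bytes unchanged."""
    with open(path, "wb") as f:
        f.write(data)


if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    save_json(os.path.join(folder, "pages.json"), records)
    conn = sqlite3.connect(":memory:")
    save_sqlite(conn, records)
    print(conn.execute("SELECT title, url FROM pages").fetchall())
    save_binary(os.path.join(folder, "logo.bin"), b"\x89PNG")
```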