Note 1-CodePudding

What is a crawler
The basic flow of a crawler
What are the Request and the Response
What the Request contains
What the Response contains
How to save the data

1. What is a crawler?

(1) In short: a crawler is an automated program that requests websites and extracts data from them.

(2) In practice: right-click -> Inspect Element in the browser to see the page's HTML. To extract information such as text and links from that HTML code, parse it with a parsing library and store the result as structured data.
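As a rough illustration of turning HTML into structured data, here is a minimal sketch that uses only the standard library's html.parser. The LinkExtractor class and the SAMPLE_HTML page are made up for this example; a real crawler would feed in the text of a downloaded page, and a dedicated parsing library is usually more convenient.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the text and href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []        # structured result: a list of dicts
        self._href = None      # href of the <a> tag currently open, if any
        self._text_parts = []  # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append({"text": text, "href": self._href})
            self._href = None


# Hypothetical sample page, used here instead of a live request.
SAMPLE_HTML = """
<html><body>
  <a href="/news">News</a>
  <a href="https://example.com/about">About <b>us</b></a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(SAMPLE_HTML)
print(parser.links)
```

Each link ends up as a dict with "text" and "href" keys, which is easy to write to a file or a database later.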

2. The basic flow of a crawler

(1) robots.txt: append /robots.txt to the site's root URL to see which paths the site does not want crawled.

(2) Send the request: use an HTTP library to send a Request to the target site. The request may carry additional information such as headers. Then wait for the server to respond.

(3) Get the response content: if the server responds normally, you get a Response whose content is the page you asked for. The content may be HTML, a JSON string, binary data, and so on.

(4) Parse the content: HTML can be parsed with regular expressions or a web page parsing library; JSON can be converted directly into a JSON object; binary data can be saved or processed further.

(5) Save the data: it can be saved in many forms, such as plain text, a database, or a file in a specific format.
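Step (1) can be sketched with the standard library's urllib.robotparser. To keep the example offline, the rules are parsed from a made-up SAMPLE_ROBOTS_TXT string; the may_crawl helper and the "my-crawler" user agent are also assumptions for illustration. Against a real site you would typically point the parser at the site's robots.txt with set_url() and read() instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the site's root, e.g. http://example.com/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(SAMPLE_ROBOTS_TXT.splitlines())


def may_crawl(url, user_agent="my-crawler"):
    """Return True if the parsed robots.txt rules allow fetching this URL."""
    return rules.can_fetch(user_agent, url)


print(may_crawl("http://example.com/index.html"))
print(may_crawl("http://example.com/private/data.html"))
```

Checking may_crawl() before every request is a simple way to skip the paths a site has asked crawlers to avoid.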

3. What are the Request and the Response?

(1) Request method: mainly GET and POST, plus HEAD, PUT, DELETE, OPTIONS, and others.

(2) Request URL: URL stands for Uniform Resource Locator. A web page, an image, or a video can each be uniquely identified by a URL.

(3) Request headers: the header information sent with the request, such as User-Agent, Host, and Cookies.

(4) Request body: additional data carried by the request, such as the form data sent when a form is submitted (e.g. Form Data).
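The four parts above can be seen together in one object. This is a minimal sketch that builds a request with the standard library's urllib.request instead of the requests package, so nothing is actually sent. The login URL and the form fields are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields, encoded the way a browser encodes a submitted form.
form_data = {"username": "alice", "password": "secret"}
body = urlencode(form_data).encode("utf-8")

request = Request(
    url="http://example.com/login",   # request URL (hypothetical)
    data=body,                        # request body
    headers={                         # request headers
        "User-Agent": "Mozilla/5.0",
        "Content-Type": "application/x-www-form-urlencoded",
    },
    method="POST",                    # request method
)

print(request.get_method())    # the method
print(request.full_url)        # the URL
print(request.header_items())  # the headers
print(request.data)            # the body as bytes
```

Note that urllib normalizes header names when it stores them, so "User-Agent" is kept as "User-agent".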

4. What does the Response contain?

(1) Response status: there are many status codes. For example, 200 means success, 301 is a redirect, 404 means the page was not found, and 502 is a server error.

(2) Response headers: such as the content type, content length, server information, and Set-Cookie.

(3) Response body: the main part, containing the content of the requested resource, such as the HTML of a web page or the binary data of an image.
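To make the status codes in (1) concrete, here is a small sketch that looks up the standard reason phrase with http.HTTPStatus and groups the code by its range. The describe_status helper is made up for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code, its standard reason phrase, and a rough category."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif code >= 500:
        category = "server error"
    else:
        category = "informational"
    return f"{code} {phrase} ({category})"


for code in (200, 301, 404, 502):
    print(describe_status(code))
```

A crawler would normally compare response.status_code against these ranges before trying to parse the body.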

import requests

response = requests.get('http://www.baidu.com')
print(response.text)         # response body
print(response.headers)      # response headers
print(response.status_code)  # response status code

# Define the request headers; the User-Agent describes the browser on this computer
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
response1 = requests.get('http://www.baidu.com', headers=headers)  # request Baidu with the new headers
print(response1.status_code)  # check the response status code

5. What data can be scraped?

(1) Web page text (HTML documents, JSON text, etc.)

(2) Images (fetched as binary data and saved in an image format)

response2 = requests.get('https://www.baidu.com/img/bd_logo1.png')  # fetch the image
print(response2.content)  # prints the raw binary data
with open('F:/dd.png', 'wb') as f:  # create a file for the image
    f.write(response2.content)      # write the image bytes to the file

(3) Video (also binary data; fetch it the same way as an image and save it in a video or audio format)

6. Parsing methods

(1) Direct processing

(2) JSON parsing

(3) Regular expressions

(4) BeautifulSoup

(5) PyQuery

(6) XPath
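Three of the methods above can be tried with the standard library alone. This is only a sketch: the JSON_TEXT and HTML_TEXT samples are made up, and xml.etree.ElementTree stands in for a real XPath library. It supports only a limited XPath subset and needs well-formed markup, so real-world HTML usually calls for BeautifulSoup, PyQuery, or a full XPath implementation.

```python
import json
import re
import xml.etree.ElementTree as ET

# JSON parsing: convert the text directly into Python objects.
JSON_TEXT = '{"title": "Crawler notes", "tags": ["python", "http"]}'
data = json.loads(JSON_TEXT)
title = data["title"]

# Regular expression: pull every href value out of the markup.
HTML_TEXT = (
    '<ul><li><a href="/a">First</a></li>'
    '<li><a href="/b">Second</a></li></ul>'
)
hrefs = re.findall(r'href="([^"]+)"', HTML_TEXT)

# XPath-style query: select the <a> element inside each <li>.
root = ET.fromstring(HTML_TEXT)
link_texts = [a.text for a in root.findall(".//li/a")]

print(title)
print(hrefs)
print(link_texts)
```

Regular expressions are quick for simple patterns, while a tree-based query is usually more robust when the page layout changes slightly.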

7. How to save the data?

(1) Text: plain text, JSON, XML, etc.

(2) Relational databases: MySQL, Oracle, SQL Server, etc., which store data in structured tables.

(3) Non-relational databases: MongoDB, Redis, etc., which store data in forms such as key-value pairs.

(4) Binary files: images, video, audio, and so on can be saved directly in their specific formats.
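Options (1) and (2) can be sketched as follows. Both the records list and the pages table are made up for this example, and SQLite (from the standard library) is used as a stand-in for MySQL or another relational database so the sketch runs without a server.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical scraped records.
records = [
    {"title": "News", "url": "http://example.com/news"},
    {"title": "About", "url": "http://example.com/about"},
]

# (1) Text: write the records to a JSON file in a temporary directory.
out_dir = Path(tempfile.mkdtemp())
json_path = out_dir / "records.json"
json_path.write_text(
    json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
)

# (2) Relational database: store the same records in a structured table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO pages (title, url) VALUES (:title, :url)", records
)
conn.commit()

saved_rows = conn.execute(
    "SELECT title, url FROM pages ORDER BY title"
).fetchall()
print(json_path)
print(saved_rows)
```

The named placeholders (:title, :url) let each dict be inserted directly and avoid building SQL strings by hand.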

CodePudding user response:

Talking about crawlers in the Delphi language section, isn't that a bit ridiculous? This is not one of Delphi's strengths, and Delphi has no major strengths anyway; in both efficiency and ease of development it cannot compete.

CodePudding user response:

Quoting fohoo's reply on the 6th floor:

Talking about crawlers in the Delphi language section, isn't that a bit ridiculous? This is not one of Delphi's strengths, and Delphi has no major strengths anyway; in both efficiency and ease of development it cannot compete.

So much resentment ~~~