Mainly notes on the basic operation flow, plus the difference between crawling static and dynamic web pages.
Some commonly used commands and code for a Scrapy beginner, along with the overall workflow.
Create a Scrapy project: scrapy startproject <project name>
Create a spider: scrapy genspider <spider name> <domain>
Run the spider: scrapy crawl <spider name>
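For example, assuming a project called myproject and a spider called myspider crawling example.com (both names are placeholders), the commands would be run in this order:

    scrapy startproject myproject
    cd myproject
    scrapy genspider myspider example.com
    scrapy crawl myspider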
The main files to write in a Scrapy project, and what each one does:
items.py:
The file where the fields of the item object are defined. An item can be seen as a dictionary (below I call it the item dictionary for convenience) and behaves very much like one; each field is created with field_name = scrapy.Field().
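For example, a minimal items.py with two hypothetical fields, name and price, might look like this:

    import scrapy

    class MyItem(scrapy.Item):
        # each field is declared with scrapy.Field();
        # the item is later filled in and read like a dict
        name = scrapy.Field()
        price = scrapy.Field()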
spiders/<spider name>.py:
The spider itself; the crawling logic is written in the parse function.
This is where static and dynamic pages differ:
Static pages:
For a static page, right-clicking the page and viewing the source code directly shows the data you want to crawl. For this kind of page I mainly use XPath to extract the data; working out the XPath expressions can be done with the XPath Helper extension for Chrome, and a for loop is then used to collect the data and store it in the item dictionary. Note that text obtained with XPath usually needs a further call to the .extract() method to turn it into utf-8 encoded text, and that XPath always returns a list, so remember to use a list index when adding the value to the item.
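A minimal sketch of such a static-page spider, reusing the item class from items.py above; the start URL and the XPath expressions are placeholders and depend on the real page structure:

    import scrapy
    from ..items import MyItem  # the item class sketched earlier

    class BooksSpider(scrapy.Spider):
        name = "books"
        start_urls = ["http://example.com/list"]  # placeholder static page

        def parse(self, response):
            # loop over the repeated nodes on the page
            for node in response.xpath('//div[@class="row"]'):
                item = MyItem()
                # .extract() returns a list of strings, so take element [0]
                item["name"] = node.xpath('./h2/text()').extract()[0]
                item["price"] = node.xpath('./span[@class="price"]/text()').extract()[0]
                yield item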
Dynamic pages:
For a dynamic page, viewing the page source does not directly show the data. You can also judge whether a page is static or dynamic by looking at response.body.decode(). For a dynamic page the data cannot be obtained through XPath; the correct steps are: open the developer tools, select the Network tab and filter by XHR; the URL of the XHR request is the real URL to use for extracting the data. If you open that URL you will find it is a JSON file, so in the spider you load (deserialize) the JSON, then filter layer by layer through the dictionary keys, picking out the elements you want and adding them to the item dictionary.
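A minimal sketch for the dynamic case, assuming a hypothetical XHR URL found in the Network panel and placeholder JSON keys (result, list, name, price):

    import json
    import scrapy
    from ..items import MyItem

    class ApiSpider(scrapy.Spider):
        name = "api"
        # placeholder for the real XHR request URL found under Network -> XHR
        start_urls = ["http://example.com/api/list?page=1"]

        def parse(self, response):
            # deserialize the JSON body, then filter through the keys layer by layer
            data = json.loads(response.body.decode())
            for entry in data["result"]["list"]:
                item = MyItem()
                item["name"] = entry["name"]
                item["price"] = entry["price"]
                yield item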
In both cases, after processing, yield the item to hand it to the engine; the engine checks whether the data is of item type and then passes the item to pipelines.py for processing.
pipelines.py:
Used to process the items handed over by the engine. The operations are similar to working with an ordinary dictionary: converting the data to JSON or CSV, writing it to a file, and so on.
settings.py:
Remember to enable the ITEM_PIPELINES option and to turn off the ROBOTSTXT_OBEY option; I have not learned the other options yet, so I am not recording them.
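A sketch of those two settings, assuming the project name myproject (the pipeline path and the priority 300 are what Scrapy generates by default):

    # settings.py
    ROBOTSTXT_OBEY = False   # turn the robots.txt check off

    ITEM_PIPELINES = {
        # enable the pipeline; replace with your real project/pipeline name
        "myproject.pipelines.MyprojectPipeline": 300,
    }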
The data files are saved in the folder from which you run the crawler; the output file can also be opened in advance in the pipeline's init function.
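A minimal sketch of such a pipeline that writes each item as one JSON line; the class name and output filename are placeholders, and the file is opened in open_spider (it could equally be opened in __init__, as mentioned above):

    import json

    class MyprojectPipeline:
        def open_spider(self, spider):
            # open the output file once when the spider starts
            self.file = open("items.json", "w", encoding="utf-8")

        def process_item(self, item, spider):
            # the item behaves like a dict, so it can be dumped as one JSON line
            line = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.file.write(line)
            return item

        def close_spider(self, spider):
            self.file.close()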
This record is only for my own learning, so that I do not forget things too quickly; it may also contain mistakes, so use it for reference only.