Crawling Beike (ke.com) rental listings: different pages return the same results, how to solve? - CodePudding

While querying the site, I found that the URLs are built roughly like this: https://{0}.zu.ke.com/zufang/pg{1}

Each listing has a unique code, and that code gives access to the link for the listing's own page.

So I set up a MySQL table with two columns: 1. an ID as the primary key, 2. the unique listing code.
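A minimal sketch of such a two-column table, with the uniqueness enforced by the database so repeated codes are skipped at insert time. It uses Python's built-in sqlite3 instead of MySQL so it runs without a server; the table name `listing` and column name `house_code` are made up for illustration. In MySQL the equivalent statement would be `INSERT IGNORE` against a column with a `UNIQUE` index.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the two-column table: an auto-increment ID and a unique code."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listing ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " house_code TEXT NOT NULL UNIQUE)"  # hypothetical column name
    )
    return conn


def count_rows(conn):
    """Return the number of rows currently stored."""
    return conn.execute("SELECT COUNT(*) FROM listing").fetchone()[0]


def insert_codes(conn, codes):
    """Insert codes, silently skipping ones already stored.

    Returns how many rows were actually added.
    """
    before = count_rows(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO listing (house_code) VALUES (?)",
        [(code,) for code in codes],
    )
    conn.commit()
    return count_rows(conn) - before
```

Comparing the return value of `insert_codes` with the number of codes passed in shows how many entries on a page were already known.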

But during the crawl, when I change the page number after pg, a large proportion of the listing codes come back repeated. Each page has 30 entries and there are roughly 100 pages, yet the final result is only a little over 300 distinct listings. (At first I thought I had written the code wrong, so I checked with a single-threaded loop over the page numbers to see whether the responses themselves were the problem; printing them showed that many different pages return the same IDs.)
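The print-based check described above can be made more systematic. The sketch below assumes the codes have already been collected per page into a dictionary (page number to list of codes); the function name `duplicate_report` is hypothetical. It reports how many entries were seen in total, how many are distinct, and which codes appear on more than one page.

```python
from collections import Counter


def duplicate_report(codes_by_page):
    """Summarise repetition across pages.

    codes_by_page maps a page number to the list of codes found on it.
    """
    total = sum(len(codes) for codes in codes_by_page.values())
    # Count on how many distinct pages each code shows up.
    page_counts = Counter(
        code for codes in codes_by_page.values() for code in set(codes)
    )
    repeated = {code: n for code, n in page_counts.items() if n > 1}
    return {"total": total, "unique": len(page_counts), "repeated": repeated}
```

If `unique` stays far below `total` even for a single-threaded run, the repetition comes from the server's responses rather than from the crawler's own bookkeeping.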

I then suspected that a recommendation system was responsible, so I logged in and sent the cookie with my requests, but the result was broadly the same.

How can I solve this problem? Thank you.

CodePudding user response:

Don't do any complex processing during the crawl itself. Fetch the HTML of a page, store it as-is, then fetch the HTML of the next page; this saves time and makes maximum use of the network.
Then run a separate program to extract the data from the stored HTML. Deduplication can be done at that stage, and all the complex operations are handled by that program.
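The two-stage approach suggested here could look roughly like the sketch below. Several things are assumptions for illustration: the `fetch` callable (in practice it might wrap an HTTP GET for one page number), the file naming `pg<N>.html`, and especially `CODE_PATTERN`, which supposes each listing carries its code in a `data-house_code="..."` attribute. The real markup may differ and should be checked first.

```python
from __future__ import annotations

import re
from pathlib import Path
from typing import Callable, Iterable

# Hypothetical pattern: assumes the listing code sits in a
# data-house_code="..." attribute. Adjust it to the real markup.
CODE_PATTERN = re.compile(r'data-house_code="([^"]+)"')


def save_raw_pages(
    fetch: Callable[[int], str], pages: Iterable[int], out_dir: Path
) -> None:
    """Stage 1: fetch each page and write its HTML to disk untouched."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for page in pages:
        html = fetch(page)
        (out_dir / f"pg{page}.html").write_text(html, encoding="utf-8")


def extract_unique_codes(out_dir: Path) -> list[str]:
    """Stage 2: parse the saved files and drop repeated listing codes."""
    seen: set[str] = set()
    unique: list[str] = []
    for path in sorted(out_dir.glob("pg*.html")):
        text = path.read_text(encoding="utf-8")
        for code in CODE_PATTERN.findall(text):
            if code not in seen:
                seen.add(code)
                unique.append(code)
    return unique
```

Because the raw HTML stays on disk, the extraction and deduplication step can be rerun or corrected without requesting the pages again.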