Multithreading or Multiprocessing for web-scrapers


I am dabbling with some web scrapers (5 total). Each scraper accesses a different site; some use Selenium and others do not. Some take 30 seconds to run while others can take up to 45 minutes.

What I would like to do is minimize the time it takes to run these scrapers. Would multithreading be the way to go about this? From the reading I've done, it seems like I could just make a thread pool and pass every scraper to that pool for processing.

Or would multiprocessing be a better approach to running all these scrapers in the shortest amount of time?
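
Roughly what I mean by the thread-pool idea above, as a sketch (the scraper functions here are just placeholders for my real ones):

    from concurrent.futures import ThreadPoolExecutor, as_completed

    # Placeholder scraper functions; each one would contain a real scraper.
    def scrape_site_a():
        return "data from site A"

    def scrape_site_b():
        return "data from site B"

    scrapers = [scrape_site_a, scrape_site_b]

    # One worker per scraper so they all run concurrently.
    with ThreadPoolExecutor(max_workers=len(scrapers)) as pool:
        futures = {pool.submit(scraper): scraper.__name__ for scraper in scrapers}
        for future in as_completed(futures):
            print(futures[future], "finished:", future.result())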

CodePudding user response:

I would bet multiprocessing is the way to go: with multithreading you will be sharing memory and CPU time within a single process, whereas with multiprocessing you'll divide the load between different cores, which makes it a lot faster when dealing with loads of data.
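
As a rough sketch of that with the standard library (the scraper functions are placeholders, and each has to be a top-level function so it can be pickled for the worker processes):

    from concurrent.futures import ProcessPoolExecutor

    # Placeholder top-level scraper functions; each runs in its own process.
    def scrape_site_a():
        return "data from site A"

    def scrape_site_b():
        return "data from site B"

    if __name__ == "__main__":
        scrapers = [scrape_site_a, scrape_site_b]
        with ProcessPoolExecutor(max_workers=len(scrapers)) as pool:
            futures = [pool.submit(scraper) for scraper in scrapers]
            for future in futures:
                print(future.result())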

Check out this video; it is very informative and well illustrated, which helps a lot with understanding: https://www.youtube.com/watch?v=AZnGRKFUU0c

CodePudding user response:

It really depends on your use cases.

For a normal use case running on a local machine, multithreading would be enough. Note that firing too many requests at once usually won't speed up your scraping, because much of the web nowadays is protected by Cloudflare. Separating the web scrapers that need Selenium from the ones that don't also helps a lot, because Selenium is very slow.
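
One way to keep them separate, as a rough sketch (the functions, URLs, and worker counts are just examples):

    from concurrent.futures import ThreadPoolExecutor

    def requests_scraper(url):
        # Placeholder for a lightweight requests-based scraper.
        return f"scraped {url} with requests"

    def selenium_scraper(url):
        # Placeholder for a slow Selenium-based scraper.
        return f"scraped {url} with selenium"

    requests_urls = ["https://example.com/a", "https://example.com/b"]
    selenium_urls = ["https://example.com/c"]

    # Keep the slow Selenium scrapers in their own small pool so they
    # don't hold back the lightweight requests-based ones.
    with ThreadPoolExecutor(max_workers=4) as fast_pool, \
         ThreadPoolExecutor(max_workers=1) as slow_pool:
        fast_futures = [fast_pool.submit(requests_scraper, u) for u in requests_urls]
        slow_futures = [slow_pool.submit(selenium_scraper, u) for u in selenium_urls]

    print([f.result() for f in fast_futures + slow_futures])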

For intensive web scraping, especially at large scale, each scraper should run separately in a small container or instance (AWS EC2, for example), and you can control how many instances of a specific scraper you want to run. Doing this also gives you the possibility to control each scraper's IP address, to avoid blacklisting and request rate limits.

For Python, I recommend using https://scrapy.org/
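
A minimal spider, roughly following the Scrapy tutorial (the target site and selectors are just an example):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # Yield one item per quote block on the page.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }

You would run it with something like scrapy runspider quotes_spider.py -o quotes.json, and Scrapy handles the concurrency, retries, and throttling for you.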

CodePudding user response:

I would probably look into asyncio for this task, since most of the work is waiting for the site to respond. I would also recommend looking into Beautiful Soup for working with the page content, and just grabbing the page itself with the requests module.

It could make your code ridiculously faster.
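
A rough sketch of that combination, pushing the blocking requests calls onto worker threads with asyncio.to_thread (Python 3.9+; the URLs and parsing are placeholders):

    import asyncio

    import requests
    from bs4 import BeautifulSoup

    URLS = ["https://example.com/page1", "https://example.com/page2"]

    def fetch_and_parse(url):
        # Blocking fetch + parse; run in a worker thread so the event loop stays free.
        response = requests.get(url, timeout=30)
        soup = BeautifulSoup(response.text, "html.parser")
        return url, soup.title.string if soup.title else None

    async def main():
        results = await asyncio.gather(
            *(asyncio.to_thread(fetch_and_parse, url) for url in URLS)
        )
        for url, title in results:
            print(url, "->", title)

    asyncio.run(main())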

I would not recommend multiprocessing if you're using the data in a more complex way, because memory is not shared across different interpreters in Python, and neither are global variables. You would have to write the data to a file or a database to make it available; another way to go would be a multiprocessing queue. Since this is essentially an I/O-heavy task and not a computationally heavy one, multiprocessing is a definite no.

Multiprocessing requires starting up a whole new process, with its own memory allocations and its own Python interpreter. This takes much longer and doesn't make sense for this use case. Plus, how many processes can you start before your CPU goes crazy?

Watch this for a much faster way of doing web scraping: https://www.youtube.com/watch?v=nFn4_nA_yk8

Use this to get all your data and store it in a list or tuple, and then use a multiprocessing pool for working with the data; that should be a little faster than doing it all together.
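
Something along those lines, as a sketch (the processing function and the scraped pages are placeholders):

    from multiprocessing import Pool

    def process_page(html):
        # CPU-heavy post-processing of one scraped page (placeholder).
        return len(html)

    if __name__ == "__main__":
        # 'pages' would be the list of raw pages collected by the scrapers.
        pages = ["<html>page one</html>", "<html>page two</html>"]
        with Pool() as pool:
            results = pool.map(process_page, pages)
        print(results)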
