Let's suppose I want to open 5 bots at the same time (all of them written in Python, each one identical to the others) that run web scraping:
from selenium import webdriver
import time

# build the Edge options before creating the driver
options = webdriver.EdgeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--allow-running-insecure-content')

# create a single driver instance (msedgedriver.exe must be on PATH)
driver = webdriver.Edge(options=options)
driver.get('https://youtu.be/Ykvf7oR0JBY')
time.sleep(30)
driver.close()
Using Java, can I use multiprocessing (or something similar, multithreading for example) to run the 5 bots in parallel at the same time, each bot in a different terminal, working completely in parallel?
If that is possible, how can I do it?
If not, what language or framework would be able to perform this task? Is it possible in any language?
CodePudding user response:
As mentioned in @Aaron's comment, Python's standard library has both process-based and thread-based parallelism support through multiprocessing and threading respectively. The documentation gives plenty of examples of how to execute functions in parallel, so you wouldn't need to run something else to execute Python processes in parallel: the Python processes can already run in parallel by themselves.
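For instance, a minimal sketch using multiprocessing.Pool could look like the following (the scrape() function and the repeated URL list are placeholders I'm assuming here, not part of the question's code):

import time
from multiprocessing import Pool
from selenium import webdriver

def scrape(url):
    # each worker process starts its own Edge instance
    options = webdriver.EdgeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--allow-running-insecure-content')
    driver = webdriver.Edge(options=options)
    try:
        driver.get(url)
        time.sleep(30)
        return driver.title
    finally:
        driver.quit()

if __name__ == "__main__":
    urls = ['https://youtu.be/Ykvf7oR0JBY'] * 5  # the same page opened by 5 bots
    with Pool(processes=5) as pool:              # 5 worker processes in parallel
        titles = pool.map(scrape, urls)
    print(titles)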
Regardless, you can execute multiple command-line programs in parallel using xargs from GNU Findutils by specifying the --max-args and --max-procs command line options (for more, see the reference). With that, you could have a list of websites you wish to scrape in a text file and then do
cat links-to-scrap.txt | xargs --max-lines=1 --max-procs=4 python scrapper.py
and adapt your Python script to take the link to scrape as a command-line argument (xargs appends each line to the command it runs).
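A hypothetical scrapper.py for that pipeline might look like this (the sys.argv handling and the use of the page title are assumptions about how you structure the script, not something from the original answer):

import sys
import time
from selenium import webdriver

def main():
    url = sys.argv[1]  # the link appended by xargs
    options = webdriver.EdgeOptions()
    options.add_argument('--ignore-certificate-errors')
    driver = webdriver.Edge(options=options)
    try:
        driver.get(url)
        time.sleep(30)
        print(driver.title)
    finally:
        driver.quit()

if __name__ == "__main__":
    main()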
Just to make it very explicit: xargs is language-independent, so you can execute any CLI application in parallel with it. There are many useful examples of that in its man page.
Finally, as a bonus, you may wish to look at executing the programs concurrently, rather than in parallel, using asynchronous programming with asyncio or aiohttp, but I don't know how you could make that work with Selenium (though if you only wish to get the source code from websites, you could use aiohttp with Beautiful Soup).
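As an illustration of that last idea, here is a minimal aiohttp + Beautiful Soup sketch; the URL list and the choice to extract the page title are assumptions made for the example:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch_title(session, url):
    # download the page source and parse it, without a browser
    async with session.get(url) as response:
        html = await response.text()
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None

async def main():
    urls = ["https://example.com"] * 5  # placeholder links
    async with aiohttp.ClientSession() as session:
        titles = await asyncio.gather(*(fetch_title(session, u) for u in urls))
    print(titles)

if __name__ == "__main__":
    asyncio.run(main())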
CodePudding user response:
Here's an example, stripped down from an existing project of mine, of using multiple instances of Selenium to scrape webpages. It is not impossible, as you may have been led to believe; I'm frankly not sure where that notion would come from in the first place. Python is a first-class programming language with the capability to do just about anything. Compiled languages may be faster than interpreted ones, but there is no inherent limitation that restricts Python to a single process.
import re
from multiprocessing import Process, Queue
from selenium import webdriver

def worker(q: Queue, manifest: dict, mod_dir: str):
    #### set up chromedriver
    options = webdriver.ChromeOptions()
    prefs = {
        "download.default_directory": mod_dir,
        "safebrowsing.enabled": "false",
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
    }
    options.add_experimental_option("prefs", prefs)
    # download ChromeDriver pursuant to your version of Chrome
    driver = webdriver.Chrome(options=options)  # optional argument; if not specified, the PATH is searched
    #### iterate over links until the None sentinel is received
    for i, link in iter(q.get, None):
        driver.get(link)
        # time.sleep(1)
        source = driver.page_source
        #### find the Project ID in the page source
        m = re.search(r"<span>Project ID</span>.*?<span>(\d+)</span>", source, flags=re.DOTALL | re.MULTILINE)
        # ...
    driver.quit()

if __name__ == "__main__":
    manifest = {...}  # some config data
    links = [...]  # links to scrape
    mod_dir = "..."  # download directory passed to each worker
    n_workers = 4
    q = Queue()
    procs = [Process(target=worker, args=(q, manifest, mod_dir)) for _ in range(n_workers)]
    for p in procs:
        p.start()
    for i, l in enumerate(links):
        q.put((i, l))
    for p in procs:  # one sentinel per worker so every process exits its loop
        q.put(None)
    for p in procs:
        p.join()