Let's suppose I want to open 5 bots at the same time (all of them written in Python, each one identical to the others) that run web scraping:
from selenium import webdriver
import time

# build the Edge options before creating the driver
options = webdriver.EdgeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--allow-running-insecure-content')

# create a single driver instance (msedgedriver.exe must be on PATH)
driver = webdriver.Edge(options=options)
driver.get('https://youtu.be/Ykvf7oR0JBY')
time.sleep(30)
driver.close()
Using Java, can I use multiprocessing (or something similar, multithreading for example) to run the 5 bots in parallel at the same time, each bot in a different terminal, working completely in parallel?
If that is possible, how can I do it?
If not, what language or framework would be able to perform this task? Is it possible in any language?
CodePudding user response:
As mentioned in @Aaron's comment, Python's standard library has both process-based and thread-based parallelism support through multiprocessing and threading respectively. The documentation gives plenty of examples of how to execute functions in parallel, so you wouldn't need to run something else to execute Python processes in parallel: the Python processes can already run in parallel by themselves.
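For instance, a minimal sketch using multiprocessing.Pool could look like the following (the scrape() function and the repeated URL list are placeholders I'm assuming here, not part of the question's code):

import time
from multiprocessing import Pool
from selenium import webdriver

def scrape(url):
    # each worker process starts its own Edge instance
    options = webdriver.EdgeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--allow-running-insecure-content')
    driver = webdriver.Edge(options=options)
    try:
        driver.get(url)
        time.sleep(30)
        return driver.title
    finally:
        driver.quit()

if __name__ == "__main__":
    urls = ['https://youtu.be/Ykvf7oR0JBY'] * 5  # the same page opened by 5 bots
    with Pool(processes=5) as pool:              # 5 worker processes in parallel
        titles = pool.map(scrape, urls)
    print(titles)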
Regardless, you can execute multiple command-line programs in parallel using xargs from GNU Findutils by specifying the --max-args and --max-procs command line options (for more, see the reference). With that, you could have a list of websites you wish to scrape in a text file and then do
cat links-to-scrap.txt | xargs --max-lines=1 --max-procs=4 python scrapper.py
and adapt your Python script to take the link to scrape as a command-line argument (xargs appends each line to the command it runs).
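A hypothetical scrapper.py for that pipeline might look like this (the sys.argv handling and the use of the page title are assumptions about how you structure the script, not something from the original answer):

import sys
import time
from selenium import webdriver

def main():
    url = sys.argv[1]  # the link appended by xargs
    options = webdriver.EdgeOptions()
    options.add_argument('--ignore-certificate-errors')
    driver = webdriver.Edge(options=options)
    try:
        driver.get(url)
        time.sleep(30)
        print(driver.title)
    finally:
        driver.quit()

if __name__ == "__main__":
    main()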
Just to make it very explicit: xargs is language-independent, so you can execute any CLI application in parallel with it. There are many useful examples of that in its man page.
Finally, as a bonus, you may wish to look at executing the programs concurrently, rather than in parallel, using asynchronous programming with asyncio or aiohttp, but I don't know how you could make that work with Selenium (though if you only wish to get the source code from websites, you could use aiohttp with Beautiful Soup).
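As an illustration of that last idea, here is a minimal aiohttp + Beautiful Soup sketch; the URL list and the choice to extract the page title are assumptions made for the example:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch_title(session, url):
    # download the page source and parse it, without a browser
    async with session.get(url) as response:
        html = await response.text()
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None

async def main():
    urls = ["https://example.com"] * 5  # placeholder links
    async with aiohttp.ClientSession() as session:
        titles = await asyncio.gather(*(fetch_title(session, u) for u in urls))
    print(titles)

if __name__ == "__main__":
    asyncio.run(main())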
CodePudding user response:
Here's an example, stripped down from an existing project of mine, of using multiple instances of Selenium to scrape webpages. It is not impossible, as you may have been led to believe; I'm frankly not sure where that notion would come from in the first place. Python is a first-class programming language with the capability to do just about anything. Compiled languages may be faster than interpreted ones, but there is no inherent limitation that restricts Python to a single process.
import re
from multiprocessing import Process, Queue
from selenium import webdriver

def worker(q: Queue, manifest: dict, mod_dir: str):
    #### set up chromedriver
    options = webdriver.ChromeOptions()
    prefs = {
        "download.default_directory": mod_dir,
        "safebrowsing.enabled": "false",
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
    }
    options.add_experimental_option("prefs", prefs)
    # download ChromeDriver pursuant to your version of Chrome
    driver = webdriver.Chrome(options=options)  # optional argument; if not specified, the PATH is searched
    #### iterate over links until the None sentinel is received
    for i, link in iter(q.get, None):
        driver.get(link)
        # time.sleep(1)
        source = driver.page_source
        #### find the Project ID in the page source
        m = re.search(r"<span>Project ID</span>.*?<span>(\d+)</span>", source, flags=re.DOTALL | re.MULTILINE)
        # ...
    driver.quit()

if __name__ == "__main__":
    manifest = {...}  # some config data
    links = [...]  # links to scrape
    mod_dir = "..."  # download directory passed to each worker
    n_workers = 4
    q = Queue()
    procs = [Process(target=worker, args=(q, manifest, mod_dir)) for _ in range(n_workers)]
    for p in procs:
        p.start()
    for i, l in enumerate(links):
        q.put((i, l))
    for p in procs:  # one sentinel per worker so every process exits its loop
        q.put(None)
    for p in procs:
        p.join()