Parallelize checking of dead URLs

The question is quite simple: is it possible to test a list of URLs and store only the dead ones (response code > 400) in a list, using asynchronous functions?

I previously used the requests library to do this and it works great, but I have a big list of URLs to test, and doing it sequentially takes more than an hour.

I have seen a lot of articles on how to make parallel requests using asyncio and aiohttp, but I didn't see much about how to test URLs with these libraries.

Is it possible to do it?

CodePudding user response:

You could do something like this using aiohttp and asyncio.

It could be done more pythonically, I guess, but this should work.

import aiohttp
import asyncio

urls = ['url1', 'url2']


async def test_url(session, url):
    # Return the URL if the response status marks it as dead, otherwise None.
    async with session.get(url) as resp:
        if resp.status > 400:
            return url


async def main():
    async with aiohttp.ClientSession() as session:
        tasks = []
        for url in urls:
            tasks.append(asyncio.ensure_future(test_url(session, url)))
        # gather() yields None for live URLs, so keep only the dead ones.
        dead_urls = [url for url in await asyncio.gather(*tasks) if url is not None]
        print(dead_urls)


asyncio.run(main())
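
If some of the URLs point to hosts that cannot be reached at all, session.get will raise an exception instead of returning a status, and gather will propagate it. Below is a sketch of a variant of test_url that also counts such failures as dead (treating every aiohttp.ClientError and timeout as dead, with an arbitrary 10-second limit, is my assumption):

async def test_url(session, url):
    # Count HTTP errors, connection failures and timeouts all as "dead".
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            if resp.status > 400:
                return url
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return url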

CodePudding user response:

Very basic example, but this is how I would solve it:

from aiohttp import ClientSession
from asyncio import create_task, gather, run

async def TestUrl(url, session):
    async with session.get(url) as response:
        if response.status >= 400:
            r = await response.text()
            print(f"Site: {url} is dead, response code: {str(response.status)} response text: {r}")

async def TestUrls(urls):
    resultsList: list = []
    async with ClientSession() as session:
        # Maybe some rate limiting? (see the semaphore sketch after this block)
        partitionTasks: list = [
             create_task(TestUrl(url, session))
             for url in urls]
        resultsList.append(await gather(*partitionTasks, return_exceptions=False))
    # do stuff with the results or return?
    return resultsList

async def main():
    urls = []
    test = await TestUrls(urls)

if __name__ == "__main__":
    run(main())
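
For the rate limiting hinted at in the comment, an asyncio.Semaphore is one option: it caps how many requests are in flight at once. A minimal sketch reusing TestUrl and the imports from above (the TestUrlLimited helper and the limit of 10 concurrent requests are my own assumptions):

from asyncio import Semaphore

async def TestUrlLimited(url, session, limiter):
    # The semaphore only lets a fixed number of requests run concurrently.
    async with limiter:
        await TestUrl(url, session)

async def TestUrls(urls):
    limiter = Semaphore(10)
    async with ClientSession() as session:
        tasks = [create_task(TestUrlLimited(url, session, limiter)) for url in urls]
        await gather(*tasks)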

CodePudding user response:

Try using a ThreadPoolExecutor

from concurrent.futures import ThreadPoolExecutor
import requests

url_list=[
    "https://www.google.com",
    "https://www.adsadasdad.com",
    "https://www.14fsdfsff.com",
    "https://www.ggr723tg.com",
    "https://www.yyyyyyyyyyyyyyy.com",
    "https://www.78sdf8sf5sf45sf.com",
    "https://www.wikipedia.com",
    "https://www.464dfgdfg235345.com",
    "https://www.tttllldjfh.com",
    "https://www.qqqqqqqqqq456.com"
]

def check(url):
    r = requests.get(url)
    if r.status_code < 400:
        print(f"{url} is ALIVE")

with ThreadPoolExecutor(max_workers=5) as e:
    for url in url_list:
        e.submit(check, url)
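
This prints the live URLs; if you also want the dead ones collected in a list, as the question asks, check can return a value and the futures' results can be gathered. A sketch reusing the imports and url_list from above, with connection failures counted as dead and an arbitrary 10-second timeout:

def check(url):
    # Return the URL if it is dead (HTTP error or unreachable), otherwise None.
    try:
        r = requests.get(url, timeout=10)
        return url if r.status_code >= 400 else None
    except requests.exceptions.RequestException:
        return url

with ThreadPoolExecutor(max_workers=5) as e:
    futures = [e.submit(check, url) for url in url_list]
results = [f.result() for f in futures]
dead_urls = [url for url in results if url is not None]
print(dead_urls)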

CodePudding user response:

Using multithreading, you could do it like this:

import requests
from concurrent.futures import ThreadPoolExecutor

results = dict()

# test the given url 
# add url and status code to the results dictionary if GET succeeds but status code >= 400
# also add url to results dictionary if an exception arises with full exception details
def test_url(url):
    try:
        r = requests.get(url)
        if r.status_code >= 400:
            results[url] = f'{r.status_code=}'
    except requests.exceptions.RequestException as e:
        results[url] = str(e)

# return a list of URLs to be checked. probably get these from a file in reality
def get_list_of_urls():
    return ['https://facebook.com', 'https://google.com', 'http://google.com/nonsense', 'http://goooglyeyes.org']

def main():
    with ThreadPoolExecutor() as executor:
        executor.map(test_url, get_list_of_urls())
    print(results)

if __name__ == '__main__':
    main()
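
If all you need is the list of dead URLs rather than the full diagnostics, the keys of the results dictionary are exactly that; for example, after the executor block:

dead_urls = list(results)
print(dead_urls)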

CodePudding user response:

Multiprocessing could be a better option for your problem.

from multiprocessing import Process
from multiprocessing import Manager
import requests

def checkURLStatus(url, url_status):
    res = requests.get(url)
    if res.status_code >= 400:
        url_status[url] = "Inactive"
    else:
        url_status[url] = "Active"

if __name__ == "__main__":
    urls = [
        "https://www.google.com"
    ]
    manager = Manager()
    # to store the results for later usage
    url_status = manager.dict()

    procs = []

    for url in urls:
        proc = Process(target=checkURLStatus, args=(url, url_status))
        procs.append(proc)
        proc.start()
    
    for proc in procs:
        proc.join()
    print(url_status.values())

url_status is a managed dictionary shared across the separate processes to store the results. See the multiprocessing.Manager documentation for more info.
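
With a big list of URLs, spawning one Process per URL gets expensive; a multiprocessing.Pool keeps the same idea with a bounded number of workers. A sketch under that assumption (check_url_status is a renamed variant of checkURLStatus above; the worker count of 8, the 10-second timeout and counting connection failures as Inactive are my own choices):

from multiprocessing import Pool
import requests

def check_url_status(url):
    # Return (url, status) so the parent process can build the dictionary itself.
    try:
        res = requests.get(url, timeout=10)
        return url, "Inactive" if res.status_code >= 400 else "Active"
    except requests.exceptions.RequestException:
        return url, "Inactive"

if __name__ == "__main__":
    urls = ["https://www.google.com"]
    with Pool(processes=8) as pool:
        url_status = dict(pool.map(check_url_status, urls))
    print(url_status)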
