Parallelize checking of dead URLs

The question is quite simple: is it possible to test a list of URLs and store only the dead ones (response code > 400) in a list, using asynchronous functions?

I previously used the requests library to do this and it works great, but I have a big list of URLs to test, and doing it sequentially takes more than an hour.

I have seen a lot of articles on how to make parallel requests using asyncio and aiohttp, but I didn't see much about how to test URLs with these libraries.

Is it possible to do it?

CodePudding user response:

You could do something like this using aiohttp and asyncio.

It could be done more pythonically, I guess, but this should work.

import aiohttp
import asyncio

urls = ['url1', 'url2']


async def test_url(session, url):
    # Return the URL if the response status marks it as dead, otherwise None.
    async with session.get(url) as resp:
        if resp.status > 400:
            return url


async def main():
    async with aiohttp.ClientSession() as session:
        tasks = []
        for url in urls:
            tasks.append(asyncio.ensure_future(test_url(session, url)))
        # gather() yields None for live URLs, so keep only the dead ones.
        dead_urls = [url for url in await asyncio.gather(*tasks) if url is not None]
        print(dead_urls)


asyncio.run(main())
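
If some of the URLs point to hosts that cannot be reached at all, session.get will raise an exception instead of returning a status, and gather will propagate it. Below is a sketch of a variant of test_url that also counts such failures as dead (treating every aiohttp.ClientError and timeout as dead, with an arbitrary 10-second limit, is my assumption):

async def test_url(session, url):
    # Count HTTP errors, connection failures and timeouts all as "dead".
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            if resp.status > 400:
                return url
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return url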

CodePudding user response:

Very basic example, but this is how I would solve it:

from aiohttp import ClientSession
from asyncio import create_task, gather, run

async def TestUrl(url, session):
    async with session.get(url) as response:
        if response.status >= 400:
            r = await response.text()
            print(f"Site: {url} is dead, response code: {str(response.status)} response text: {r}")

async def TestUrls(urls):
    resultsList: list = []
    async with ClientSession() as session:
        # Maybe some rate limiting? (see the semaphore sketch after this block)
        partitionTasks: list = [
             create_task(TestUrl(url, session))
             for url in urls]
        resultsList.append(await gather(*partitionTasks, return_exceptions=False))
    # do stuff with the results or return?
    return resultsList

async def main():
    urls = []
    test = await TestUrls(urls)

if __name__ == "__main__":
    run(main())
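
For the rate limiting hinted at in the comment, an asyncio.Semaphore is one option: it caps how many requests are in flight at once. A minimal sketch reusing TestUrl and the imports from above (the TestUrlLimited helper and the limit of 10 concurrent requests are my own assumptions):

from asyncio import Semaphore

async def TestUrlLimited(url, session, limiter):
    # The semaphore only lets a fixed number of requests run concurrently.
    async with limiter:
        await TestUrl(url, session)

async def TestUrls(urls):
    limiter = Semaphore(10)
    async with ClientSession() as session:
        tasks = [create_task(TestUrlLimited(url, session, limiter)) for url in urls]
        await gather(*tasks)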

CodePudding user response:

Try using a ThreadPoolExecutor

from concurrent.futures import ThreadPoolExecutor
import requests

url_list=[
    "https://www.google.com",
    "https://www.adsadasdad.com",
    "https://www.14fsdfsff.com",
    "https://www.ggr723tg.com",
    "https://www.yyyyyyyyyyyyyyy.com",
    "https://www.78sdf8sf5sf45sf.com",
    "https://www.wikipedia.com",
    "https://www.464dfgdfg235345.com",
    "https://www.tttllldjfh.com",
    "https://www.qqqqqqqqqq456.com"
]

def check(url):
    r = requests.get(url)
    if r.status_code < 400:
        print(f"{url} is ALIVE")

with ThreadPoolExecutor(max_workers=5) as e:
    for url in url_list:
        e.submit(check, url)
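
This prints the live URLs; if you also want the dead ones collected in a list, as the question asks, check can return a value and the futures' results can be gathered. A sketch reusing the imports and url_list from above, with connection failures counted as dead and an arbitrary 10-second timeout:

def check(url):
    # Return the URL if it is dead (HTTP error or unreachable), otherwise None.
    try:
        r = requests.get(url, timeout=10)
        return url if r.status_code >= 400 else None
    except requests.exceptions.RequestException:
        return url

with ThreadPoolExecutor(max_workers=5) as e:
    futures = [e.submit(check, url) for url in url_list]
results = [f.result() for f in futures]
dead_urls = [url for url in results if url is not None]
print(dead_urls)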

CodePudding user response:

Using multithreading, you could do it like this:

import requests
from concurrent.futures import ThreadPoolExecutor

results = dict()

# test the given url 
# add url and status code to the results dictionary if GET succeeds but status code >= 400
# also add url to results dictionary if an exception arises with full exception details
def test_url(url):
    try:
        r = requests.get(url)
        if r.status_code >= 400:
            results[url] = f'{r.status_code=}'
    except requests.exceptions.RequestException as e:
        results[url] = str(e)

# return a list of URLs to be checked. probably get these from a file in reality
def get_list_of_urls():
    return ['https://facebook.com', 'https://google.com', 'http://google.com/nonsense', 'http://goooglyeyes.org']

def main():
    with ThreadPoolExecutor() as executor:
        executor.map(test_url, get_list_of_urls())
    print(results)

if __name__ == '__main__':
    main()
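
If all you need is the list of dead URLs rather than the full diagnostics, the keys of the results dictionary are exactly that; for example, after the executor block:

dead_urls = list(results)
print(dead_urls)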

CodePudding user response:

Multiprocessing could be a better option for your problem.

from multiprocessing import Process
from multiprocessing import Manager
import requests

def checkURLStatus(url, url_status):
    res = requests.get(url)
    if res.status_code >= 400:
        url_status[url] = "Inactive"
    else:
        url_status[url] = "Active"

if __name__ == "__main__":
    urls = [
        "https://www.google.com"
    ]
    manager = Manager()
    # to store the results for later usage
    url_status = manager.dict()

    procs = []

    for url in urls:
        proc = Process(target=checkURLStatus, args=(url, url_status))
        procs.append(proc)
        proc.start()
    
    for proc in procs:
        proc.join()
    print(url_status.values())

url_status is a managed dictionary shared across the separate processes to store the results. See the multiprocessing.Manager documentation for more info.
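
With a big list of URLs, spawning one Process per URL gets expensive; a multiprocessing.Pool keeps the same idea with a bounded number of workers. A sketch under that assumption (check_url_status is a renamed variant of checkURLStatus above; the worker count of 8, the 10-second timeout and counting connection failures as Inactive are my own choices):

from multiprocessing import Pool
import requests

def check_url_status(url):
    # Return (url, status) so the parent process can build the dictionary itself.
    try:
        res = requests.get(url, timeout=10)
        return url, "Inactive" if res.status_code >= 400 else "Active"
    except requests.exceptions.RequestException:
        return url, "Inactive"

if __name__ == "__main__":
    urls = ["https://www.google.com"]
    with Pool(processes=8) as pool:
        url_status = dict(pool.map(check_url_status, urls))
    print(url_status)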
