How to remove a URL from monitoring while the script is running?

I have written a script that monitors some webpages and prints a notification whenever a specific HTML tag is found. The script is meant to run 24/7, and I want to be able to remove a URL while it is running. I currently have a database from which I am going to read the URLs that are added or removed.

import threading

import requests
from bs4 import BeautifulSoup

# Replacement for database for now
URLS = [
    'https://github.com/search?q=hello world',
    'https://github.com/search?q=python 3',
    'https://github.com/search?q=world',
    'https://github.com/search?q=i love python',
]


def doRequest(url):
    while True:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')

            if soup.find("span", {"data-search-type": "Repositories"}).text.strip():  # if a repository count is shown
                sendNotifications({
                    'title': soup.find("input", {"name": "q"})['value'],
                    'repo_count': soup.find("span", {"data-search-type": "Repositories"}).text.strip()
                })
        else:
            print(url, response.status_code)


def sendNotifications(data):
    ...


if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    for url in URLS:
        threading.Thread(target=doRequest, args=(url,)).start()

The problem I'm currently facing is that doRequest runs in a while loop all the time, and I wonder how I can remove a specific URL while the script is running, e.g. https://github.com/search?q=world

CodePudding user response：

Method 1: A simple approach

What you want is to insert some termination logic in the while True loop so that it constantly checks for a termination signal.

To this end, you can use threading.Event().

For example, you can add a stopping_event argument:

def doRequest(url, stopping_event):
    while not stopping_event.is_set():
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')

            if soup.find("span", {"data-search-type": "Repositories"}).text.strip():  # if a repository count is shown
                sendNotifications({
                    'title': soup.find("input", {"name": "q"})['value'],
                    'repo_count': soup.find("span", {"data-search-type": "Repositories"}).text.strip()
                })
        else:
            print(url, response.status_code)

You create these events when starting the threads:

if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    stopping_events = {url: threading.Event() for url in URLS}

    for url in URLS:
        threading.Thread(target=doRequest, args=(url, stopping_events[url])).start()

Whenever you want to stop/remove a particular URL, you can just call:

stopping_events[url].set()

The while loop for that URL then exits the next time it checks the event, that is, once the request in progress has finished.
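The loop in doRequest re-requests the page as fast as it can. A minimal sketch, assuming a pause between polls is acceptable, is to use stopping_event.wait(timeout) as an interruptible sleep. The names monitor, interval, and results are hypothetical, and the HTTP request is replaced by a stand-in so the snippet runs on its own:

```python
import threading
import time


def monitor(url, stopping_event, results, interval=0.05):
    """Poll `url` until its stopping_event is set (hypothetical helper)."""
    while not stopping_event.is_set():
        results.append(url)  # stand-in for requests.get(url) and the parsing
        # Event.wait() acts as an interruptible sleep: it returns as soon as
        # set() is called, or after `interval` seconds otherwise.
        stopping_event.wait(interval)


def demo():
    results = []
    event = threading.Event()
    worker = threading.Thread(
        target=monitor, args=("https://example.com/a", event, results)
    )
    worker.start()
    time.sleep(0.2)         # let the worker poll a few times
    event.set()             # request removal of this URL
    worker.join(timeout=2)  # the worker should exit almost immediately
    return worker.is_alive(), len(results)


if __name__ == "__main__":
    print(demo())
```

Compared with time.sleep, the worker does not have to finish a full pause before it notices that it was asked to stop.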

You can even create a separate thread that waits for user input to stop a particular URL:

def manager(stopping_events):
    while True:
        url = input('url to stop: ')
        if url in stopping_events:
            stopping_events[url].set()

if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    stopping_events = {url: threading.Event() for url in URLS}

    for url in URLS:
        threading.Thread(target=doRequest, args=(url, stopping_events[url])).start()
    threading.Thread(target=manager, args=(stopping_events,)).start()
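Since the URLs are eventually meant to come from a database, the same idea could be driven by the stored list instead of by user input. The following is only a sketch under that assumption: sync_monitors and start_worker are hypothetical names, and the caller is expected to pass in whatever URLs the database currently holds.

```python
import threading


def sync_monitors(wanted_urls, stopping_events, start_worker):
    """Bring running monitors in line with `wanted_urls` (hypothetical helper).

    stopping_events maps each monitored URL to its threading.Event.
    start_worker(url, event) is supplied by the caller and starts a monitor.
    """
    wanted = set(wanted_urls)

    # Stop monitors whose URL is no longer wanted.
    for url in list(stopping_events):
        if url not in wanted:
            stopping_events.pop(url).set()

    # Start monitors for URLs that are not being monitored yet.
    for url in wanted:
        if url not in stopping_events:
            event = threading.Event()
            stopping_events[url] = event
            start_worker(url, event)


if __name__ == "__main__":
    events = {}
    started = []
    sync_monitors(["a", "b"], events, lambda url, event: started.append(url))
    sync_monitors(["a", "c"], events, lambda url, event: started.append(url))
    print(sorted(events), sorted(started))
```

In the script above, start_worker would start a thread with target=doRequest and args=(url, event), and sync_monitors would be called periodically with the URLs read from the database.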

Method 2: A cleaner approach

Instead of having a fixed list of URLs, you can have a thread that keeps reading the list of URLs and feeds them to the processing threads. This is the Producer-Consumer pattern. Now you don't really remove any URL; you simply keep processing the latest list of URLs from the database, which automatically takes care of newly added and deleted URLs.

import queue
import threading

import requests
from bs4 import BeautifulSoup


# Replacement for database for now
def get_urls_from_db(q: queue.Queue):
    while True:
        url_list = ...  # some db read logic
        for url in url_list:
            q.put(url)  # putting newly read URLs into queue

def doRequest(q: queue.Queue):
    while True:
        url = q.get()  # waiting and getting url from queue
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')

            if soup.find("span", {"data-search-type": "Repositories"}).text.strip():  # if a repository count is shown
                sendNotifications({
                    'title': soup.find("input", {"name": "q"})['value'],
                    'repo_count': soup.find("span", {"data-search-type": "Repositories"}).text.strip()
                })
        else:
            print(url, response.status_code)


def sendNotifications(data):
    ...


if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    url_queue = queue.Queue()
    for _ in range(10):  # starts 10 threads
        threading.Thread(target=doRequest, args=(url_queue,)).start()

    threading.Thread(target=get_urls_from_db, args=(url_queue,)).start()

get_urls_from_db keeps reading URLs from the database and adds the current list to url_queue to be processed.

In doRequest, each iteration of the loop now grabs one url from the url_queue and processes it.

One thing to watch out for is adding URLs faster than the threads can process them. The queue will then grow over time and consume a lot of memory.

This is arguably better, since you now have fine-grained control over which URLs are processed and a fixed number of threads.
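One possible way to limit the queue growth mentioned above is to give the queue a maximum size, so that put() blocks until a consumer has taken an item. This is a sketch rather than part of the original answer: producer, consumer, run_bounded, and the None sentinel are assumptions made for illustration, and the request is replaced by a stand-in.

```python
import queue
import threading


def producer(q, urls):
    for url in urls:
        q.put(url)  # blocks while the queue already holds maxsize items
    q.put(None)     # sentinel (assumption): signals that no more work follows


def consumer(q, processed):
    while True:
        url = q.get()
        if url is None:
            break
        processed.append(url)  # stand-in for the request and the parsing


def run_bounded(urls, maxsize=2):
    q = queue.Queue(maxsize=maxsize)  # at most `maxsize` URLs wait in memory
    processed = []
    worker = threading.Thread(target=consumer, args=(q, processed))
    worker.start()
    producer(q, urls)
    worker.join(timeout=5)
    return processed


if __name__ == "__main__":
    print(run_bounded(["u1", "u2", "u3", "u4", "u5"]))
```

With a bounded queue, the thread reading from the database slows down to the pace of the workers instead of piling up URLs.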