How to change URL for beautifulsoup scraper every time the program runs (without doing it manually)?


I have the following code to scrape Reddit usernames:

    from bs4 import BeautifulSoup
    from requests import get
    from fake_useragent import UserAgent
    
    ua = UserAgent()
    
    
    def lovely_soup(u):
        r = get(u, headers={'User-Agent': ua.chrome})
        return BeautifulSoup(r.text, 'lxml')
    
    
    url = 'https://old.reddit.com/r/aww'
    soup = lovely_soup(url)
    
    titles = soup.find_all('a', {'class': 'author'})
    
    for title in titles:
        print(title.text)

But I have a LOOOONG list of URLs that I would like to scrape Reddit usernames from. I would really like to avoid replacing the URL manually between runs. What would be a way to instead have it replace the URL each time it runs (using a list of URLs that I provide it), and just auto-run until it runs out of URLs?

I'm running this in a virtual environment on PyCharm if that matters. Thank you.

I tried doing it manually but it quickly became exhausting.

CodePudding user response:

I would recommend iterating over the URLs. For example:

    for url in urls:
        soup = lovely_soup(url)
        titles = soup.find_all('a', {'class': 'author'})

        for title in titles:
            print(title.text)

where `urls` is your list of all the URLs, e.g. `['https://old.reddit.com/r/aww', 'https://old.reddit.com/r/funny', ...]`. Note that `requests` needs the scheme (`https://`) on each URL; a bare `www.example.com` would raise a `MissingSchema` error.
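The extraction step inside the loop is just BeautifulSoup pulling every `<a class="author">` tag out of a page. A minimal sketch of that step on canned HTML (hypothetical data, using the stdlib `html.parser` so it runs without `lxml` or any network access):

    from bs4 import BeautifulSoup

    # Canned HTML standing in for a fetched Reddit listing page (hypothetical data).
    html = '''
    <div>
      <a class="author" href="/user/alice">alice</a>
      <a class="author" href="/user/bob">bob</a>
    </div>
    '''

    soup = BeautifulSoup(html, 'html.parser')

    # Same selection the scraper uses: every <a> tag with class "author".
    usernames = [a.text for a in soup.find_all('a', {'class': 'author'})]
    print(usernames)  # ['alice', 'bob']

Swapping the canned string for `r.text` from a real request gives exactly the behaviour of the loop above.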

The above prints the usernames for each URL as it goes. You could modify it slightly to collect them in a set (which also removes duplicates across subreddits) and print them all at the end:

    authors = set()
    for url in urls:
        soup = lovely_soup(url)
        titles = soup.find_all('a', {'class': 'author'})

        for title in titles:
            authors.add(title.text)

    print(authors)
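To avoid editing the code between runs at all, you could also keep the URLs in a plain text file and read them in at startup. A sketch, assuming a hypothetical `urls.txt` with one URL per line (the demo writes the file itself so the snippet is self-contained):

    from pathlib import Path

    # For the demo only: create the file in place. Normally you would
    # maintain urls.txt by hand, one URL per line.
    Path('urls.txt').write_text(
        'https://old.reddit.com/r/aww\n'
        'https://old.reddit.com/r/funny\n'
    )

    # Read one URL per line, skipping any blank lines.
    urls = [line.strip()
            for line in Path('urls.txt').read_text().splitlines()
            if line.strip()]
    print(urls)  # ['https://old.reddit.com/r/aww', 'https://old.reddit.com/r/funny']

From there, the loop above runs unchanged over `urls`, and adding a new subreddit is just a new line in the file.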