How to change URL for beautifulsoup scraper every time the program runs (without doing it manually)?


I have the following code to scrape Reddit usernames:

    from bs4 import BeautifulSoup
    from requests import get
    from fake_useragent import UserAgent
    
    ua = UserAgent()
    
    
    def lovely_soup(u):
        r = get(u, headers={'User-Agent': ua.chrome})
        return BeautifulSoup(r.text, 'lxml')
    
    
    url = 'https://old.reddit.com/r/aww'
    soup = lovely_soup(url)
    
    titles = soup.find_all('a', {'class': 'author'})
    
    for title in titles:
        print(title.text)

But I have a LOOOONG list of URLs that I would like to scrape Reddit usernames from. I would really like to avoid replacing the URL manually between runs. What would be a way to instead have it replace the URL each time it runs (using a list of URLs that I provide it), and just auto-run until it runs out of URLs?

I'm running this in a virtual environment on PyCharm if that matters. Thank you.

I tried doing it manually but it quickly became exhausting.

CodePudding user response:

I would recommend iterating over the URLs. For example:

    for url in urls:
        soup = lovely_soup(url)
        titles = soup.find_all('a', {'class': 'author'})

        for title in titles:
            print(title.text)

where `urls` is your list of all the URLs, e.g. `['https://old.reddit.com/r/aww', 'https://old.reddit.com/r/funny', ...]`. Note that `requests` needs the scheme (`https://`) on each URL; a bare `www.example.com` would raise a `MissingSchema` error.
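The extraction step inside the loop is just BeautifulSoup pulling every `<a class="author">` tag out of a page. A minimal sketch of that step on canned HTML (hypothetical data, using the stdlib `html.parser` so it runs without `lxml` or any network access):

    from bs4 import BeautifulSoup

    # Canned HTML standing in for a fetched Reddit listing page (hypothetical data).
    html = '''
    <div>
      <a class="author" href="/user/alice">alice</a>
      <a class="author" href="/user/bob">bob</a>
    </div>
    '''

    soup = BeautifulSoup(html, 'html.parser')

    # Same selection the scraper uses: every <a> tag with class "author".
    usernames = [a.text for a in soup.find_all('a', {'class': 'author'})]
    print(usernames)  # ['alice', 'bob']

Swapping the canned string for `r.text` from a real request gives exactly the behaviour of the loop above.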

The above prints the usernames for each URL as it goes. You could modify it slightly to collect them in a set (which also removes duplicates across subreddits) and print them all at the end:

    authors = set()
    for url in urls:
        soup = lovely_soup(url)
        titles = soup.find_all('a', {'class': 'author'})

        for title in titles:
            authors.add(title.text)

    print(authors)
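To avoid editing the code between runs at all, you could also keep the URLs in a plain text file and read them in at startup. A sketch, assuming a hypothetical `urls.txt` with one URL per line (the demo writes the file itself so the snippet is self-contained):

    from pathlib import Path

    # For the demo only: create the file in place. Normally you would
    # maintain urls.txt by hand, one URL per line.
    Path('urls.txt').write_text(
        'https://old.reddit.com/r/aww\n'
        'https://old.reddit.com/r/funny\n'
    )

    # Read one URL per line, skipping any blank lines.
    urls = [line.strip()
            for line in Path('urls.txt').read_text().splitlines()
            if line.strip()]
    print(urls)  # ['https://old.reddit.com/r/aww', 'https://old.reddit.com/r/funny']

From there, the loop above runs unchanged over `urls`, and adding a new subreddit is just a new line in the file.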