I have the following code to scrape Reddit usernames:
from bs4 import BeautifulSoup
from requests import get
from fake_useragent import UserAgent

ua = UserAgent()

def lovely_soup(u):
    r = get(u, headers={'User-Agent': ua.chrome})
    return BeautifulSoup(r.text, 'lxml')

url = 'https://old.reddit.com/r/aww'
soup = lovely_soup(url)

titles = soup.findAll('a', {'class': 'author'})
for title in titles:
    print(title.text)
But I have a LOOOONG list of URLs that I would like to scrape Reddit usernames from, and I would really like to avoid swapping the URL in by hand between runs. What would be a way to have the script take a list of URLs that I provide, move on to the next URL on each pass, and keep running until it runs out of URLs?
I'm running this in a virtual environment on PyCharm if that matters. Thank you.
I tried doing it manually but it quickly became exhausting.
CodePudding user response:
I would recommend iterating over the URLs. For example, you could do the following:
for url in urls:
    soup = lovely_soup(url)
    titles = soup.findAll('a', {'class': 'author'})
    for title in titles:
        print(title.text)
Here urls is your list of all the URLs, e.g. ["www.google.com", "www.bbc.co.uk", ...]
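For instance, you could define the list by hand, or (since your list is long) keep one URL per line in a text file and read it in. The subreddit URLs and the urls.txt filename below are just placeholders, substitute your own:

# Option 1: a hand-written list of old.reddit.com URLs
urls = [
    'https://old.reddit.com/r/aww',
    'https://old.reddit.com/r/pics',
]

# Option 2: read one URL per line from a file (urls.txt is a hypothetical name)
with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]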
Either way, the loop over urls above prints title.text for each URL. You could modify it slightly, as below, to collect the authors and print them all at once at the end:
authors = set()

for url in urls:
    soup = lovely_soup(url)
    titles = soup.findAll('a', {'class': 'author'})
    for title in titles:
        authors.add(title.text)

print(authors)
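Using a set also means a username that appears under several URLs is only stored once. If the list of URLs is really long, it can additionally help to pause briefly between requests and to skip any URL that fails, so one bad page doesn't stop the whole run. A minimal sketch of that idea (the one-second delay is an arbitrary choice on my part, not a documented Reddit limit):

from time import sleep
from requests.exceptions import RequestException

authors = set()

for url in urls:
    try:
        soup = lovely_soup(url)
    except RequestException as e:
        # Skip URLs that fail to load and keep going with the rest
        print(f'Skipping {url}: {e}')
        continue
    for title in soup.findAll('a', {'class': 'author'}):
        authors.add(title.text)
    sleep(1)  # small pause between requests to go easy on the server

print(authors)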