I'm very new to Python and think I'm 95% there on this one, but truly can't figure out what could be wrong while troubleshooting:
I'm looking to loop through 50,000 URLs, where the only thing that changes in each URL is the final number.
Essentially making links like this:
"https://basketball.realgm.com/player/Carmelo-Anthony/Summary/1"
"https://basketball.realgm.com/player/Carmelo-Anthony/Summary/2"
"https://basketball.realgm.com/player/Carmelo-Anthony/Summary/3"
My next thought was to make a working loop, just to ensure I can do it correctly:
for tag in range(0, 4):
    resp = "https://basketball.realgm.com/player/Carmelo-Anthony/Summary/" + str(tag)
    print(resp)
Based on the output, this seems to create the exact links I want.
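One detail worth checking in the loop above: `range(0, 4)` yields 0, 1, 2, 3, so the first link ends in `/0` rather than `/1`. Below is a minimal sketch of a URL builder, assuming the pages are numbered from 1 as in the example links. The helper name `summary_urls` is made up for illustration.

```python
BASE_URL = "https://basketball.realgm.com/player/Carmelo-Anthony/Summary/"


def summary_urls(first, last):
    """Yield one summary URL per number from first to last, inclusive."""
    # range() excludes its stop value, so add 1 to include `last`.
    for number in range(first, last + 1):
        yield f"{BASE_URL}{number}"


for url in summary_urls(1, 3):
    print(url)
```

This prints the three example links, ending in `/1`, `/2` and `/3`.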
I then wanted to merge it with the code that seemed to scrape all href links from a given list of URLs (final code below):
import requests
from bs4 import BeautifulSoup

profiles = []

for tag in range(0, 50000):
    resp = "https://basketball.realgm.com/player/Carmelo-Anthony/Summary/" + str(tag)
    urls = [
        resp
    ]

    for url in urls:
        req = requests.get(url)
        soup = BeautifulSoup(req.text, 'html.parser')

        for profile in soup.find('div', class_="profile-box").select('.half-column-left > p > a'):
            profile = profile.get('href')
            profiles.append(profile)

    # print(profiles)
    for p in profiles:
        if p.startswith('https'):
            print(tag, profile)
My confusion then stems from the fact it doesn't ALWAYS work. If I change the range to (0, 7), I do see results.
I did some exploring and saw that the URL below returns a 404 error:
https://basketball.realgm.com/player/Carmelo-Anthony/Summary/8
I figured it should just skip broken links -- I added in an "else" statement, but my results still weren't correct.
Is there something I'm doing wrong here?
CodePudding user response:
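You were calling .select() on the result of soup.find(), which is None whenever the page has no div with the class profile-box (as appears to happen on the 404 page). Calling a method on None raises an AttributeError, which stops the whole loop.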
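To see the failure in isolation, here is a small sketch that runs the same lookup on two inline HTML strings instead of live pages. Both snippets are invented for illustration (the second stands in for an error page), and `profile_links` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Two made-up pages: one with the expected markup, one without it.
good_html = """
<div class="profile-box">
  <div class="half-column-left">
    <p><a href="https://example.com/team">Team</a></p>
  </div>
</div>
"""
missing_html = "<html><body><h1>Page not found</h1></body></html>"


def profile_links(html):
    """Return the hrefs inside the profile box, or [] if there is no box."""
    soup = BeautifulSoup(html, "html.parser")
    box = soup.find("div", class_="profile-box")
    if box is None:  # find() returns None when nothing matches
        return []
    return [a.get("href") for a in box.select(".half-column-left > p > a")]


print(profile_links(good_html))     # ['https://example.com/team']
print(profile_links(missing_html))  # []
```

Without the `is None` check, the second call would fail with `AttributeError: 'NoneType' object has no attribute 'select'`.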
Try this:
import requests
from bs4 import BeautifulSoup

profiles = []

for page in range(1, 50000):
    req = requests.get("https://basketball.realgm.com/player/Carmelo-Anthony/Summary/{page}".format(page=page))
    soup = BeautifulSoup(req.text, 'html.parser')

    # find() returns None when the page has no profile box (e.g. a 404 page)
    element = soup.find('div', class_="profile-box")
    if element is not None:
        for profile in element.select('.half-column-left > p > a'):
            profiles.append(profile.get('href'))

print(profiles)
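Since the loop above sends tens of thousands of requests, it may also help to check the HTTP status before parsing, set a timeout, and pause between requests. The following is only a sketch under assumptions: the one-second delay and ten-second timeout are arbitrary, `scrape_profiles` is a hypothetical name, and the `get` parameter exists only so the function can be tried with a stand-in instead of the network.

```python
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://basketball.realgm.com/player/Carmelo-Anthony/Summary/{page}"


def scrape_profiles(pages, get=requests.get, delay=1.0):
    """Collect (page, href) pairs, skipping pages that fail or lack the box."""
    profiles = []
    for page in pages:
        time.sleep(delay)  # assumed pause between requests; adjust as needed
        try:
            resp = get(BASE_URL.format(page=page), timeout=10)
        except requests.RequestException as exc:
            print(f"{page}: request failed ({exc})")
            continue
        if resp.status_code != 200:  # e.g. the 404 reported for page 8
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        box = soup.find("div", class_="profile-box")
        if box is None:  # page loaded but has no profile box
            continue
        for a in box.select(".half-column-left > p > a"):
            profiles.append((page, a.get("href")))
    return profiles
```

Calling `scrape_profiles(range(1, 8))` would cover the pages that reportedly work. Keeping the page number next to each link makes it easier to tell which page a result came from.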