I am attempting to loop through a stored list of URLs to scrape stats about footballers (age, name, club, etc.).
My list of URLs is stored as playerLinks:
playerLinks[:5]
['https://footystats.org/players/england/martyn-waghorn',
'https://footystats.org/players/norway/stefan-marius-johansen',
'https://footystats.org/players/england/grady-diangana',
'https://footystats.org/players/england/jacob-brown',
'https://footystats.org/players/england/josh-onomah']
If I attempt to scrape an individual link with the following code, I am able to retrieve a result.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# headers is a dict of request headers (e.g. a User-Agent) defined earlier
testreq = Request('https://footystats.org/players/england/dominic-solanke', headers=headers)
html_test = urlopen(testreq)
testsoup = BeautifulSoup(html_test, "html.parser")
testname = testsoup.find('p', 'col-lg-7 lh14e').text
print(testname)
#Dominic Solanke
However, when I loop through my list of URLs, I receive errors. Below is the code I am using, but to no avail.
names = []
#For each player page...
for i in range(len(playerLinks)):
    reqs2 = Request(playerLinks[i], headers=headers)
    html_page = urlopen(reqs2)
    psoup2 = BeautifulSoup(html_page, "html.parser")
    for x in psoup2.find('p','col-lg-7 lh14e').text
        names.append(x.get('text'))
Once I fix the name scrape, I will need to repeat the process for other stats. I have pasted the relevant html of the page below. Do I need to nest another loop within the existing one? At the moment I receive either 'invalid syntax' errors or 'no text object' errors.
"<div class="row cf lightGrayBorderBottom "> <p class="col-lg-5 semi-bold lh14e bbox mild-small">Full Name</p> <p class="col-lg-7 lh14e">Dominic Solanke</p></div>"
CodePudding user response:
I'm getting the following output:
Code:
import requests
from bs4 import BeautifulSoup

playerLinks = ['https://footystats.org/players/england/martyn-waghorn',
               'https://footystats.org/players/norway/stefan-marius-johansen',
               'https://footystats.org/players/england/grady-diangana',
               'https://footystats.org/players/england/jacob-brown',
               'https://footystats.org/players/england/josh-onomah']

names = []
#For each player page...
for i in range(len(playerLinks)):
    reqs2 = requests.get(playerLinks[i])
    psoup2 = BeautifulSoup(reqs2.content, "html.parser")
    for x in psoup2.find_all('p', 'col-lg-7 lh14e'):
        names.append(x.text)
        print(names)
#print(names)
Output:
['Martyn Waghorn']
['Martyn Waghorn', 'England']
['Martyn Waghorn', 'England', 'Forward']
['Martyn Waghorn', 'England', 'Forward', '31 (23 January 1990)']
['Martyn Waghorn', 'England', 'Forward', '31 (23 January 1990)', '71st / 300 players']
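If you want the other stats as well (age, club, and so on), a rough sketch along these lines may help: it pairs each row's label with its value. The 'lightGrayBorderBottom', 'col-lg-5' and 'col-lg-7' class names are taken from the html snippet in your question, and I'm assuming they are consistent across rows and pages:

import requests
from bs4 import BeautifulSoup

stats = []
for link in playerLinks:
    psoup = BeautifulSoup(requests.get(link).content, "html.parser")
    player = {}
    # Each stat row holds a label <p> (col-lg-5 ...) and a value <p> (col-lg-7 ...)
    for row in psoup.find_all('div', class_='lightGrayBorderBottom'):
        label = row.find('p', class_='col-lg-5')
        value = row.find('p', class_='col-lg-7')
        if label and value:
            player[label.text.strip()] = value.text.strip()
    stats.append(player)

print(stats)

Each dictionary should then map labels like 'Full Name' to their values, so you don't need a separate loop per stat.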
CodePudding user response:
Some of these links return blank content. I suspect you need to either be logged in and/or be paying for a subscription (they do offer an API, but the free tier only allows one league).
But to correct that output, move your print statement to the end instead of printing every time you append to the list:
import requests
from bs4 import BeautifulSoup

playerLinks = ['https://footystats.org/players/england/martyn-waghorn',
               'https://footystats.org/players/norway/stefan-marius-johansen',
               'https://footystats.org/players/england/grady-diangana',
               'https://footystats.org/players/england/jacob-brown',
               'https://footystats.org/players/england/josh-onomah']

names = []
#For each player page...
for link in playerLinks:
    reqs2 = requests.get(link)
    psoup2 = BeautifulSoup(reqs2.content, "html.parser")
    for x in psoup2.find_all('p', 'col-lg-7 lh14e'):
        names.append(x.text)

print(names)
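For the links that come back blank, you could also try sending the same kind of headers you used in your working single-link request and skipping pages where the tag is missing. This is an untested sketch (the User-Agent value is just an example), and it won't help if the data really does sit behind a login or subscription:

import requests
from bs4 import BeautifulSoup

# Example request headers only; a login/subscription wall won't be bypassed by this.
headers = {'User-Agent': 'Mozilla/5.0'}

names = []
for link in playerLinks:
    resp = requests.get(link, headers=headers)
    tags = BeautifulSoup(resp.content, "html.parser").find_all('p', 'col-lg-7 lh14e')
    if not tags:
        # Blank or blocked page: note it and move on instead of failing.
        print('No player data found at', link)
        continue
    names.append(tags[0].text)

print(names)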