How can Beautiful Soup loop through a list of URLs to scrape multiple text fields

Time:09-24

I am attempting to loop through a stored list of URLs to scrape stats about footballers (age, name, club etc).

My list of URLs is stored as playerLinks

playerLinks[:5]
['https://footystats.org/players/england/martyn-waghorn',
 'https://footystats.org/players/norway/stefan-marius-johansen',
 'https://footystats.org/players/england/grady-diangana',
 'https://footystats.org/players/england/jacob-brown',
 'https://footystats.org/players/england/josh-onomah']

If I attempt to scrape an individual link with the following code, I am able to retrieve a result.

testreq = Request('https://footystats.org/players/england/dominic-solanke', headers=headers)
html_test = urlopen(testreq)
testsoup = BeautifulSoup(html_test, "html.parser")
testname = testsoup.find('p','col-lg-7 lh14e').text
print(testname)
#Dominic Solanke

However, when I loop through my list of URLs I receive errors. Below is the code I am using, but to no avail.

names = []
#For each player page...
for i in range(len(playerLinks)):

    reqs2 = Request(playerLinks[i], headers=headers)
    html_page = urlopen(reqs2)
    
    psoup2 = BeautifulSoup(html_page, "html.parser")
    
    for x in psoup2.find('p','col-lg-7 lh14e').text
        names.append(x.get('text'))

Once I fix the name scrape I will need to repeat the process for the other stats. I have pasted the relevant HTML of the page below. Do I need to nest another loop within? At the moment I receive either 'invalid syntax' errors or 'no text object' errors.

"<div class="row cf lightGrayBorderBottom "> <p class="col-lg-5 semi-bold lh14e bbox mild-small">Full Name</p> <p class="col-lg-7 lh14e">Dominic Solanke</p></div>"
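Since each stat sits in a row whose label is a `col-lg-5` paragraph and whose value is the adjacent `col-lg-7` paragraph, one option is to pair them up per row instead of scraping each field separately. A minimal sketch against the HTML fragment above (the `stats` dict name is illustrative):

```python
from bs4 import BeautifulSoup

html = ('<div class="row cf lightGrayBorderBottom "> '
        '<p class="col-lg-5 semi-bold lh14e bbox mild-small">Full Name</p> '
        '<p class="col-lg-7 lh14e">Dominic Solanke</p></div>')

soup = BeautifulSoup(html, "html.parser")

stats = {}
for row in soup.find_all("div", class_="row"):
    # label and value live in sibling <p> tags inside each row div
    label = row.find("p", class_="col-lg-5")
    value = row.find("p", class_="col-lg-7")
    if label and value:
        stats[label.get_text(strip=True)] = value.get_text(strip=True)

print(stats)  # {'Full Name': 'Dominic Solanke'}
```

Run against a full player page, this would collect every label/value pair in one pass, so you would not need a separate loop per stat.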

CodePudding user response:

I'm getting the following output:

Code:

import requests
from bs4 import BeautifulSoup


playerLinks=['https://footystats.org/players/england/martyn-waghorn',
 'https://footystats.org/players/norway/stefan-marius-johansen',
 'https://footystats.org/players/england/grady-diangana',
 'https://footystats.org/players/england/jacob-brown',
 'https://footystats.org/players/england/josh-onomah']

names = []
#For each player page...
for i in range(len(playerLinks)):

    reqs2 = requests.get(playerLinks[i])
    
    
    psoup2 = BeautifulSoup(reqs2.content, "html.parser")
    
    for x in psoup2.find_all('p','col-lg-7 lh14e'):
        names.append(x.text)
        print(names)
#print(names)

Output:

['Martyn Waghorn']
['Martyn Waghorn', 'England']
['Martyn Waghorn', 'England', 'Forward']
['Martyn Waghorn', 'England', 'Forward', '31 (23 January 1990)']
['Martyn Waghorn', 'England', 'Forward', '31 (23 January 1990)', '71st / 300 players']

CodePudding user response:

Some of these links return blank content. I suspect you need to be logged in and/or have a paid subscription (they do offer an API, but the free tier only allows 1 league).

But to correct that output, move your print statement to the end instead of printing every time you append to the list:

import requests
from bs4 import BeautifulSoup


playerLinks=['https://footystats.org/players/england/martyn-waghorn',
 'https://footystats.org/players/norway/stefan-marius-johansen',
 'https://footystats.org/players/england/grady-diangana',
 'https://footystats.org/players/england/jacob-brown',
 'https://footystats.org/players/england/josh-onomah']

names = []
#For each player page...
for link in playerLinks:
    reqs2 = requests.get(link)
    
    
    psoup2 = BeautifulSoup(reqs2.content, "html.parser")
    
    for x in psoup2.find_all('p','col-lg-7 lh14e'):
        names.append(x.text)
print(names)
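On those blank pages, `find` returns `None`, and calling `.text` on it raises an `AttributeError` (likely the "no text object" error mentioned in the question). A small guard avoids that; a sketch using local HTML strings to stand in for a populated page and a blank one (the `first_stat` helper is illustrative, not part of the original code):

```python
from bs4 import BeautifulSoup

def first_stat(html):
    """Return the text of the first col-lg-7 value, or None on a blank page."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("p", "col-lg-7 lh14e")
    return tag.get_text(strip=True) if tag else None

print(first_stat('<p class="col-lg-7 lh14e">Martyn Waghorn</p>'))  # Martyn Waghorn
print(first_stat("<html><body></body></html>"))                    # None
```

In the loop, appending only when the helper returns a value keeps `names` free of entries from blank pages.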