I have a scraping project to do, but I have a problem with my requests. The goal is to collect information on NFL players, but since my request spans several web pages, I have trouble concatenating the information. Here is my code:
import requests
import re
import pdb
import pickle
request_headers={'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0" }
url='https://www.spotrac.com/nfl/valuation/2021'
complement=['/quarterback/','/kicker/']
for c in complement:
    req=requests.get(url+c,headers=request_headers,timeout=10)
    content=req.text
    pattern = 'a href="https://www.spotrac.com/redirect/player/(.+?(?=/">))'
    output=re.findall(pattern,content)
    with open('Identifiant','ab') as my_file:
        pickle.dump(output,my_file)

for identifiant in output:
    urlfiche = "https://www.spotrac.com/redirect/player/" + identifiant
    req = requests.get(urlfiche,headers=request_headers)
    content = req.text
    pdb.set_trace()
My problem is the following: when I run
print(output)
I only get the player IDs from the last list (the kickers), whereas I would like to keep all of them (as in my "ID" object). I tried creating an empty list for the output and using the append function, but then I could not concatenate strings with lists.
Does anyone have a solution?
Sorry for my English, I'm French :)
CodePudding user response:
To get one list of all the ids, just create an empty list above the for loop and extend it with the scraped information on every iteration:
data.extend(re.findall(pattern,content))
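Note that extend is the right call here: append would add each page's result as a single nested list, while extend adds the items themselves, which is why your append attempt left you trying to concatenate strings with lists. A quick illustration (plain Python, nothing scraped):
ids = []
ids.append(['123', '456'])   # -> [['123', '456']]  one nested list per page
ids = []
ids.extend(['123', '456'])   # -> ['123', '456']    a flat list of ids
ids.extend(['789'])          # -> ['123', '456', '789']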
Example
import requests
import re
request_headers={'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0" }
url='https://www.spotrac.com/nfl/valuation/2021'
complement=['/quarterback/','/running-back/','/wide-receiver/','/tight-end/','/left-tackle/','/right-tackle/','/center/','/guard/','/defensive-tackle/','/defensive-end/','/outside-linebacker/','/inside-linebacker/','/cornerback/','/free-safety/','/strong-safety/','/kicker/']
data = []
for c in complement:
    req=requests.get(url+c,headers=request_headers,timeout=10)
    content=req.text
    pattern = 'a href="https://www.spotrac.com/redirect/player/(.+?(?=/">))'
    data.extend(re.findall(pattern,content))

print(data)
Output
['47599', '47594', '47648', '29036', '4619', '25127', '3745', '25102', '9915', '6078', '72395', '14441', '29041', '19089', '9818', '21751', '47598', '14445', '17249', '14472', '9885', '25096', '72447', '72391', '3983', '18950', '25098', '72380', '72381', '3595', '18949', '47636', '29123', '21847', '72578', '47657', '22247', '12309', '21745', '29059', '72404', '29109', '72415', '21789', '29106', '25134', '48079', '72491', '47661', '21924', '16739', '29089', '25235', '29110', '25573', '14514', '47630', '29282', '25126', '72510', '7743', '19090', '12472', '21782', '18952', '21809', '29234', '25097', '16852',...]
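If you also want to merge the ids you already saved to disk, note that your original loop opened 'Identifiant' in 'ab' mode and dumped one list per page, so the file holds several pickles back to back. A small sketch for reading them all back, assuming the file was written by your original loop:
import pickle

ids = []
with open('Identifiant', 'rb') as my_file:
    while True:
        try:
            ids.extend(pickle.load(my_file))  # each dump() wrote one list of ids
        except EOFError:                      # raised once every pickle has been read
            break
print(ids)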