webscraping with python : append informations

Time:02-22

I have a scraping project to do, but I have a problem with my request. The goal is to collect information on NFL players, but since my request covers several web pages, I have trouble concatenating my results. Here is my code:

import requests
import re
import pdb
import pickle

request_headers={'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0" }

url='https://www.spotrac.com/nfl/valuation/2021'
complement=['/quarterback/','/kicker/']

for c in complement:
    req = requests.get(url + c, headers=request_headers, timeout=10)
    content = req.text

    pattern = 'a href="https://www.spotrac.com/redirect/player/(.+?(?=/">))'
    output = re.findall(pattern, content)

    with open('Identifiant', 'ab') as my_file:
        pickle.dump(output, my_file)

    for identifiant in output:
        urlfiche = "https://www.spotrac.com/redirect/player/" + identifiant
        req = requests.get(urlfiche, headers=request_headers)
        content = req.text

pdb.set_trace()

My problem is the following: When I run the command

print(output)

I only get the player IDs from the last page (the kickers), whereas I would like to keep all of them (as in my "Identifiant" file). I tried creating an empty list and appending output to it, but it is impossible to concatenate strings with lists.

Does anyone have a solution?

Sorry for my English, I'm French :)

CodePudding user response:

To get one list of all IDs, create an empty list above the for loop and extend it with the scraped information on every iteration:

data.extend(re.findall(pattern,content))
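The difference matters here: append would nest each page's list inside the result as a single element, while extend adds the items themselves. A minimal illustration (the IDs are just example values):

```python
page_ids = ['47599', '47594']  # example IDs scraped from one page

nested = []
nested.append(page_ids)   # appends the whole list as one element
# nested is now [['47599', '47594']]

flat = []
flat.extend(page_ids)     # adds each ID individually
# flat is now ['47599', '47594']
```

This is why appending the output of re.findall directly would give you a list of lists instead of one flat list of IDs.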

Example

import requests
import re

request_headers={'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0" }

url='https://www.spotrac.com/nfl/valuation/2021'
complement=['/quarterback/','/running-back/','/wide-receiver/','/tight-end/','/left-tackle/','/right-tackle/','/center/','/guard/','/defensive-tackle/','/defensive-end/','/outside-linebacker/','/inside-linebacker/','/cornerback/','/free-safety/','/strong-safety/','/kicker/']

data = []

for c in complement:
    req = requests.get(url + c, headers=request_headers, timeout=10)
    content = req.text
    pattern = 'a href="https://www.spotrac.com/redirect/player/(.+?(?=/">))'
    data.extend(re.findall(pattern, content))
    
print(data)

Output

['47599', '47594', '47648', '29036', '4619', '25127', '3745', '25102', '9915', '6078', '72395', '14441', '29041', '19089', '9818', '21751', '47598', '14445', '17249', '14472', '9885', '25096', '72447', '72391', '3983', '18950', '25098', '72380', '72381', '3595', '18949', '47636', '29123', '21847', '72578', '47657', '22247', '12309', '21745', '29059', '72404', '29109', '72415', '21789', '29106', '25134', '48079', '72491', '47661', '21924', '16739', '29089', '25235', '29110', '25573', '14514', '47630', '29282', '25126', '72510', '7743', '19090', '12472', '21782', '18952', '21809', '29234', '25097', '16852',...]
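If, as in the question, you also want to persist the IDs with pickle, a single dump of the combined list after the loop is simpler than appending one dump per page, because reading it back is then a single load. A sketch, reusing the 'Identifiant' filename from the question and a hard-coded sample list in place of the scraped data:

```python
import pickle

# Stand-in for the combined list built by the loop above
data = ['47599', '47594', '47648']

# One dump of the full list, instead of one dump per page in append mode
with open('Identifiant', 'wb') as my_file:
    pickle.dump(data, my_file)

# Reading it back is then a single load
with open('Identifiant', 'rb') as my_file:
    loaded = pickle.load(my_file)
```

With the original per-page dumps in 'ab' mode, you would instead have to call pickle.load repeatedly until EOF to recover all the pages.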