How do I return the list of names using webscraping that I'm looking for?

Time:10-18

I'm very new to web scraping and am working on a personal project where I scrape the list of names from the MLB Top 100 Prospects page: https://www.mlb.com/prospects/top100/

Currently my code looks like the following once the HTML is loaded (although I've tried a variety of different techniques):

from bs4 import BeautifulSoup
import requests

# Parse the HTML content
soup = BeautifulSoup(html, "lxml")

# Find all name tags
prospects = soup.find_all("div.prospect-heashot__name")

# Iterate through all name tags
for prospect in prospects:
    # Get the text from each tag
    print(prospect.text)

Final result should look something like:

Francisco Alvarez
Gunnar Henderson
Corbin Carroll
Grayson Rodriguez
Anthony Volpe
etc

Any help would be greatly appreciated!

Answer:

This was a fun problem :) The data is stored inside the page as JSON. You can parse it with the json module and then search the nested dict for the relevant data (I used recursion for the task):

import re
import json
import requests
from bs4 import BeautifulSoup


url = "https://www.mlb.com/prospects/top100/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

data = json.loads(soup.select_one("[data-init-state]")["data-init-state"])

pat1 = re.compile(r"Player:\d+$")
pat2 = re.compile(r"getProspectRankings.*\)\.\d+$")


def get_data(o, pat):
    if isinstance(o, dict):
        for k, v in o.items():
            if pat.search(k):
                yield k, v
            else:
                yield from get_data(v, pat)
    elif isinstance(o, list):
        for v in o:
            yield from get_data(v, pat)


players = {}
for k, v in get_data(data, pat1):
    players[k] = v["useName"], v["boxscoreName"]

rankings = []
for k, v in get_data(data, pat2):
    rankings.append((v["rank"], players[v["player"]["id"]]))

for rank, (name, surname) in sorted(rankings):
    print("{:>03}. {:<15} {:<15}".format(rank, name, surname))

Prints:

001. Francisco       Álvarez, F     
002. Gunnar          Henderson      
003. Corbin          Carroll        
004. Grayson         Rodriguez, G   
005. Anthony         Volpe          
006. Jordan          Walker         
007. Marcelo         Mayer          
008. Diego           Cartaya        
009. Eury            Pérez          
010. Jackson         Chourio        
011. Druw            Jones          
012. Jordan          Lawlar         
013. Jackson         Holliday       
014. Elly            De La Cruz     
015. Daniel          Espino         
016. Marco           Luciano        
017. Noelvi          Marte, N       
018. Brett           Baty           
019. Henry           Davis          
020. Taj             Bradley        
021. Kyle            Harrison       
022. Robert          Hassell III    
023. Zac             Veen           
024. Andrew          Painter        
025. Triston         Casas          
026. Bobby           Miller         
027. Ezequiel        Tovar          
028. Elijah          Green          
029. Termarr         Johnson        
030. Pete            Crow-Armstrong 
031. George          Valera         
032. Brooks          Lee            
033. Ricky           Tiedemann      
034. James           Wood           
035. Curtis          Mead           
036. Josh            Jung           
037. Kevin           Parada         
038. Jackson         Jobe           
039. Jasson          Domínguez      
040. Colton          Cowser         
041. Miguel          Vargas, M      
042. Michael         Busch          
043. Max             Meyer          
044. Quinn           Priester       
045. Jack            Leiter         
046. Sal             Frelick        
047. Tyler           Soderstrom     
048. Brennen         Davis, B       
049. Jacob           Berry          
050. Oswald          Peraza         
051. Masyn           Winn           
052. Edwin           Arroyo         
053. Gavin           Williams       
054. Mick            Abel           
055. Cade            Cavalli        
056. Evan            Carter         
057. Colson          Montgomery     
058. Royce           Lewis          
059. Owen            White          
060. Cam             Collier        
061. Adael           Amador         
062. Liover          Peguero        
063. Drew            Romo           
064. Logan           O'Hoppe        
065. Harry           Ford           
066. Andy            Pages          
067. Ken             Waldichuk      
068. Hunter          Brown, H       
069. Brayan          Rocchio        
070. Orelvis         Martinez       
071. Jace            Jung           
072. Gavin           Cross          
073. Matt            McLain         
074. Ryan            Pepiot         
075. Bo              Naylor, B      
076. Jordan          Westburg       
077. Gavin           Stone          
078. Justin          Foscue         
079. Gordon          Graceffo       
080. Matthew         Liberatore     
081. Carson          Williams       
082. Austin          Wells          
083. Jackson         Merrill        
084. Joey            Wiemer         
085. Alex            Ramirez        
086. Kevin           Alcantara      
087. DL              Hall, DL       
088. Alec            Burleson       
089. Brock           Porter         
090. Brandon         Pfaadt         
091. Tink            Hence          
092. Emmanuel        Rodriguez, Em  
093. Nick            Gonzales, N    
094. Zack            Gelof          
095. Oscar           Colas          
096. Ceddanne        Rafaela        
097. Endy            Rodriguez, E   
098. Dylan           Lesko          
099. Tanner          Bibee          
100. Wilmer          Flores         
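
To see how the recursive search behaves without hitting the network, here is a minimal sketch on a made-up dictionary. The `sample` structure and its container names (`apolloCache`, `ROOT_QUERY`, the `limit` argument) are assumptions for illustration only and are not the page's actual layout; the key shapes and the `useName`, `boxscoreName`, `rank`, and `player` fields follow what the answer's code looks for.

```python
import re

# Hypothetical, heavily simplified stand-in for the parsed page state.
# The real structure on the site is much larger and may differ.
sample = {
    "apolloCache": {
        "Player:1": {"useName": "Francisco", "boxscoreName": "Álvarez, F"},
        "Player:2": {"useName": "Gunnar", "boxscoreName": "Henderson"},
        "ROOT_QUERY": {
            # Deliberately out of order, to show that sorting by rank fixes it
            'getProspectRankings({"limit":2}).1': {"rank": 2, "player": {"id": "Player:2"}},
            'getProspectRankings({"limit":2}).0': {"rank": 1, "player": {"id": "Player:1"}},
        },
    }
}

pat_player = re.compile(r"Player:\d+$")
pat_ranking = re.compile(r"getProspectRankings.*\)\.\d+$")


def get_data(o, pat):
    """Yield (key, value) for every dict key matching pat, at any depth."""
    if isinstance(o, dict):
        for k, v in o.items():
            if pat.search(k):
                yield k, v  # match found: do not descend further
            else:
                yield from get_data(v, pat)
    elif isinstance(o, list):
        for v in o:
            yield from get_data(v, pat)
    # Strings, numbers, None, etc. are leaves and yield nothing


# Map each player key to a (first name, box-score name) tuple
players = {k: (v["useName"], v["boxscoreName"]) for k, v in get_data(sample, pat_player)}

# Join each ranking entry to its player and sort by rank
rankings = sorted(
    (v["rank"], players[v["player"]["id"]]) for _, v in get_data(sample, pat_ranking)
)

lines = ["{:>03}. {:<15} {:<15}".format(rank, name, surname)
         for rank, (name, surname) in rankings]
print("\n".join(lines))
```

Because the generator stops descending as soon as a key matches, nested keys inside a matched value are never searched, which is fine here since player and ranking records are not nested inside each other.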