I'm very new to web scraping and am working on a personal project to scrape the list of names from the MLB Top 100 Prospects page: https://www.mlb.com/prospects/top100/
My code currently looks like this, after loading in the HTML (I've also tried a variety of other techniques):
from bs4 import BeautifulSoup
import requests

# Parse the html content
soup = BeautifulSoup(html, "lxml")
# Find all name tags:
prospects = soup.find_all("div.prospect-heashot__name")
# Iterate through all name tags
for prospect in prospects:
    # Get text from each tag
    print(prospect.text)
The final result should look something like:
Francisco Alvarez
Gunnar Henderson
Corbin Carroll
Grayson Rodriguez
Anthony Volpe
etc
Any help would be greatly appreciated!
CodePudding user response:
This was a fun problem :) The data is stored inside the page as JSON. You can parse it with the json
module and then search the nested dict for the relevant data (I used recursion for the task):
import re
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.mlb.com/prospects/top100/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# The page state is embedded as JSON in a data-init-state attribute
data = json.loads(soup.select_one("[data-init-state]")["data-init-state"])

# Keys ending in digits: player records and ranking entries
pat1 = re.compile(r"Player:\d+$")
pat2 = re.compile(r"getProspectRankings.*\)\.\d+$")


def get_data(o, pat):
    # Recursively yield (key, value) pairs whose key matches pat
    if isinstance(o, dict):
        for k, v in o.items():
            if pat.search(k):
                yield k, v
            else:
                yield from get_data(v, pat)
    elif isinstance(o, list):
        for v in o:
            yield from get_data(v, pat)


# Map each player key to (useName, boxscoreName)
players = {}
for k, v in get_data(data, pat1):
    players[k] = v["useName"], v["boxscoreName"]

# Pair each rank with the player record it points to
rankings = []
for k, v in get_data(data, pat2):
    rankings.append((v["rank"], players[v["player"]["id"]]))

for rank, (name, surname) in sorted(rankings):
    print("{:>03}. {:<15} {:<15}".format(rank, name, surname))
Prints:
001. Francisco Álvarez, F
002. Gunnar Henderson
003. Corbin Carroll
004. Grayson Rodriguez, G
005. Anthony Volpe
006. Jordan Walker
007. Marcelo Mayer
008. Diego Cartaya
009. Eury Pérez
010. Jackson Chourio
011. Druw Jones
012. Jordan Lawlar
013. Jackson Holliday
014. Elly De La Cruz
015. Daniel Espino
016. Marco Luciano
017. Noelvi Marte, N
018. Brett Baty
019. Henry Davis
020. Taj Bradley
021. Kyle Harrison
022. Robert Hassell III
023. Zac Veen
024. Andrew Painter
025. Triston Casas
026. Bobby Miller
027. Ezequiel Tovar
028. Elijah Green
029. Termarr Johnson
030. Pete Crow-Armstrong
031. George Valera
032. Brooks Lee
033. Ricky Tiedemann
034. James Wood
035. Curtis Mead
036. Josh Jung
037. Kevin Parada
038. Jackson Jobe
039. Jasson Domínguez
040. Colton Cowser
041. Miguel Vargas, M
042. Michael Busch
043. Max Meyer
044. Quinn Priester
045. Jack Leiter
046. Sal Frelick
047. Tyler Soderstrom
048. Brennen Davis, B
049. Jacob Berry
050. Oswald Peraza
051. Masyn Winn
052. Edwin Arroyo
053. Gavin Williams
054. Mick Abel
055. Cade Cavalli
056. Evan Carter
057. Colson Montgomery
058. Royce Lewis
059. Owen White
060. Cam Collier
061. Adael Amador
062. Liover Peguero
063. Drew Romo
064. Logan O'Hoppe
065. Harry Ford
066. Andy Pages
067. Ken Waldichuk
068. Hunter Brown, H
069. Brayan Rocchio
070. Orelvis Martinez
071. Jace Jung
072. Gavin Cross
073. Matt McLain
074. Ryan Pepiot
075. Bo Naylor, B
076. Jordan Westburg
077. Gavin Stone
078. Justin Foscue
079. Gordon Graceffo
080. Matthew Liberatore
081. Carson Williams
082. Austin Wells
083. Jackson Merrill
084. Joey Wiemer
085. Alex Ramirez
086. Kevin Alcantara
087. DL Hall, DL
088. Alec Burleson
089. Brock Porter
090. Brandon Pfaadt
091. Tink Hence
092. Emmanuel Rodriguez, Em
093. Nick Gonzales, N
094. Zack Gelof
095. Oscar Colas
096. Ceddanne Rafaela
097. Endy Rodriguez, E
098. Dylan Lesko
099. Tanner Bibee
100. Wilmer Flores
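If you want to try the recursive search without hitting the site, here is a minimal offline sketch of the same idea. The `sample` dict is invented for illustration and only loosely mimics the shape of the page's embedded JSON; the real keys and nesting may differ and can change whenever the site is updated. `find_matching` is just a renamed copy of `get_data` from the code above.

```python
import re

# Hypothetical sample: keys and values are made up for illustration,
# they are NOT copied from the real page state.
sample = {
    "Player:1": {"useName": "Ann", "boxscoreName": "Able"},
    "Player:2": {"useName": "Bob", "boxscoreName": "Baker"},
    "ROOT_QUERY": {
        'getProspectRankings({"limit":2}).0': {"rank": 2, "player": {"id": "Player:2"}},
        'getProspectRankings({"limit":2}).1': {"rank": 1, "player": {"id": "Player:1"}},
        # This key does not end in digits, so the player pattern skips it
        "unrelated": [{"Player:x": "ignored"}],
    },
}

player_pat = re.compile(r"Player:\d+$")
ranking_pat = re.compile(r"getProspectRankings.*\)\.\d+$")


def find_matching(obj, pattern):
    """Yield (key, value) for every dict key matching pattern, at any depth."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if pattern.search(key):
                yield key, value
            else:
                # Only descend into values whose key did not match
                yield from find_matching(value, pattern)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_matching(item, pattern)


# Collect player records, then resolve each ranking entry to a name
players = dict(find_matching(sample, player_pat))
ranked = sorted(
    (entry["rank"], players[entry["player"]["id"]]["useName"])
    for _, entry in find_matching(sample, ranking_pat)
)
print(ranked)
```

The generator stops descending as soon as a key matches, which is why each player or ranking entry is yielded once as a whole record rather than being searched further.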