I'm trying to scrape this URL: https://baloncestoenvivo.feb.es/partido/2218269
I want to get all the divs with the class "box-datos-partido". When I try to get them with:
soup.find_all("div", class_="box-datos-partido")
I get only one of the two divs that are on the web page: the returned list has a single element. The content of this element is:
<div >
<div >
<span >Fecha</span>
<span >31/10/2021 - 12:00</span>
</div>
<div >
<span >Árbitros</span>
<span >DIAZ DE SARRALDE MARTIN, IÑIGO</span>
<span >SANCHEZ NUÑEZ, UNAI</span>
<span ></span>
</div>
<div >
<span >Pista</span>
<span >POLIDEPORTIVO URRETA</span>
<span >Galdakao (Vizcaya)</span>
</div>
</div>
I should be receiving a list with two elements, with the following content:
<div >
<div >
<span >Fecha</span>
<span >31-10-2021 - 12:00</span>
</div>
<div >
<span >Árbitros</span>
<span >DIAZ DE SARRALDE MARTIN, IÑIGO</span><span >SANCHEZ NUÑEZ, UNAI</span><span ></span>
</div>
<div >
<span >Pista</span>
<span >POLIDEPORTIVO URRETA</span><span >BIZKAIA KALEA, S/N, Vizcaya (Galdakao)</span>
</div>
</div>
<div >
<div >
<span >Fecha</span>
<span >31/10/2021 - 12:00</span>
</div>
<div >
<span >Árbitros</span>
<span >DIAZ DE SARRALDE MARTIN, IÑIGO</span>
<span >SANCHEZ NUÑEZ, UNAI</span>
<span ></span>
</div>
<div >
<span >Pista</span>
<span >POLIDEPORTIVO URRETA</span>
<span >Galdakao (Vizcaya)</span>
</div>
</div>
How is that possible? What am I doing wrong that I receive only one of the two elements?
CodePudding user response:
You are right that the page has two divs with the class "box-datos-partido". However, if you disable JavaScript in your browser, you will notice that the same selector matches only one of them, because the rest are loaded dynamically by JavaScript. To get all of them, you can use a browser automation tool such as Selenium. Here I use Selenium with bs4 to grab both divs from the rendered HTML.
Example:
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("./chromedriver")  # your chromedriver path
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)

url = "https://baloncestoenvivo.feb.es/partido/2218269"
driver.get(url)
driver.maximize_window()
time.sleep(5)  # fixed wait so the JavaScript-loaded content can render

# Parse the rendered page source, not the raw server response
soup = BeautifulSoup(driver.page_source, "lxml")
for card in soup.select("div.box-datos-partido"):
    print(card.prettify())

driver.quit()  # close the browser when done
Output:
<div >
<div >
<span >
Fecha
</span>
<span >
31-10-2021 - 12:00
</span>
</div>
<div >
<span >
Árbitros
</span>
<span >
DIAZ DE SARRALDE MARTIN, IÑIGO
</span>
<span >
SANCHEZ NUÑEZ, UNAI
</span>
<span >
</span>
</div>
<div >
<span >
Pista
</span>
<span >
POLIDEPORTIVO URRETA
</span>
<span >
BIZKAIA KALEA, S/N, Vizcaya (Galdakao)
</span>
</div>
</div>
<div >
<div >
<span >
Fecha
</span>
<span >
31/10/2021 - 12:00
</span>
</div>
<div >
<span >
Árbitros
</span>
<span >
DIAZ DE SARRALDE MARTIN, IÑIGO
</span>
<span >
SANCHEZ NUÑEZ, UNAI
</span>
<span >
</span>
</div>
<div >
<span >
Pista
</span>
<span >
POLIDEPORTIVO URRETA
</span>
<span >
Galdakao (Vizcaya)
</span>
</div>
</div>
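The explanation above can be checked without a browser. The following sketch is not part of the original answer: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, and both HTML strings are made-up stand-ins for the real page (the names ClassCounter, count_divs, STATIC_HTML and RENDERED_HTML are hypothetical). It illustrates that markup which exists only inside a script is never seen as an element by a static parser.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count <div> start tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and self.css_class in classes:
            self.count += 1


def count_divs(html, css_class="box-datos-partido"):
    """Return how many matching divs a static parser finds in `html`."""
    parser = ClassCounter(css_class)
    parser.feed(html)
    return parser.count


# Hypothetical server response: one div is in the markup, the other
# would only be created when a browser executes the script.
STATIC_HTML = """
<div class="box-datos-partido"><span>Fecha</span></div>
<div id="placeholder"></div>
<script>
  document.getElementById("placeholder").innerHTML =
    '<div class="box-datos-partido"><span>Fecha</span></div>';
</script>
"""

# Hypothetical DOM after the browser has run the script.
RENDERED_HTML = """
<div class="box-datos-partido"><span>Fecha</span></div>
<div id="placeholder">
  <div class="box-datos-partido"><span>Fecha</span></div>
</div>
"""

print(count_divs(STATIC_HTML))    # 1
print(count_divs(RENDERED_HTML))  # 2
```

This mirrors the difference between parsing the raw response and parsing driver.page_source after the page has rendered.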
CodePudding user response:
The data you see is loaded via JavaScript from an external URL. To load it, you can use the requests module (this example loads the players into two pandas DataFrames):
import json  # only needed for the optional debug print below
import pandas as pd
import requests
headers = {
"Authorization": "Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6ImQzOWE5MzlhZTQyZmFlMTM5NWJjODNmYjcwZjc1ZDc3IiwidHlwIjoiSldUIn0.eyJuYmYiOjE2NTkyNjM1MDUsImV4cCI6MTY1OTM0OTkwNSwiaXNzIjoiaHR0cHM6Ly9pbnRyYWZlYi5mZWIuZXMvaWRlbnRpdHkuYXBpIiwiYXVkIjpbImh0dHBzOi8vaW50cmFmZWIuZmViLmVzL2lkZW50aXR5LmFwaS9yZXNvdXJjZXMiLCJsaXZlc3RhdHMuYXBpIl0sImNsaWVudF9pZCI6ImJhbG9uY2VzdG9lbnZpdm9hcHAiLCJpZGFtYml0byI6IjEiLCJyb2xlIjpbIk92ZXJWaWV3IiwiVGVhbVN0YXRzIiwiU2hvdENoYXJ0IiwiUmFua2luZyIsIktleUZhY3RzIiwiQm94U2NvcmUiXSwic2NvcGUiOlsibGl2ZXN0YXRzLmFwaSJdfQ.YDVnzLhZAw8kzE2LLjiS8VZayY-sfUgqMN4zdnjROLImHRamOJ_Htz4ehK26QcpywfZmrD5iUWnFnRFJrJyZdhudOp09B0tmn4HnWs4JHcQBirUpdLi4oDqONctn1J31OktVhHYpAS36Fs-2KTjwHcgR4G-EQsA6vxjkLKYjw6we0oY5w1Q_GUqRmEvfDQY3b2a-VlFEcxMQBS6XFfEL4naSz84w9aW2e7UCnic_Mm4CHzN1RzitcBSiunQyINshQzg-1G4STARAZZjfaVZCP8SDB4bWeuaXYxkwX40vbisJD8mXFP1xN93THlIg-d0LNfZg8iqD0Lx8xRf9nRdXug"
}
url = "https://intrafeb.feb.es/LiveStats.API/api/v1/BoxScore/2218269"
data = requests.get(url, headers=headers).json()
# uncomment to print all data:
# print(json.dumps(data, indent=4))
t1 = data["BOXSCORE"]["TEAM"][0]["PLAYER"]
t2 = data["BOXSCORE"]["TEAM"][1]["PLAYER"]
df1 = pd.DataFrame(t1)
df2 = pd.DataFrame(t2)
print(df1)
print(df2)
Prints:
p1m p1a p1p p2m p2a p2p p3m p3a p3p fgm fga fgp min minFormatted sta bs tc mt ro rd rt rf to st ind pllss val assist reb pf pts inn id no name logo
0 4 6 66,7 0 5 0,0 0 6 0,0 0 11 0,0 1812 30:12 None 0 0 0 0 3 3 5 6 1 None -1 None 1 3 1 4 1 2188507 0 J. ROYALE SACRISTAN https://competiciones.feb.es/estadisticas/Foto.aspx?c=2188507
1 0 0 0,0 0 5 0,0 0 0 0,0 0 5 0,0 1021 17:01 None 0 0 0 1 5 6 0 2 1 None -20 None 0 6 0 0 0 2188508 2 O. ARENAS DE LA HOZ https://competiciones.feb.es/estadisticas/Foto.aspx?c=2188508
2 0 0 0,0 1 2 50,0 0 1 0,0 1 3 33,3 1363 22:43 None 0 0 0 0 2 2 1 2 1 None -4 None 1 2 0 2 0 2277838 4 A. RAMASCO CERECERO https://competiciones.feb.es/estadisticas/Foto.aspx?c=2277838
...
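One caveat about the Authorization header used above: the bearer token is a JWT, and its payload carries an expiry. Decoding the payload of the token in this answer gives "nbf": 1659263505 and "exp": 1659349905, so it was valid for 24 hours and expired on 1 August 2022 (UTC); the API will presumably reject it now. The sketch below shows one way to inspect such a token with the standard library only. The helper names (jwt_payload, is_expired) and the demo token are hypothetical; the demo token is built in the snippet and reuses only the two timestamps above. The payload is decoded without verifying the signature, which is enough for reading the expiry but not for trusting the token.

```python
import base64
import json
import time


def jwt_payload(token):
    """Decode the payload segment of a JWT into a dict (signature NOT verified)."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def is_expired(token, now=None):
    """Return True if the token's `exp` claim is not in the future."""
    now = time.time() if now is None else now
    return jwt_payload(token)["exp"] <= now


def _b64url(obj):
    """Encode a dict as an unpadded base64url JSON segment (demo helper)."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Hypothetical demo token with the same nbf/exp values as the answer's token.
demo_token = ".".join([
    _b64url({"alg": "none", "typ": "JWT"}),
    _b64url({"nbf": 1659263505, "exp": 1659349905}),
    "signature-placeholder",
])

print(jwt_payload(demo_token)["exp"])  # 1659349905
print(is_expired(demo_token))          # True
```

If the token has expired, a fresh one has to be obtained before calling the API, for example by copying it from the request the page itself makes, as shown in the browser's developer tools. That workflow is an assumption; the answer does not say where its token came from.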