BS4: Doesn't detect all tags with find_all

Time:08-01

I'm trying to scrape this URL: https://baloncestoenvivo.feb.es/partido/2218269

I want to get all the divs with the class "box-datos-partido". When I try to get them with:

soup.find_all("div", class_="box-datos-partido")

I get only one of the two divs that are on the web page: the result is a list with a single element. The content of this element is:

<div >
    <div >
        <span >Fecha</span>
        <span >31/10/2021 - 12:00</span>
    </div>
    <div >
        <span >Árbitros</span>
        <span >DIAZ DE SARRALDE MARTIN, IÑIGO</span>
        <span >SANCHEZ NUÑEZ, UNAI</span>
        <span ></span>
    </div>
    <div >
        <span >Pista</span>
        <span >POLIDEPORTIVO URRETA</span>
        <span >Galdakao (Vizcaya)</span>
    </div>
</div>

But I should receive a list with two elements. The content of these two elements should be:

<div >
    <div >
        <span >Fecha</span>
        <span >31-10-2021 - 12:00</span>
    </div>
    <div >
        <span >Árbitros</span>
        <span >DIAZ DE SARRALDE MARTIN, IÑIGO</span><span >SANCHEZ NUÑEZ, UNAI</span><span ></span>
    </div>
    <div >
        <span >Pista</span>
        <span >POLIDEPORTIVO URRETA</span><span >BIZKAIA KALEA, S/N, Vizcaya (Galdakao)</span>
    </div>
</div>

<div >
    <div >
        <span >Fecha</span>
        <span >31/10/2021 - 12:00</span>
    </div>
    <div >
        <span >Árbitros</span>
        <span >DIAZ DE SARRALDE MARTIN, IÑIGO</span>
        <span >SANCHEZ NUÑEZ, UNAI</span>
        <span ></span>
    </div>
    <div >
        <span >Pista</span>
        <span >POLIDEPORTIVO URRETA</span>
        <span >Galdakao (Vizcaya)</span>
    </div>
</div>

How is that possible? What am I doing wrong to receive only one element of the two?

CodePudding user response:

There really are two divs with the class "box-datos-partido" on the page. However, if you disable JavaScript in your browser, you will notice that the same selector matches only one of them (the first one), because the rest are loaded dynamically by JavaScript. To get them, you can use a browser automation tool such as Selenium. Here I use Selenium together with bs4 to grab the right divs from the rendered HTML.
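One way to check this without opening a browser is to count the matching tags in the raw HTML that the server returns. The sketch below is only an illustration: it uses the standard library's HTMLParser instead of bs4 so that it runs without extra packages, the helper names are hypothetical, and raw_html is a made-up stand-in for the real response body.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of a given name that carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The class attribute may be missing or hold several class names.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_tags_with_class(html, tag, css_class):
    """Return how many <tag class="... css_class ..."> elements html contains."""
    parser = ClassCounter(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical stand-in for the raw response body (e.g. requests.get(url).text).
raw_html = """
<div class="box-datos-partido"><span>Fecha</span></div>
<script>
  // A browser would run this and add a second box; an HTML parser never does.
  document.body.insertAdjacentHTML(
      'beforeend', '<div class="box-datos-partido"></div>');
</script>
"""

print(count_tags_with_class(raw_html, "div", "box-datos-partido"))
```

If the count for the real response is lower than what the browser shows, the missing elements are being added after the page loads, and a parser alone will not see them.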

Selenium example:

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")

webdriver_service = Service("./chromedriver")  # your chromedriver path
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)

url = "https://baloncestoenvivo.feb.es/partido/2218269"
driver.get(url)
driver.maximize_window()
time.sleep(5)  # give the page's JavaScript time to add the second div

# Parse the rendered DOM rather than the raw server response
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()  # close the browser once the HTML has been captured

for card in soup.select("div.box-datos-partido"):
    print(card.prettify())

Output:

<div >
 <div >
  <span >
   Fecha
  </span>
  <span >
   31-10-2021 - 12:00
  </span>
 </div>
 <div >
  <span >
   Árbitros
  </span>
  <span >
   DIAZ DE SARRALDE MARTIN, IÑIGO
  </span>
  <span >
   SANCHEZ NUÑEZ, UNAI
  </span>
  <span >
  </span>
 </div>
 <div >
  <span >
   Pista
  </span>
  <span >
   POLIDEPORTIVO URRETA
  </span>
  <span >
   BIZKAIA KALEA, S/N, Vizcaya (Galdakao)
  </span>
 </div>
</div>

<div >
 <div >
  <span >
   Fecha
  </span>
  <span >
   31/10/2021 - 12:00
  </span>
 </div>
 <div >
  <span >
   Árbitros
  </span>
  <span >
   DIAZ DE SARRALDE MARTIN, IÑIGO
  </span>
  <span >
   SANCHEZ NUÑEZ, UNAI
  </span>
  <span >
  </span>
 </div>
 <div >
  <span >
   Pista
  </span>
  <span >
   POLIDEPORTIVO URRETA
  </span>
  <span >
   Galdakao (Vizcaya)
  </span>
 </div>
</div>
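The fixed time.sleep(5) in the Selenium example either waits longer than needed or, on a slow connection, not long enough. A common alternative is to poll for a condition until it holds or a timeout expires. The sketch below shows that idea with the standard library only; wait_until and two_boxes_present are hypothetical names, and the predicate is a stand-in that simply succeeds on the third call.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Poll predicate() until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Hypothetical stand-in for "the page now contains both divs".
attempts = {"n": 0}


def two_boxes_present():
    attempts["n"] += 1
    return attempts["n"] >= 3


wait_until(two_boxes_present, timeout=2.0, interval=0.01)
print(attempts["n"])
```

With a real driver, the predicate could be something like lambda: len(driver.find_elements(By.CSS_SELECTOR, "div.box-datos-partido")) >= 2. Selenium also ships its own explicit-wait helper, WebDriverWait, which follows the same polling pattern.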

CodePudding user response:

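The data you see is loaded via JavaScript from an external URL. You can request that URL directly with the requests module (this example loads the players of both teams into two pandas DataFrames). Note that the bearer token in the "Authorization" header is time-limited, so you will probably need to replace it with a fresh one, for example from the request headers shown in your browser's developer tools: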

import json
import requests
import pandas as pd


headers = {
    "Authorization": "Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6ImQzOWE5MzlhZTQyZmFlMTM5NWJjODNmYjcwZjc1ZDc3IiwidHlwIjoiSldUIn0.eyJuYmYiOjE2NTkyNjM1MDUsImV4cCI6MTY1OTM0OTkwNSwiaXNzIjoiaHR0cHM6Ly9pbnRyYWZlYi5mZWIuZXMvaWRlbnRpdHkuYXBpIiwiYXVkIjpbImh0dHBzOi8vaW50cmFmZWIuZmViLmVzL2lkZW50aXR5LmFwaS9yZXNvdXJjZXMiLCJsaXZlc3RhdHMuYXBpIl0sImNsaWVudF9pZCI6ImJhbG9uY2VzdG9lbnZpdm9hcHAiLCJpZGFtYml0byI6IjEiLCJyb2xlIjpbIk92ZXJWaWV3IiwiVGVhbVN0YXRzIiwiU2hvdENoYXJ0IiwiUmFua2luZyIsIktleUZhY3RzIiwiQm94U2NvcmUiXSwic2NvcGUiOlsibGl2ZXN0YXRzLmFwaSJdfQ.YDVnzLhZAw8kzE2LLjiS8VZayY-sfUgqMN4zdnjROLImHRamOJ_Htz4ehK26QcpywfZmrD5iUWnFnRFJrJyZdhudOp09B0tmn4HnWs4JHcQBirUpdLi4oDqONctn1J31OktVhHYpAS36Fs-2KTjwHcgR4G-EQsA6vxjkLKYjw6we0oY5w1Q_GUqRmEvfDQY3b2a-VlFEcxMQBS6XFfEL4naSz84w9aW2e7UCnic_Mm4CHzN1RzitcBSiunQyINshQzg-1G4STARAZZjfaVZCP8SDB4bWeuaXYxkwX40vbisJD8mXFP1xN93THlIg-d0LNfZg8iqD0Lx8xRf9nRdXug"
}
url = "https://intrafeb.feb.es/LiveStats.API/api/v1/BoxScore/2218269"
data = requests.get(url, headers=headers).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

t1 = data["BOXSCORE"]["TEAM"][0]["PLAYER"]
t2 = data["BOXSCORE"]["TEAM"][1]["PLAYER"]

df1 = pd.DataFrame(t1)
df2 = pd.DataFrame(t2)

print(df1)
print(df2)
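The bearer token in the headers above is a JWT, and its payload carries an "exp" claim. Decoding the token from this answer gives nbf 1659263505 and exp 1659349905, which is a 24-hour window, so the request is likely to be rejected once that time has passed. The sketch below shows one way to read the claim before sending a request. It does not verify the signature, the helper names are hypothetical, and the demo token is an unsigned illustration rather than a usable credential.

```python
import base64
import json
import time


def jwt_payload(token):
    """Decode the (unverified) payload segment of a JWT into a dict."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def is_expired(token, now=None):
    """Return True if the token's exp claim is not in the future."""
    now = time.time() if now is None else now
    return jwt_payload(token)["exp"] <= now


def make_demo_token(claims):
    """Build an unsigned, illustration-only token (not a valid credential)."""

    def b64(obj):
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    return b64({"alg": "none"}) + "." + b64(claims) + "."


# Same exp value as the token in the answer, wrapped in a demo token.
demo = make_demo_token({"exp": 1659349905})
print(jwt_payload(demo)["exp"])
print(is_expired(demo, now=1659349906))
```

Calling is_expired(headers["Authorization"].split()[1]) on the real header value would tell you whether a new token is needed before the request fails.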

The requests script prints:

  p1m p1a    p1p p2m p2a   p2p p3m p3a   p3p fgm fga   fgp   min minFormatted   sta bs tc mt ro rd rt rf to st   ind pllss   val assist reb pf pts inn       id  no                   name                                                           logo
0   4   6   66,7   0   5   0,0   0   6   0,0   0  11   0,0  1812        30:12  None  0  0  0  0  3  3  5  6  1  None    -1  None      1   3  1   4   1  2188507   0    J. ROYALE SACRISTAN  https://competiciones.feb.es/estadisticas/Foto.aspx?c=2188507
1   0   0    0,0   0   5   0,0   0   0   0,0   0   5   0,0  1021        17:01  None  0  0  0  1  5  6  0  2  1  None   -20  None      0   6  0   0   0  2188508   2    O. ARENAS DE LA HOZ  https://competiciones.feb.es/estadisticas/Foto.aspx?c=2188508
2   0   0    0,0   1   2  50,0   0   1   0,0   1   3  33,3  1363        22:43  None  0  0  0  0  2  2  1  2  1  None    -4  None      1   2  0   2   0  2277838   4    A. RAMASCO CERECERO  https://competiciones.feb.es/estadisticas/Foto.aspx?c=2277838

...
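The percentage columns in the output above (p1p, p2p, p3p, fgp) are printed with a decimal comma, for example 66,7, which suggests they arrive as strings rather than numbers. If you want to do arithmetic on them, they need converting first. This is a minimal sketch in plain Python, assuming those four fields are comma-decimal strings; the helper names are hypothetical and the sample record is trimmed from the first row shown.

```python
def to_float(text):
    """Convert a comma-decimal string such as '66,7' to a float."""
    return float(text.replace(",", "."))


# Percentage fields as they appear in the printed column headers.
PERCENT_FIELDS = ("p1p", "p2p", "p3p", "fgp")


def normalise_player(player):
    """Return a copy of a player record with percentage fields as floats."""
    row = dict(player)
    for field in PERCENT_FIELDS:
        # Only convert string values; leave missing or None fields untouched.
        if isinstance(row.get(field), str):
            row[field] = to_float(row[field])
    return row


# Sample record trimmed from the first row of the output above.
sample = {
    "name": "J. ROYALE SACRISTAN",
    "p1p": "66,7",
    "p2p": "0,0",
    "p3p": "0,0",
    "fgp": "0,0",
}
print(normalise_player(sample))
```

You could apply this to each record before building the DataFrame, for example pd.DataFrame([normalise_player(p) for p in t1]).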