I am trying to use BeautifulSoup to get the table found at this link: https://gms.englandhockey.co.uk/fixtures-and-results/competitions.php?comp=4154007
It's an England Hockey website and basically I want to download the table and put it in a DataFrame, and also eventually get the fixtures as well.
Whenever I try to find the right div or table, it returns None.
Here's what I have tried:
import requests
from bs4 import BeautifulSoup

url = "https://gms.englandhockey.co.uk/fixtures-and-results/club.php?id=Royal Holloway HC&prev=4153800"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
I have tried to find the div the table is within, but it returns None.
bread_crumbs = soup.find("div", class_="container")
print(bread_crumbs)
Again, I try to find the table directly, but it returns None.
bread_crumbs = soup.find("table")
print(bread_crumbs)
If anyone can suggest a way to access the table, I would be grateful! It might be that Selenium would be better for this, but I haven't used Selenium yet, so I am not sure where to start.
As you can see from the link, it's a PHP website, so could this be part of the reason?
CodePudding user response:
The table data on that URL is generated dynamically by JavaScript, and BeautifulSoup can't grab dynamic data on its own. That's why I use Selenium together with BeautifulSoup, and finally I grab the whole table as a pandas DataFrame. Accepting the cookies is a must before the data becomes visible, which the script does first. To see the result, just run the code.
Script:
import time
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

url = "https://gms.englandhockey.co.uk/fixtures-and-results/competitions.php?comp=4154007"

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.get(url)
time.sleep(10)  # wait for the JavaScript-rendered content to load

# The cookie banner must be accepted before the fixtures are shown
driver.find_element(By.CSS_SELECTOR, 'input#consentCheckbox').click()
time.sleep(1)
driver.find_element(By.CSS_SELECTOR, 'button.cookieButton').click()
time.sleep(1)

soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.select_one('table#table_2')
driver.quit()

df = pd.read_html(StringIO(str(table)))[0]
print(df)
Output:
Date Time ... Away Team Venue
0 26-Feb 11:45 ... Tunbridge Wells Eastbourne Saffrons Sports Club
1 NaN 15:00 ... Lewes Broadwater School - Pitch 1
2 NaN 18:00 ... Blackheath & Elthamians 1 St Georges College - Pitch 1
3 NaN 13:00 ... Canterbury 2 Borden Grammar School
4 NaN 12:00 ... Horsham Woking HC - Pitch 1
.. ... ... ... ... ...
97 NaN 13:00 ... Lewes Tonbridge School - Pitch 2 - Rowans Astro
98 NaN 16:30 ... Sittingbourne Old Williamsonians Clubhouse
99 NaN 14:30 ... Guildford M1 St Georges College - Pitch 1
100 NaN 11:45 ... Woking Eastbourne Saffrons Sports Club
101 NaN 12:00 ... Sevenoaks 2 College Meadow Pavilion
[102 rows x 6 columns]
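Note that in the scraped table the date only appears on the first fixture of each day, which is why the later rows show NaN in the Date column. If you want every row to carry its date, a forward-fill is a quick fix. Here is a sketch on a small stand-in frame (the column names match the output above, but the sample values are made up for illustration):

```python
import pandas as pd

# Small stand-in for the scraped fixtures table
df = pd.DataFrame({
    "Date": ["26-Feb", None, None],
    "Time": ["11:45", "15:00", "18:00"],
})

# Propagate each date down to the fixtures that share it
df["Date"] = df["Date"].ffill()
print(df["Date"].tolist())  # → ['26-Feb', '26-Feb', '26-Feb']
```

The same `df["Date"] = df["Date"].ffill()` line can be dropped into the script above right after `pd.read_html`.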
CodePudding user response:
Because to access this site you must agree to the use of cookies and accept their terms and conditions, replace the request with the code below and try again:
import requests
from bs4 import BeautifulSoup
url = "https://gms.englandhockey.co.uk/fixtures-and-results/club.php?id=Royal Holloway HC&prev=4153800"
headers = {
'Cookie': 'visitor-id=vPF0YU5Q; visitor-id-2=bQJHxVCjcBs4Qlmoy72Wzw==; ImportantCookie=0; consentCookie=1; ImportantCookie=0; visitor-id=vPF0YUxZ; visitor-id-2=ZrPCssshUkv7rwB6MVkM2A=='
}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, "html.parser")
bread_crumbs = soup.find("table")
print(bread_crumbs)
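If that request does return the table, the same pandas step used in the first answer turns the found <table> element into a DataFrame. Since I can't guarantee the cookie values above stay valid, here is a sketch using a minimal stand-in for `str(bread_crumbs)`:

```python
import pandas as pd
from io import StringIO

# Stand-in for str(bread_crumbs) — the <table> HTML found by BeautifulSoup
html = """
<table>
  <tr><th>Date</th><th>Time</th><th>Venue</th></tr>
  <tr><td>26-Feb</td><td>11:45</td><td>Eastbourne Saffrons Sports Club</td></tr>
</table>
"""

# read_html parses the <th> row as the header and each <tr> as a row
df = pd.read_html(StringIO(html))[0]
print(df.shape)  # → (1, 3)
```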