I am trying to use BeautifulSoup to get the table found at this link: https://gms.englandhockey.co.uk/fixtures-and-results/competitions.php?comp=4154007
It's an England Hockey website and basically I want to download the table and put it in a DataFrame, and also eventually get the fixtures as well.
Whenever I try to find the right div or table, it returns None.
Here's what I have tried:
import requests
from bs4 import BeautifulSoup

url = "https://gms.englandhockey.co.uk/fixtures-and-results/club.php?id=Royal Holloway HC&prev=4153800"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
I have tried to find the div the table is within, but it returns None.
bread_crumbs = soup.find("div", class_="container")
print(bread_crumbs)
Again, I try to find the table directly, but it returns None.
bread_crumbs = soup.find("table")
print(bread_crumbs)
If anyone can suggest a way to access the table, I would be grateful! It might be that Selenium would be better for this, but I haven't used Selenium yet, so I am not sure where to start.
As you can see from the link, it's a PHP website, so could this be part of the reason?
CodePudding user response:
The table data on that URL is generated dynamically by JavaScript, and BeautifulSoup can't grab dynamic data on its own. That's why I use Selenium together with BeautifulSoup, and finally I grab the whole table as a pandas DataFrame. Accepting the cookies is a must before the data becomes visible, which the script does first. To see the result, just run the code.
Script:
import time
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

url = "https://gms.englandhockey.co.uk/fixtures-and-results/competitions.php?comp=4154007"

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.get(url)
time.sleep(10)  # wait for the JavaScript-rendered content to load

# The cookie banner must be accepted before the fixtures are shown
driver.find_element(By.CSS_SELECTOR, 'input#consentCheckbox').click()
time.sleep(1)
driver.find_element(By.CSS_SELECTOR, 'button.cookieButton').click()
time.sleep(1)

soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.select_one('table#table_2')
driver.quit()

df = pd.read_html(StringIO(str(table)))[0]
print(df)
Output:
Date Time ... Away Team Venue
0 26-Feb 11:45 ... Tunbridge Wells Eastbourne Saffrons Sports Club
1 NaN 15:00 ... Lewes Broadwater School - Pitch 1
2 NaN 18:00 ... Blackheath & Elthamians 1 St Georges College - Pitch 1
3 NaN 13:00 ... Canterbury 2 Borden Grammar School
4 NaN 12:00 ... Horsham Woking HC - Pitch 1
.. ... ... ... ... ...
97 NaN 13:00 ... Lewes Tonbridge School - Pitch 2 - Rowans Astro
98 NaN 16:30 ... Sittingbourne Old Williamsonians Clubhouse
99 NaN 14:30 ... Guildford M1 St Georges College - Pitch 1
100 NaN 11:45 ... Woking Eastbourne Saffrons Sports Club
101 NaN 12:00 ... Sevenoaks 2 College Meadow Pavilion
[102 rows x 6 columns]
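Note that in the scraped table the date only appears on the first fixture of each day, which is why the later rows show NaN in the Date column. If you want every row to carry its date, a forward-fill is a quick fix. Here is a sketch on a small stand-in frame (the column names match the output above, but the sample values are made up for illustration):

```python
import pandas as pd

# Small stand-in for the scraped fixtures table
df = pd.DataFrame({
    "Date": ["26-Feb", None, None],
    "Time": ["11:45", "15:00", "18:00"],
})

# Propagate each date down to the fixtures that share it
df["Date"] = df["Date"].ffill()
print(df["Date"].tolist())  # → ['26-Feb', '26-Feb', '26-Feb']
```

The same `df["Date"] = df["Date"].ffill()` line can be dropped into the script above right after `pd.read_html`.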
CodePudding user response:
Because to access this site you must agree to the use of cookies and accept their terms and conditions, replace the request with the code below and try again:
import requests
from bs4 import BeautifulSoup
url = "https://gms.englandhockey.co.uk/fixtures-and-results/club.php?id=Royal Holloway HC&prev=4153800"
headers = {
'Cookie': 'visitor-id=vPF0YU5Q; visitor-id-2=bQJHxVCjcBs4Qlmoy72Wzw==; ImportantCookie=0; consentCookie=1; ImportantCookie=0; visitor-id=vPF0YUxZ; visitor-id-2=ZrPCssshUkv7rwB6MVkM2A=='
}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, "html.parser")
bread_crumbs = soup.find("table")
print(bread_crumbs)
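If that request does return the table, the same pandas step used in the first answer turns the found <table> element into a DataFrame. Since I can't guarantee the cookie values above stay valid, here is a sketch using a minimal stand-in for `str(bread_crumbs)`:

```python
import pandas as pd
from io import StringIO

# Stand-in for str(bread_crumbs) — the <table> HTML found by BeautifulSoup
html = """
<table>
  <tr><th>Date</th><th>Time</th><th>Venue</th></tr>
  <tr><td>26-Feb</td><td>11:45</td><td>Eastbourne Saffrons Sports Club</td></tr>
</table>
"""

# read_html parses the <th> row as the header and each <tr> as a row
df = pd.read_html(StringIO(html))[0]
print(df.shape)  # → (1, 3)
```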