Python web scraping: How to combine selenium and pandas for gathering data on HTML?

Time:12-13

I am gathering sports fixtures and results from a webpage. At first I planned to scrape it with pandas alone, but the page has an option for selecting a "timezone", so I added Selenium to choose the timezone automatically. Now I do not know how to scrape with pandas after using Selenium. Could anybody please help? Thank you very much.

Here is my work:

from selenium import webdriver
from selenium.webdriver.support.ui import Select
import pandas as pd

PATH ="C:/Users/XXX/Desktop/chromedriver.exe"
driver = webdriver.Chrome( PATH )

driver.get("https://fixturedownload.com")

select = Select(driver.find_element_by_name("timezone"))

select.select_by_value("SE Asia Standard Time" )

driver.find_element_by_xpath('/html/body/div[2]/div/div[2]/form/div/input[1]').click()

List = pd.read_html(I am stuck here)

CodePudding user response:

You don't need Selenium. Issue a POST request to the server with your desired timezone (provided it appears in the dropdown list).
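
As a rough illustration of what such a form POST carries, here is a minimal sketch that only builds the request with the standard library and does not send it. The field names `timezone` and `command` are the ones used in this answer; `build_timezone_request` is a hypothetical helper name, and the Content-Type header is an assumption about how the form is submitted.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_timezone_request(timezone, url="https://fixturedownload.com/"):
    """Build (but do not send) a form POST that selects a timezone."""
    # Encode the form fields the way a browser does for a submitted HTML form
    body = urlencode({"timezone": timezone, "command": "Set Timezone"})
    # Passing data makes urllib treat the request as a POST
    return Request(
        url,
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


req = build_timezone_request("SE Asia Standard Time")
print(req.get_method())
print(req.data.decode())
```

The `requests.post(..., data=data)` call in this answer performs the same form encoding for you when `data` is a dict.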

The available values are the value attributes of the option tags within the parent select element, for example "Nepal Standard Time" or "SE Asia Standard Time".
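
If you would rather read the accepted values from the page source than from the browser's inspector, something like the following sketch could work. It uses the standard-library `html.parser` instead of BeautifulSoup so it runs without extra packages; the `OptionValueParser` class and the sample markup are made up for illustration and are not the real page source.

```python
from html.parser import HTMLParser


class OptionValueParser(HTMLParser):
    """Collect the value attribute of every <option> inside <select name=...>."""

    def __init__(self, select_name):
        super().__init__()
        self.select_name = select_name
        self.in_select = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("name") == self.select_name:
            self.in_select = True
        elif tag == "option" and self.in_select and "value" in attrs:
            self.values.append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self.in_select = False


# Made-up sample markup standing in for the real page source
sample_html = """
<select name="timezone">
  <option value="Nepal Standard Time">(UTC+05:45) Kathmandu</option>
  <option value="SE Asia Standard Time">(UTC+07:00) Bangkok, Hanoi, Jakarta</option>
</select>
"""

parser = OptionValueParser("timezone")
parser.feed(sample_html)
timezone_values = parser.values
print(timezone_values)
```

With BeautifulSoup, the analogous step would be a selector along the lines of `soup.select('select[name=timezone] option')` and reading each element's `value` attribute.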

Then parse the response to extract links for your desired download format. For example, you can grab the CSV download links from the header row of each table, which cover all fixtures in that table, as follows:

import requests
# import pandas as pd
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Safari/537.36'}

data = {
  'timezone': 'Nepal Standard Time',
  'command': 'Set Timezone'
}

r = requests.post('https://fixturedownload.com/', headers=headers,  data=data)
soup = bs(r.content, 'lxml')
csv_links = ['https://fixturedownload.com' + i['href'] for i in soup.select('.fixture tr:nth-child(1) td:nth-child(3) a')]
print(csv_links)

You can then combine the CSVs if their headers match, or simply download, store, and manipulate them as needed.
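
The "combine if headers match" step might look roughly like this sketch. It uses the standard-library `csv` module so it runs without extra packages; with pandas you could read each file with `pd.read_csv` and join the frames with `pd.concat` instead. The helper name `combine_csv_texts`, the column names, and the sample rows are all made up for illustration; the real downloads may use different headers.

```python
import csv
import io


def combine_csv_texts(csv_texts):
    """Merge CSV documents into one list of row dicts, requiring identical headers."""
    combined, header = [], None
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        if header is None:
            # The first document defines the expected header
            header = reader.fieldnames
        elif reader.fieldnames != header:
            raise ValueError(f"header mismatch: {reader.fieldnames} != {header}")
        combined.extend(reader)
    return header, combined


# Made-up sample data standing in for two downloaded CSV files
first = "Round,Home Team,Away Team\n1,Carlton,Richmond\n"
second = "Round,Home Team,Away Team\n1,Essendon,Hawthorn\n"

header, rows = combine_csv_texts([first, second])
print(header)
print(len(rows))
```

Raising on a mismatch, rather than silently merging, keeps tables with different column layouts from being mixed together.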

There is no point using read_html as you will lose the links to the actual data.

CodePudding user response:

To select the timezone as SE Asia Standard Time and scrape the TABLE using Pandas you can use the following Locator Strategies:

Code Block:

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
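import pandas as pd
from io import StringIO

# Assumes chromedriver can be located automatically; otherwise create the driver as in the question
driver = webdriver.Chrome()
driver.get("https://fixturedownload.com/")
# Wait for the timezone dropdown, then pick the desired value
Select(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//select[@name='timezone']")))).select_by_value("SE Asia Standard Time")
driver.find_element(By.XPATH, "//input[@value='Set Timezone']").click()
# Grab the HTML of the first visible fixture table once the page has reloaded
data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='fixture']"))).get_attribute("outerHTML")
# read_html returns a list of DataFrames; StringIO avoids the literal-string deprecation warning in newer pandas
df = pd.read_html(StringIO(data))
print(df)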

Console Output:

[                    0                1  ...                          4          5
0        Full fixture  Preview fixture  ...  Download fixture for ICAL  View JSON
1               Teams            Teams  ...                      Teams        NaN
2      Adelaide Crows  Preview fixture  ...  Download fixture for ICAL  View JSON
3      Brisbane Lions  Preview fixture  ...  Download fixture for ICAL  View JSON
4             Carlton  Preview fixture  ...  Download fixture for ICAL  View JSON
5         Collingwood  Preview fixture  ...  Download fixture for ICAL  View JSON
6            Essendon  Preview fixture  ...  Download fixture for ICAL  View JSON
7           Fremantle  Preview fixture  ...  Download fixture for ICAL  View JSON
8        Geelong Cats  Preview fixture  ...  Download fixture for ICAL  View JSON
9     Gold Coast Suns  Preview fixture  ...  Download fixture for ICAL  View JSON
10         GWS Giants  Preview fixture  ...  Download fixture for ICAL  View JSON
11           Hawthorn  Preview fixture  ...  Download fixture for ICAL  View JSON
12          Melbourne  Preview fixture  ...  Download fixture for ICAL  View JSON
13    North Melbourne  Preview fixture  ...  Download fixture for ICAL  View JSON
14      Port Adelaide  Preview fixture  ...  Download fixture for ICAL  View JSON
15           Richmond  Preview fixture  ...  Download fixture for ICAL  View JSON
16           St Kilda  Preview fixture  ...  Download fixture for ICAL  View JSON
17       Sydney Swans  Preview fixture  ...  Download fixture for ICAL  View JSON
18  West Coast Eagles  Preview fixture  ...  Download fixture for ICAL  View JSON
19   Western Bulldogs  Preview fixture  ...  Download fixture for ICAL  View JSON

[20 rows x 6 columns]]