Home > Blockchain >  Extract table data from interactive webpage with python
Extract table data from interactive webpage with python

Time:02-04

I would like to get the table data from a certain website to play with the data statistically, however I'm failing on the interactive button which selects each sector from the linked race. How can I iterate through the button list and store each table in a list or a resulting df? An explanation would be appreciated so I can learn how this works. So far I can only extract the text from the first page:

site = "http://live.fis-ski.com/cc-4023/results-pda.htm" 

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

driver = webdriver.Chrome('chromedriver', options= options)
driver.get(site)

try:
    main = WebDriverWait(driver, 10).until(
         EC.presence_of_element_located((By.ID, 'tab1'))   
        )
    print(main.text)
    result = main.text
except:
     driver.quit()

This gives me just a list of the main page with each sector.

Thanks!

CodePudding user response:

With Select you can select the value of the dropdown and change the race type. With .get_attribute('innerText') you can get the values of the hidden rows too (.text doesn't work for them). With pandas you can store data in a dataframe, eventually saving it to csv.

from selenium.webdriver.support.ui import Select
import pandas as pd

... import other libraries ...

... start chromedriver and load site ...

dropdown = WebDriverWait(driver, 9).until(EC.element_to_be_clickable((By.ID, 'int1')))
races = dropdown.text.strip().split('\n')
data = dict.fromkeys(races)

for race in races:
    print(race, end='')
    Select(dropdown).select_by_visible_text(race)
    time.sleep(1)
    rows = driver.find_elements(By.XPATH, "//div[@id='resultpoint1']/ul/li")
    print(f': {len(rows)} athletes')
    table = []
    
    for row in rows:
        columns = row.find_elements(By.XPATH, "./div")
        table.append([c.get_attribute('innerText') for c in columns])

    column_names = [c.text for c in driver.find_elements(By.CSS_SELECTOR, '#result1 .tableheader div:not([class*=hidden])')]
    data[race] = pd.DataFrame(table, columns=column_names)
    # data[race].to_csv(race '.csv', index=False)

Output

Start list: 72 athletes
600 m: 69 athletes
1200 m: 69 athletes
2200 m: 69 athletes
2500 m: 69 athletes
4000 m: 69 athletes
...

Each table is saved in a dict, for example data['2500 m'] prints the following

enter image description here

  • Related