Python - Downloading PDFs from website behind dropdown

Time:06-14

For the site https://www.wsop.com/tournaments/results/, the objective is to download all available PDFs in the REPORTS section, for every drop-down option where they are available.

Currently I am trying to do this using Selenium, because I couldn't find an API, but I am open to other suggestions. For now the code is a bunch of copy-paste from relevant questions and YouTube videos.

My plan of attack is to select an option in the drop-down menu, press 'GO' (to load the results), navigate to 'REPORTS' (if available) and download all the PDFs available, and then iterate over all options. Challenge 2 is then to get the PDFs into something like a dataframe to do some analysis.

Below is my current code, which only manages to download the top PDF for the option that is selected by default in the drop-down:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options
import os

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

#settings and loading webpage
options=Options()
options.headless=True
CD=ChromeDriverManager().install()

driver=webdriver.Chrome(CD,options=options)

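#os.path.join builds the download folder path (the original line was missing the + between the two parts)
params={'behavior':'allow','downloadPath':os.path.join(os.getcwd(),'PDFs')}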
driver.execute_cdp_cmd('Page.setDownloadBehavior',params)

driver.get('https://www.wsop.com/tournaments/results/')

#Go through the dropdown
drp=Select(driver.find_element_by_id("CPHbody_aid"))
drp.select_by_index(0)

drp=Select(driver.find_element_by_id("CPHbody_grid"))
drp.select_by_index(1)

drp=Select(driver.find_element_by_id("CPHbody_tid"))
drp.select_by_index(5)

#Click the necessary buttons (section with issues)
driver.find_element_by_xpath('//*[@id="nav-tabs"]/a[6]').click()

#driver.find_element_by_name('GO').click()
#WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.LINK_TEXT, "GO"))).click()

#WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.LINK_TEXT, "REPORTS"))).click()

#click() returns None, so there is nothing useful to assign here
driver.find_element_by_id("reports").click()

I can navigate through the drop-downs just fine (and it should be easy to iterate over them). However, I cannot get the 'GO' button pressed. I tried a bunch of different ways, a few of which I left as comments in the code.
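The iteration over drop-down options mentioned above could be sketched as follows. This is only a sketch under assumptions: the helper names `option_values` and `for_each_option` are hypothetical, each `<option>` is assumed to carry a unique `value` attribute, and the `action` callback stands in for "press GO, open REPORTS, download". The locator strings "css selector" are the plain values of Selenium's `By.CSS_SELECTOR` constant, so the functions work with any driver that offers `find_element` / `find_elements`.

```python
def option_values(driver, select_id):
    """Return the value attribute of every <option> inside the <select> with this id."""
    options = driver.find_elements("css selector", f"#{select_id} option")
    return [opt.get_attribute("value") for opt in options]


def for_each_option(driver, select_id, action):
    """Select every option of one drop-down in turn and call action(driver, value)."""
    # Read all values up front: element references may go stale if the page reloads.
    for value in option_values(driver, select_id):
        # Re-locate the option on every pass instead of reusing an old reference.
        selector = f"#{select_id} option[value='{value}']"
        driver.find_element("css selector", selector).click()
        action(driver, value)
```

With the script above this might be called as `for_each_option(driver, "CPHbody_tid", handle_tournament)`, where `handle_tournament` is a function you write yourself. Using `Select(...).select_by_value(value)` in place of clicking the option should behave the same way.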

I am able to press the REPORTS tab, but I think that breaks down when there is a different number of tabs. The commented-out line might work better. For now I am not able to download all the PDFs anyway; it just takes the first PDF on the page.

Many thanks to whoever can help:)

CodePudding user response:

I am not going to write you the whole script, but here's how to click on the "GO" button: we can see from the developer tools that the button is the only element with the class "submit-red-button", so we can access it with: driver.find_elements_by_class_name('submit-red-button')[0].click()

You say that you can access the REPORTS tab, but it did not work when I tested your program, so just in case, you can use driver.find_elements_by_class_name('taboff')[4] to get it.

Then all you need to do is click on each PDF link to download the files.

CodePudding user response:

Based on @Arnaud Rajon's answer, here is the improved version of my code (I thought it better to post it as a response than to change my original question; I hope that's the right way):

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options
import os

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options=Options()
options.headless=False
#CD must be defined, because it is passed to webdriver.Chrome below
CD=ChromeDriverManager().install()

driver=webdriver.Chrome(CD,options=options)

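#os.path.join builds the download folder path (the original line was missing the + between the two parts)
params={'behavior':'allow','downloadPath':os.path.join(os.getcwd(),'PDFs')}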
driver.execute_cdp_cmd('Page.setDownloadBehavior',params)

driver.get('https://www.wsop.com/tournaments/results/')

#"CPHbody_aid","CPHbody_grid","CPHbody_tid"
drp=Select(driver.find_element_by_id("CPHbody_aid"))
drp.select_by_index(0)

drp=Select(driver.find_element_by_id("CPHbody_grid"))
drp.select_by_index(1)

drp=Select(driver.find_element_by_id("CPHbody_tid"))
drp.select_by_index(6)

driver.find_elements_by_class_name('submit-red-button')[0].click()
driver.find_elements_by_class_name('taboff')[-1].click()

This correctly finds the right drop-down option and selects the right tab. I changed the tab selection to [-1], because the REPORTS tab is usually the last one. Not all tournaments have the same number of tabs, so [4] fails. Is there a way to select the item in the list based on its name (in this case 'REPORTS')? That would be even better.
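Selecting the tab by its name rather than its position could be sketched with an XPath text match. The sketch assumes the tabs are `<a>` elements directly under the element with id "nav-tabs" (taken from the XPath in the original question) and that the tab's text in the page source reads REPORTS in some letter case. The helper names `tab_xpath` and `click_tab_by_text` are hypothetical. The string "xpath" is the value of Selenium's `By.XPATH` constant.

```python
_LOWER = "abcdefghijklmnopqrstuvwxyz"
_UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def tab_xpath(label):
    """Build an XPath that matches a tab link by its text, ignoring letter case."""
    # XPath 1.0 has no upper-case function, so translate() maps a-z to A-Z.
    return (
        "//*[@id='nav-tabs']/a["
        f"translate(normalize-space(.), '{_LOWER}', '{_UPPER}')='{label.upper()}']"
    )


def click_tab_by_text(driver, label="REPORTS"):
    """Click the first tab whose text equals label; return False if there is none."""
    matches = driver.find_elements("xpath", tab_xpath(label))
    if not matches:
        return False  # this tournament has no such tab
    matches[0].click()
    return True
```

Because the function returns False instead of raising, a loop over tournaments can simply skip those without a REPORTS tab. Labels containing a single quote are not handled by this sketch.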

Secondly, I want to click the PDF links. I expected either of the following to work in a similar fashion to finding 'taboff' by class name, so that they would download the first PDF. However, they don't.

driver.find_elements_by_xpath('//*[@id="reports"]')[0].click()
driver.find_elements_by_css_selector('#reports')[0].click()

Also, how do I access all the individual PDF links? The output of these find methods is a list with a single WebElement, not a list of elements as I would have expected.
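A note on the single-element result: an id identifies one element, so a selector such as `#reports` matches only the container, not the links inside it. A descendant selector (for example `#reports a`) targets the links themselves. As an alternative that avoids clicking altogether, the links can be read from the page source. The sketch below uses only the standard library and assumes the reports are plain `<a>` tags whose `href` ends in `.pdf`, which has not been verified against the site. The names `PdfLinkParser` and `find_pdf_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags that point at .pdf files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        url = urljoin(self.base_url, href)
        # Check only the path, so query strings do not hide the extension.
        if urlparse(url).path.lower().endswith(".pdf") and url not in self.links:
            self.links.append(url)


def find_pdf_links(html, base_url):
    """Return the de-duplicated PDF URLs found in html, in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

With a live browser session this might be called as `find_pdf_links(driver.page_source, driver.current_url)` after the REPORTS tab has been opened. Each URL could then be saved with `urllib.request.urlretrieve(url, filename)`, provided the site serves the files without the browser's session.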

As is probably already clear, I have almost no experience with HTML, so sorry for the potentially obvious questions. Thanks for the help so far; I hope you can help me drive it home :)
