How to extract data from both th and td tags using Selenium in Python?

Time:11-19

So I'm trying to scrape data from the DOL website for a project, using Selenium with Python. I want to combine the column data into a data frame. The problem is that the first two columns are coded as <th> tags, so an XPath command targeting the <td> cells doesn't pick them up. I really need help with this. I've been racking my brain and searching everywhere, and I can't find this problem addressed anywhere. Please help.

   <tr>
   <th id="Alabama" align="left">Alabama</th>
   <th id="01/04/2020" align="right">01/04/2020</th>
   <td headers="Alabama 01/04/2020 initial_claims" align="right">4,578</td>
   <td headers="Alabama 01/04/2020 reflecting_week_ended" align="right">12/28/2019</td>
   <td headers="Alabama 01/04/2020 continued_claims" align="right">18,523</td>
   <td headers="Alabama 01/04/2020 covered_employment" align="right">1,923,741</td>
   <td headers="Alabama 01/04/2020 insured_unemployment" align="right">0.96</td>
   </tr>
   from selenium import webdriver
   from webdriver_manager.chrome import ChromeDriverManager
   from selenium.webdriver.support.select import Select
   from selenium.webdriver.common.action_chains import ActionChains
   
   url = 'https://oui.doleta.gov/unemploy/claims.asp'
   driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\chromedriver.exe")
   
   driver.implicitly_wait(10)
   driver.get(url)
   driver.find_element_by_css_selector('input[name="level"][value="state"]').click()
   Select(driver.find_element_by_name('strtdate')).select_by_value('2020')
   Select(driver.find_element_by_name('enddate')).select_by_value('2022')
   driver.find_element_by_css_selector('input[name="filetype"][value="html"]').click()
   select = Select(driver.find_element_by_id('states'))

   # Iterate through and select all states
   for opt in select.options:
       opt.click()
   input('Press ENTER to submit the form')
   driver.find_element_by_css_selector('input[name="submit"][value="Submit"]').click()

   headers = []
   heads = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr[2]/th')

   #Collect headers
   for h in heads:
       headers.append(h.text)

   rows = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr')
   
   # Get row count
   row_count = len(rows) 

   cols = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr[3]/th/td')
   # Get column count
   col_count = len(cols)

I've tried this code

   cols = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr[3]/th' and '//*[@id="content"]/table/tbody/tr[3]/td')

as suggested. However, it still only pulls 5 columns, but as you can see from the HTML above, there are 7. I need all of them. Please help?

CodePudding user response:

You can extract data from all 7 columns by using * or name() in the XPath. The XPath would look like the below.

rows = driver.find_elements_by_xpath("//table/tbody/tr")

# Within each row, use a dot in the XPath to find elements inside that element.
# This gets every child element of the row, whatever its tag:
cols = row.find_elements_by_xpath("./*")

# Or, restricted to children whose tag name is "th" or "td":
cols = row.find_elements_by_xpath("./*[name()='th' or name()='td']")
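The union idea can be sanity-checked offline against the <tr> fragment from the question, using only the standard library (a sketch; ElementTree's limited XPath support stands in for Selenium's XPath engine here):

```python
import xml.etree.ElementTree as ET

# The <tr> fragment posted in the question.
row_html = """
<tr>
<th id="Alabama" align="left">Alabama</th>
<th id="01/04/2020" align="right">01/04/2020</th>
<td headers="Alabama 01/04/2020 initial_claims" align="right">4,578</td>
<td headers="Alabama 01/04/2020 reflecting_week_ended" align="right">12/28/2019</td>
<td headers="Alabama 01/04/2020 continued_claims" align="right">18,523</td>
<td headers="Alabama 01/04/2020 covered_employment" align="right">1,923,741</td>
<td headers="Alabama 01/04/2020 insured_unemployment" align="right">0.96</td>
</tr>
"""

row = ET.fromstring(row_html)
# "./*" selects every child element of the row regardless of tag name,
# so the two <th> cells and the five <td> cells all come back together.
cells = [cell.text for cell in row.findall("./*")]
print(len(cells))   # 7
print(cells)
```

This confirms that selecting all children of the row yields 7 cells, <th> and <td> alike.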

Try like below:

# Get the rows
rows = driver.find_elements_by_xpath("//table/tbody/tr")

# Iterate over the rows
for row in rows:
    # Get all the columns for each row. 
    # cols = row.find_elements_by_xpath("./*")
    cols = row.find_elements_by_xpath("./*[name()='th' or name()='td']")
    temp = []  # Temporary list
    for col in cols:
        temp.append(col.text)
    print(temp)
['']
['State', 'Filed week ended', 'Initial Claims', 'Reflecting Week Ended', 'Continued Claims', 'Covered Employment', 'Insured Unemployment Rate']
['Alabama', '01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96']
['Alabama', '01/11/2020', '3,629', '01/04/2020', '21,143', '1,923,741', '1.10']
['Alabama', '01/18/2020', '2,483', '01/11/2020', '17,402', '1,923,741', '0.90']
...
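Once each row comes back as a list like the ones above, combining them into a data frame is straightforward; a minimal sketch (the sample rows below are copied from the printed output above):

```python
# Header row and data rows, as printed by the loop above.
headers = ['State', 'Filed week ended', 'Initial Claims',
           'Reflecting Week Ended', 'Continued Claims',
           'Covered Employment', 'Insured Unemployment Rate']
data_rows = [
    ['Alabama', '01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96'],
    ['Alabama', '01/11/2020', '3,629', '01/04/2020', '21,143', '1,923,741', '1.10'],
]

# Skip empty or short rows (the first printed row above was ['']),
# then zip each remaining row against the headers to get records.
records = [dict(zip(headers, row))
           for row in data_rows
           if len(row) == len(headers)]
print(records[0]['Initial Claims'])   # 4,578
```

From here, pandas.DataFrame(records) or pandas.DataFrame(data_rows, columns=headers) gives the combined data frame the question asks for.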

CodePudding user response:

To scrape the data from both <th> and <td> tags you can use a List Comprehension together with the following Locator Strategies:

  • Code Block:

    # Assumes `driver` has already been created as in the question.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.select import Select

    driver.get("https://oui.doleta.gov/unemploy/claims.asp")
    WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR,"input[value='state']"))).click()
    Select(driver.find_element_by_name('strtdate')).select_by_value('2020')
    Select(driver.find_element_by_name('enddate')).select_by_value('2022')
    Select(driver.find_element_by_id('states')).select_by_visible_text('Alabama')
    driver.find_element_by_css_selector("input[value='Submit']").click()
    # To print all the texts from the first row
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table[summary*='Report Table'] tbody tr:nth-child(3)"))).text)
    print("*****")
    # To create a List with all the texts from the first row using List Comprehension
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table[summary*='Report Table'] tbody tr:nth-child(3) [align='right']")))])
    driver.quit()
    
  • Console Output:

    Alabama 01/04/2020 4,578 12/28/2019 18,523 1,923,741 0.96
    *****
    ['01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96']
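Note that the list above has 6 items, not 7: the [align='right'] filter skips the state <th>, which has align="left" in the HTML. Whichever approach is used, one more step is usually needed before analysis, since the numeric columns come back as strings with thousands separators. A minimal sketch of the cleanup (the sample row is taken from the output above):

```python
def clean_number(text: str) -> float:
    """Convert strings like '4,578' or '0.96' to numbers
    by stripping the thousands separators."""
    return float(text.replace(',', ''))

# One scraped row, as returned by the list comprehension above.
row = ['01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96']

initial_claims = clean_number(row[1])
rate = clean_number(row[5])
print(initial_claims, rate)   # 4578.0 0.96
```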
    