I'm trying to scrape data from the DOL website for a project using Selenium with Python. I want to pull the column data and combine it into a data frame. The problem is that the first two columns are coded as <th>
tags, so my XPath command doesn't return them when extracting the data. I've been racking my brain and searching everywhere, and I can't find anywhere this problem is addressed. Please help.
<tr>
<th id="Alabama" align="left">Alabama</th>
<th id="01/04/2020" align="right">01/04/2020</th>
<td headers="Alabama 01/04/2020 initial_claims" align="right">4,578</td>
<td headers="Alabama 01/04/2020 reflecting_week_ended" align="right">12/28/2019</td>
<td headers="Alabama 01/04/2020 continued_claims" align="right">18,523</td>
<td headers="Alabama 01/04/2020 covered_employment" align="right">1,923,741</td>
<td headers="Alabama 01/04/2020 insured_unemployment" align="right">0.96</td>
</tr>
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.action_chains import ActionChains
url = 'https://oui.doleta.gov/unemploy/claims.asp'
driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\chromedriver.exe")
driver.implicitly_wait(10)
driver.get(url)
driver.find_element_by_css_selector('input[name="level"][value="state"]').click()
Select(driver.find_element_by_name('strtdate')).select_by_value('2020')
Select(driver.find_element_by_name('enddate')).select_by_value('2022')
driver.find_element_by_css_selector('input[name="filetype"][value="html"]').click()
select = Select(driver.find_element_by_id('states'))
# Iterate through and select all states
for opt in select.options:
    opt.click()
input('Press ENTER to submit the form')
driver.find_element_by_css_selector('input[name="submit"][value="Submit"]').click()
# Collect the headers
headers = []
heads = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr[2]/th')
for h in heads:
    headers.append(h.text)
rows = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr')
# Get row count
row_count = len(rows)
cols = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr[3]/th/td')
# Get column count
col_count = len(cols)
I've tried this code
cols = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr[3]/th' and '//* [@id="content"]/table/tbody/tr[3]/td')
as suggested. However, it still only pulls 5 columns, and as you can see from the HTML above, there are 7. I need them all. Please help.
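One thing I've since noticed, which may explain the 5 columns: in Python, `and` between two non-empty strings just evaluates to the second string, so find_elements_by_xpath only ever receives the td XPath and the th XPath is silently discarded. A quick check in plain Python, no Selenium needed:

```python
# `and` between two truthy values returns the second operand,
# so the th XPath is thrown away before Selenium ever sees it.
xpath = '//*[@id="content"]/table/tbody/tr[3]/th' and '//*[@id="content"]/table/tbody/tr[3]/td'
print(xpath)
# //*[@id="content"]/table/tbody/tr[3]/td
```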
CodePudding user response:
You can extract data from all 7 columns by using * or name() in the XPath. The XPath would be something like below.
rows = driver.find_elements_by_xpath("//table/tbody/tr")
cols = row.find_elements_by_xpath("./*") # Gets all the column elements within the row element. The leading dot makes the XPath relative to the row.
Or
cols = row.find_elements_by_xpath("./*[name()='th' or name()='td']") # Gets all the column elements whose tag name is "th" or "td" within the row element.
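You can sanity-check the relative ./* idea offline against the sample row from the question using Python's standard-library ElementTree, since that row happens to be well-formed XML (no Selenium or browser needed):

```python
import xml.etree.ElementTree as ET

# The sample <tr> from the question, verbatim.
row_html = '''<tr>
<th id="Alabama" align="left">Alabama</th>
<th id="01/04/2020" align="right">01/04/2020</th>
<td headers="Alabama 01/04/2020 initial_claims" align="right">4,578</td>
<td headers="Alabama 01/04/2020 reflecting_week_ended" align="right">12/28/2019</td>
<td headers="Alabama 01/04/2020 continued_claims" align="right">18,523</td>
<td headers="Alabama 01/04/2020 covered_employment" align="right">1,923,741</td>
<td headers="Alabama 01/04/2020 insured_unemployment" align="right">0.96</td>
</tr>'''

row = ET.fromstring(row_html)
cells = row.findall('./*')  # every child element of the row, <th> and <td> alike
print(len(cells))               # 7
print([c.text for c in cells])  # ['Alabama', '01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96']
```

Selenium's ./* behaves the same way relative to the row element it is called on.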
Try like below:
# Get the rows
rows = driver.find_elements_by_xpath("//table/tbody/tr")
# Iterate over the rows
for row in rows:
    # Get all the columns for each row.
    # cols = row.find_elements_by_xpath("./*")
    cols = row.find_elements_by_xpath("./*[name()='th' or name()='td']")
    temp = [] # Temporary list
    for col in cols:
        temp.append(col.text)
    print(temp)
['']
['State', 'Filed week ended', 'Initial Claims', 'Reflecting Week Ended', 'Continued Claims', 'Covered Employment', 'Insured Unemployment Rate']
['Alabama', '01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96']
['Alabama', '01/11/2020', '3,629', '01/04/2020', '21,143', '1,923,741', '1.10']
['Alabama', '01/18/2020', '2,483', '01/11/2020', '17,402', '1,923,741', '0.90']
...
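Since the goal in the question is a data frame, the per-row lists printed above can be fed straight into pandas. A minimal sketch, assuming pandas is installed: drop the empty first row, use the second row as the header, and keep only rows whose length matches the header.

```python
import pandas as pd

# Rows as collected by the loop above (the empty row and the header
# row come back as plain lists too, so peel them off first).
scraped = [
    [''],
    ['State', 'Filed week ended', 'Initial Claims', 'Reflecting Week Ended',
     'Continued Claims', 'Covered Employment', 'Insured Unemployment Rate'],
    ['Alabama', '01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96'],
    ['Alabama', '01/11/2020', '3,629', '01/04/2020', '21,143', '1,923,741', '1.10'],
]

header = scraped[1]
data = [r for r in scraped[2:] if len(r) == len(header)]
df = pd.DataFrame(data, columns=header)
print(df.shape)  # (2, 7)
```

The numeric columns still hold strings with thousands separators at this point; they can be converted afterwards with something like df['Initial Claims'].str.replace(',', '').astype(int).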
CodePudding user response:
To scrape the data from both the <th> and <td> tags you can use a list comprehension together with the following locator strategies:
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # set up the driver as in the question
driver.get("https://oui.doleta.gov/unemploy/claims.asp")
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[value='state']"))).click()
Select(driver.find_element_by_name('strtdate')).select_by_value('2020')
Select(driver.find_element_by_name('enddate')).select_by_value('2022')
Select(driver.find_element_by_id('states')).select_by_visible_text('Alabama')
driver.find_element_by_css_selector("input[value='Submit']").click()
# To print all the texts from the first row
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table[summary*='Report Table'] tbody tr:nth-child(3)"))).text)
print("*****")
# To create a list with all the texts from the first row using a list comprehension
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table[summary*='Report Table'] tbody tr:nth-child(3) [align='right']")))])
driver.quit()
Console Output:
Alabama 01/04/2020 4,578 12/28/2019 18,523 1,923,741 0.96
*****
['01/04/2020', '4,578', '12/28/2019', '18,523', '1,923,741', '0.96']