I'm trying to extract the table from the URL and save hyperlinks also. Current code saves the table to Excel, but the hyperlinks are not saved. I know it's because pd.read_html
extract data as text. How can I extract it with hyperlinks also?
Current code:
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
driver = webdriver.Chrome()
driver.get("https://eservices.bcsc.bc.ca/eder/formsearch.aspx")
## after you press "Search" the table will be displayed.
input("Press ENTER to continue...") ##Come back to terminal and press ENTER to continue execution
output = driver.find_element(By.XPATH, "/html/body/form/table/tbody/tr/td[2]/div/div/div[2]").get_attribute('outerHTML')
dfs = pd.read_html(output)
xlWriter = pd.ExcelWriter('testreport.xlsx', engine='xlsxwriter')
for i, df in enumerate(dfs):
df.to_excel(xlWriter, sheet_name='Sheet{}'.format(i))
xlWriter.save()
CodePudding user response:
urls=[x.get_attribute("href") for x in driver.find_elements(By.XPATH,"//a[@href and text()='Report']")]
To get all 78 href values with the text Report you can do the above.
CodePudding user response:
I found the answer. To get all the links from the table (not only report):
## store links from the table into urls
urls = [x.get_attribute("href") for x in driver.find_elements(By.XPATH,"//table/tbody/tr/td[2]/div/div/div[2]/table/descendant::a[@href]")]
to save links into file with the other content of the table:
##extend table with links and store it under new file
df = pd.read_excel("./testreport.xlsx")
df['Link'] = urls
df.to_excel("./testreportwithlinks.xlsx", index=False)