Trying to extract table with hyperlinks from URL and save it in Excel [Python]-CodePudding

I'm trying to extract the table from the URL and save hyperlinks also. Current code saves the table to Excel, but the hyperlinks are not saved. I know it's because pd.read_html extract data as text. How can I extract it with hyperlinks also?

Current code:

from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
driver = webdriver.Chrome()
driver.get("https://eservices.bcsc.bc.ca/eder/formsearch.aspx")

## after you press "Search" the table will be displayed.

input("Press ENTER to continue...")  ##Come back to terminal and press ENTER to continue execution

output = driver.find_element(By.XPATH, "/html/body/form/table/tbody/tr/td[2]/div/div/div[2]").get_attribute('outerHTML')

dfs = pd.read_html(output)

xlWriter = pd.ExcelWriter('testreport.xlsx', engine='xlsxwriter')

for i, df in enumerate(dfs):
    df.to_excel(xlWriter, sheet_name='Sheet{}'.format(i))
xlWriter.save()

CodePudding user response：

urls=[x.get_attribute("href") for x in driver.find_elements(By.XPATH,"//a[@href and text()='Report']")]

To get all 78 href values with the text Report you can do the above.

CodePudding user response：

I found the answer. To get all the links from the table (not only report):

## store links from the table into urls
urls = [x.get_attribute("href") for x in driver.find_elements(By.XPATH,"//table/tbody/tr/td[2]/div/div/div[2]/table/descendant::a[@href]")]

to save links into file with the other content of the table:

##extend table with links and store it under new file
df = pd.read_excel("./testreport.xlsx")
df['Link'] = urls
df.to_excel("./testreportwithlinks.xlsx", index=False)