I would like to copy the links and dates associated with earnings and other reports (8-K, 10-Q) for a specific stock. I have tried looping through the results table class, but I am getting a timeout error while using Selenium. Any help would be greatly appreciated. Thank you!
Approach 1 loops through the results table class (as described above); approach 2 tried to loop through the individual result XPaths of the table:
# approach 1: wait for all result links at once
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = setupDriver(download_path)
driver.get('https://sec.report/Document/Search/?queryCo=lvs&page=5')
formEle = getElements.clickHiddenEle(geshort, '//*[@id="formType"]')
# formclick = getElements.clickHiddenEle(geshort, '//*[@id="formType"]/option[97]')
driverwait = WebDriverWait(driver, 30)
driverwait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="formType"]/option[97]'))).click()
# //*[@id="filer1"]
driverwait = WebDriverWait(driver, 15)
companyname = driverwait.until(EC.presence_of_element_located((By.ID, "filer1")))
companyname.clear()
companyname.send_keys("TSLA")
searchEle = getElements.clickHiddenEle(geshort, '//*[@id="searchtext"]/div[4]/div/div[2]/button')
wait = WebDriverWait(driver, 30)
# this is the wait that times out
links = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="results"]/div[2]/table/tbody/a')))
for link in links:
    print(link.get_attribute('href'))
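One likely cause of the timeout in approach 1 is the XPath itself: //*[@id="results"]/div[2]/table/tbody/a only matches a elements that are direct children of tbody, but in an HTML table the links sit inside td cells, so the condition never becomes true and the wait expires. A minimal sketch of the same wait with a descendant selector (//a instead of /a), assuming the same setup as above:

wait = WebDriverWait(driver, 30)
# '//a' (descendant axis) matches links anywhere under the tbody,
# not just its direct children
links = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, '//*[@id="results"]/div[2]/table/tbody//a')))
for link in links:
    print(link.get_attribute('href'))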
# approach 2: loop through each result row's XPath
driver = setupDriver(download_path)
driver.get('https://sec.report/Document/Search/?queryCo=lvs&page=5')
formEle = getElements.clickHiddenEle(geshort, '//*[@id="formType"]')
# formclick = getElements.clickHiddenEle(geshort, '//*[@id="formType"]/option[97]')
driverwait = WebDriverWait(driver, 30)
driverwait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="formType"]/option[97]'))).click()
# //*[@id="filer1"]
driverwait = WebDriverWait(driver, 15)
companyname = driverwait.until(EC.presence_of_element_located((By.ID, "filer1")))
companyname.clear()
companyname.send_keys("TSLA")
searchEle = getElements.clickHiddenEle(geshort, '//*[@id="searchtext"]/div[4]/div/div[2]/button')
href = []
for i in range(1, 200):  # XPath indices are 1-based
    wait = WebDriverWait(driver, 40)
    wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="results"]/div[2]/table/tbody/tr[' + str(i) + ']/td[1]/div[1]')))
    headings = driver.find_elements(By.XPATH, '//*[@id="results"]/div[2]/table/tbody/tr[' + str(i) + ']/td[1]/div[1]')
    for heading in headings:
        link = heading.find_element(By.TAG_NAME, "a")
        x = link.get_attribute("href")
        href.append(x)
print(href)
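Rather than waiting on a hard-coded XPath for every index (which times out as soon as one expected row is missing), a more robust pattern is to wait for the table once and then iterate over whatever rows it actually contains. A sketch, not OP's exact flow, assuming the same #results container as above:

wait = WebDriverWait(driver, 40)
# wait once for the result rows, then work with what is really there
rows = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, '//*[@id="results"]/div[2]/table/tbody/tr')))

href = []
for row in rows:
    # not every row is guaranteed to contain a link, so guard the lookup
    anchors = row.find_elements(By.TAG_NAME, 'a')
    if anchors:
        href.append(anchors[0].get_attribute('href'))
print(href)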
CodePudding user response:
Hoping that OP's next question will contain code (not images) and a minimal reproducible example, here is one solution to this conundrum. As stated in the comments, Selenium should be a last resort when web scraping: it is a tool meant for testing, not scraping. The following solution extracts the dates, form names, form descriptions and form URLs from the 11 pages' worth of data concerning LVS:
from bs4 import BeautifulSoup as bs
import requests
from tqdm import tqdm
import pandas as pd

# identify the scraper with a custom User-Agent
headers = {'User-Agent': 'Sample Company Name AdminContact@<sample company domain>.com'}

s = requests.Session()
s.headers.update(headers)

big_list = []
for x in tqdm(range(1, 12)):
    r = s.get(f'https://sec.report/Document/Search/?queryCo=lvs&page={x}')
    soup = bs(r.text, 'html.parser')
    data_rows = soup.select('table.table tr')
    for row in data_rows:
        form_title = row.select('a')[0].get_text(strip=True)
        form_url = row.select('a')[0].get('href')
        form_desc = row.select('small')[0].get_text(strip=True)
        form_date = row.select('td')[-1].get_text(strip=True)
        big_list.append((form_date, form_title, form_desc, form_url))

df = pd.DataFrame(big_list, columns=['Date', 'Title', 'Description', 'Url'])
print(df)
Result:
Date Title Description Url
0 2022-09-14 8-K 8-K 8-K Form 8-K - Period Ending 2022-09-14 https://sec.report/Document/0001300514-22-000101/#lvs-20220914.htm
1 2022-08-29 40-APP/A 40-APP/A 40-APP/A Form 40-APP https://sec.report/Document/0001193125-22-231550/#d351601d40appa.htm
2 2022-07-22 10-Q 10-Q 10-Q Form 10-Q - Period Ending 2022-06-30 https://sec.report/Document/0001300514-22-000094/#lvs-20220630.htm
3 2022-07-20 8-K 8-K 8-K Form 8-K - Period Ending 2022-07-20 https://sec.report/Document/0001300514-22-000088/#lvs-20220720.htm
4 2022-07-11 8-K 8-K 8-K Form 8-K - Period Ending 2022-07-11 https://sec.report/Document/0001300514-22-000084/#lvs-20220711.htm
... ... ... ... ...
1076 2004-11-22 S-1/A S-1/A S-1/A Form S-1 https://sec.report/Document/0001047469-04-034893/#a2143958zs-1a.htm
1077 2004-10-25 S-1/A S-1/A S-1/A Form S-1 https://sec.report/Document/0001047469-04-031910/#a2143958zs-1a.htm
1078 2004-10-20 8-K 8-K 8-K Form 8-K - Period Ending 2004-09-30 https://sec.report/Document/0001047469-04-031637/#a2145253z8-k.htm
1079 2004-10-08 UPLOAD LETTER Form UPLOAD https://sec.report/Document/0000000000-04-032407/#filename1.txt
1080 2004-09-03 S-1 S-1 S-1 Form S-1 https://sec.report/Document/0001047469-04-028031/#a2142433zs-1.htm
1081 rows × 4 columns
You can extract other fields too, like the company name and person name (presumably the person filing the form).
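Since OP only wants the earnings-related filings, the resulting DataFrame can be narrowed down with a simple filter. A sketch, assuming the df built above (the form type is matched against the Title column):

# keep only the 8-K and 10-Q filings, then just the dates and links
reports = df[df['Title'].str.contains('8-K|10-Q', regex=True)]
print(reports[['Date', 'Url']])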
Documentation:
BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
Requests: https://requests.readthedocs.io/en/latest/
TQDM: https://pypi.org/project/tqdm/
Pandas: https://pandas.pydata.org/pandas-docs/stable/index.html