How to create a list with the values from entity-name column which is visible in "inspect"

Time:03-14

I'm trying to scrape a list from EDGAR.

The information I need is in the "td" elements with the class "entity-name". However, the code I currently have doesn't return anything. I would appreciate any help. Thanks in advance!

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException  # needed by the except clause below
from bs4 import BeautifulSoup

s = Service('/PATH/chromedriver')
driver = webdriver.Chrome(service=s)
driver.get("https://www.sec.gov/edgar/search/#/q=%22cyber%20insurance%22&dateRange=custom&category=form-cat1&startdt=2011-01-01&enddt=2022-03-12&filter_forms=10-K")
try:
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'entity-name')))
except TimeoutException:
    print('Page timed out after 10 secs.')

page = BeautifulSoup(driver.page_source,'html.parser')
print(page)

CodePudding user response:

To extract the text from the entity-name column, induce WebDriverWait for visibility_of_all_elements_located() instead of presence_of_element_located(), then read the value from each returned element. You can use either of the following locator strategies:

  • Using CSS_SELECTOR and text attribute:

    driver.get('https://www.sec.gov/edgar/search/#/q=%22cyber%20insurance%22&dateRange=custom&category=form-cat1&startdt=2011-01-01&enddt=2022-03-12&filter_forms=10-K')
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "td.entity-name")))])
    
  • Using XPATH and get_attribute("innerHTML"):

    driver.get('https://www.sec.gov/edgar/search/#/q=%22cyber%20insurance%22&dateRange=custom&category=form-cat1&startdt=2011-01-01&enddt=2022-03-12&filter_forms=10-K')
    print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//td[@class='entity-name']")))])
    
  • Console output (from the innerHTML variant, which is why it contains raw <br> tags and &amp; entities):

    ['Excel Corp ', 'PROGRESSIVE CORP/OH/  (PGR) ', 'Electromed, Inc.  (ELMD) ', 'HOOKER FURNITURE CORP  (HOFT) ', 'HOOKER FURNITURE CORP  (HOFT) ', 'SOUTHERN CO  (SO, SOJA, SOJB, SOJC, SOJD, SOLN) <br> ALABAMA POWER CO  (ALPVN, APRCP, APRDM, APRDN, APRDO, APRDP, ALP-PQ) <br> GEORGIA POWER CO  (GPJA) <br> MISSISSIPPI POWER CO <br> SOUTHERN Co GAS <br> SOUTHERN POWER CO ', 'HOOKER FURNITURE CORP  (HOFT) ', 'SOUTHERN CO  (SO, SOJA, SOJB, SOJC, SOJD, SOLN) <br> ALABAMA POWER CO  (ALPVN, APRCP, APRDM, APRDN, APRDO, APRDP, ALP-PQ) <br> GEORGIA POWER CO  (GPJA) <br> MISSISSIPPI POWER CO <br> SOUTHERN Co GAS <br> SOUTHERN POWER CO ', 'BENCHMARK ELECTRONICS INC  (BHE) ', 'MARRIOTT INTERNATIONAL INC /MD/  (MAR) ', 'Sprouts Farmers Market, Inc.  (SFM) ', 'CF BANKSHARES INC.  (CFBK) ', 'Repay Holdings Corp  (RPAY) ', 'Sprouts Farmers Market, Inc.  (SFM) ', 'MARRIOTT INTERNATIONAL INC /MD/  (MAR) ', 'Sprouts Farmers Market, Inc.  (SFM) ', 'Albertsons Companies, Inc.  (ACI) ', 'MARRIOTT INTERNATIONAL INC /MD/  (MAR) ', 'MARRIOTT INTERNATIONAL INC /MD/  (MAR) ', 'HENNESSY ADVISORS INC  (HNNA) ', 'Repay Holdings Corp  (RPAY, RPAYW) ', 'Repay Holdings Corp  (RPAY, RPAYW, TBRGU) ', 'Arlo Technologies, Inc.  (ARLO) ', 'Repay Holdings Corp  (RPAY, RPAYW) ', 'NATIONAL HEALTH INVESTORS INC  (NHI) ', 'MOTORCAR PARTS AMERICA INC  (MPAA) ', 'RGC RESOURCES INC  (RGCO) ', 'Arlo Technologies, Inc.  (ARLO) ', 'CRYOLIFE INC  (CRY) ', 'Mimecast Ltd  (MIME) ', 'RGC RESOURCES INC  (RGCO) ', 'MOTORCAR PARTS AMERICA INC  (MPAA) ', 'NOODLES &amp; Co  (NDLS) ', 'PAPA JOHNS INTERNATIONAL INC  (PZZA) ', 'MOTORCAR PARTS AMERICA INC  (MPAA) ', 'MOTORCAR PARTS AMERICA INC  (MPAA) ', 'PAPA JOHNS INTERNATIONAL INC  (PZZA) ', 'MOTORCAR PARTS AMERICA INC  (MPAA) ', 'Sprouts Farmers Market, Inc.  (SFM) ', 'MOTORCAR PARTS AMERICA INC  (MPAA) ', 'GARMIN LTD  (GRMN) ', 'Sprouts Farmers Market, Inc.  (SFM) ', 'nDivision Inc.  (NDVN) ', 'nDivision Inc.  (NDVN) ', 'nDivision Inc.  
(NDVN) ', 'WEYCO GROUP INC  (WEYS) ', 'DiamondRock Hospitality Co  (DRH) ', 'Pebblebrook Hotel Trust  (PEB, PEB-PC, PEB-PD, PEB-PE, PEB-PF) ', 'Sprouts Farmers Market, Inc.  (SFM) ', 'MYR GROUP INC.  (MYRG) ', 'Chatham Lodging Trust  (CLDT, CLDT-PA) ', 'WEYCO GROUP INC  (WEYS) ', 'INFINITE GROUP INC  (IMCI) ', 'DiamondRock Hospitality Co  (DRH) ', 'Pebblebrook Hotel Trust  (PEB, PEB-PC, PEB-PD, PEB-PE, PEB-PF) ', 'DiamondRock Hospitality Co  (DRH, DRH-PA) ', 'Pebblebrook Hotel Trust  (PEB, PEB-PC, PEB-PD, PEB-PE, PEB-PF) ', 'DLH Holdings Corp.  (DLHC) ', 'Summit Hotel Properties, Inc.  (INN) ', 'BOYD GAMING CORP  (BYD) ', 'Summit Hotel Properties, Inc.  (INN) ', 'DiamondRock Hospitality Co  (DRH, DRH-PA) ', 'CINCINNATI FINANCIAL CORP  (CINF) ', 'Summit Hotel Properties, Inc.  (INN) ', 'Pebblebrook Hotel Trust  (PEB, PEB-PC, PEB-PD, PEB-PE, PEB-PF) ', 'ARTIVION, INC.  (AORT) ', 'STAR GROUP, L.P.  (SGU) ', 'Pebblebrook Hotel Trust  (PEB, PEB-PE, PEB-PF, PEB-PG, PEB-PH) ', 'RGC RESOURCES INC  (RGCO) ', 'INFINITE GROUP INC  (IMCI) ', 'LEGGETT &amp; PLATT INC  (LEG) ', 'RGC RESOURCES INC  (RGCO) ', 'COSTCO WHOLESALE CORP /NEW  (COST) ', 'DLH Holdings Corp.  (DLHC) ', 'CANTERBURY PARK HOLDING CORP ', 'WEYCO GROUP INC  (WEYS) ', 'DLH Holdings Corp.  (DLHC) ', 'WEYCO GROUP INC  (WEYS) ', 'Canterbury Park Holding Corp  (CPHC) ', 'RGC RESOURCES INC  (RGCO) ', 'IEC ELECTRONICS CORP  (IEC) ', 'INFINITE GROUP INC  (IMCI) ', 'Canterbury Park Holding Corp  (CPHC) ', 'WEYCO GROUP INC  (WEYS) ', 'Canterbury Park Holding Corp  (CPHC) ', 'AMERICAN STATES WATER CO  (AWR) <br> Golden State Water CO ', 'LEGGETT &amp; PLATT INC  (LEG) ', 'Vy Global Growth  (VYGG, VYGG-UN, VYGG-WT) ', 'Summit Hotel Properties, Inc.  (INN) ', 'Vy Global Growth  (VYGG, VYGG-UN, VYGG-WT) ', 'Sunstone Hotel Investors, Inc.  (SHO, SHO-PE, SHO-PF) ', 'CRYOLIFE INC  (CRY) ', 'BOYD GAMING CORP  (BYD) ', 'Sunstone Hotel Investors, Inc.  (SHO, SHO-PE, SHO-PF) ', 'Summit Hotel Properties, Inc.  
(INN, INN-PE, INN-PF) ', 'Green Bancorp, Inc.  (GNBC) ', 'TELKONET INC  (TKOI) ', 'COHEN &amp; STEERS INC  (CNS) ', 'Sunstone Hotel Investors, Inc.  (SHO, SHO-PE, SHO-PF) ', 'Green Bancorp, Inc.  (GNBC) ']
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
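
The innerHTML variant returns raw markup: a cell that lists several filers separates them with <br>, and ampersands appear as &amp;. Below is a minimal sketch of one way to turn such strings into a clean list, using only the standard library. The helper name parse_entity_cells and the (name, tickers) tuple layout are assumptions made for illustration; they are not part of Selenium or of the approach above.

```python
import html
import re

# Hypothetical helper (not a Selenium API): splits each innerHTML string into
# individual filers and separates the company name from the ticker list.
_ENTITY_RE = re.compile(r"^(?P<name>.*?)\s*(?:\((?P<tickers>[^()]*)\))?$")


def parse_entity_cells(cells):
    """Return a list of (name, [tickers]) tuples from raw innerHTML strings."""
    entities = []
    for cell in cells:
        # One table cell can hold several filers separated by <br>.
        for part in re.split(r"<br\s*/?>", cell):
            # Decode entities such as &amp; and trim the padding spaces.
            text = html.unescape(part).strip()
            if not text:
                continue
            match = _ENTITY_RE.match(text)
            name = match.group("name").strip()
            tickers = match.group("tickers")
            ticker_list = [t.strip() for t in tickers.split(",")] if tickers else []
            entities.append((name, ticker_list))
    return entities


if __name__ == "__main__":
    # Sample strings copied from the console output above.
    sample = [
        "Excel Corp ",
        "NOODLES &amp; Co  (NDLS) ",
        "AMERICAN STATES WATER CO  (AWR) <br> Golden State Water CO ",
    ]
    for name, tickers in parse_entity_cells(sample):
        print(name, tickers)
```

In the Selenium script, the list comprehension that is currently passed to print() could be passed to this helper instead. If the .text variant is used, the browser has already rendered the line breaks and entities, so only the split on newlines and the ticker parsing would still be needed.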
    