Why can't I find the table when scraping a website?


I am trying to fix the code below, which is supposed to pull a company's filings table from the SEC EDGAR website.

Libraries:

# import our libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup

Definition of parameters for search:

# base URL for the SEC EDGAR browser
endpoint = r"https://www.sec.gov/cgi-bin/browse-edgar"

# define our parameters dictionary
param_dict = {'action':'getcompany',
              'CIK':'1265107',
              'type':'10-k',
              'dateb':'20190101',
              'owner':'exclude',
              'start':'',
              'output':'',
              'count':'100'}

# request the url, and then parse the response.
response = requests.get(url = endpoint, params = param_dict)
soup = BeautifulSoup(response.content, 'html.parser')

# Let the user know it was successful.
print('Request Successful')
print(response.url)

This is where the problem is: when I try to loop over the contents of the table, I get the error shown below, as if the table does not exist.

# find the document table with our data
doc_table = soup.find_all('table', class_='tableFile2')

# define a base url that will be used for link building.
base_url_sec = r"https://www.sec.gov"

master_list = []

# loop through each row in the table.
for row in doc_table[0].find_all('tr'):

The error:

IndexError: list index out of range

Here is the link to the page I am trying to scrape: https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=1265107&type=10-k&dateb=20190101&owner=exclude&start=&output=&count=100 When I inspect the elements of the page in a browser, I can't find anything that could cause this error.

Thank you for any help.

CodePudding user response:

The table data is static, so you can grab it with pandas.read_html without parsing the HTML yourself. Note that the request still has to send a User-Agent header, because SEC.gov serves an error page to clients that don't identify themselves:

import pandas as pd
import requests

# SEC.gov returns an error page to requests without a User-Agent header
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"
}
url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=1265107&type=10-k&dateb=20190101&owner=exclude&start=&output=&count=100'
req = requests.get(url, headers=headers).text

# the filings table is the third table on the page
df = pd.read_html(req)[2]
print(df)

Output:

Filings                      Format  ... Filing Date     File/Film Number
0    10-K  Documents Interactive Data  ...  2018-03-07   333-11002518671437   
1    10-K  Documents Interactive Data  ...  2017-03-13   333-11002517683575   
2    10-K  Documents Interactive Data  ...  2016-03-08  333-110025161489854   
3    10-K  Documents Interactive Data  ...  2015-03-06   333-11002515681017   
4    10-K  Documents Interactive Data  ...  2014-03-04   333-11002514664345   
5    10-K  Documents Interactive Data  ...  2013-03-01   333-11002513655933   
6    10-K                   Documents  ...  2006-09-20  333-110025061099734   
7    10-K                   Documents  ...  2005-09-23  333-110025051099353   

[8 rows x 5 columns]
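If you would rather not hard-code the table index (position 2 is just where the filings table happens to sit on this page), pandas can also select the table by its HTML attributes. A minimal variation, using read_html's attrs parameter with the tableFile2 class from the original post:

import pandas as pd
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"
}
url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=1265107&type=10-k&dateb=20190101&owner=exclude&start=&output=&count=100'
html = requests.get(url, headers=headers).text

# filter on the table's class attribute instead of its position on the page
df = pd.read_html(html, attrs={'class': 'tableFile2'})[0]
print(df)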

Alternative: your original code also works; you just have to send the User-Agent as headers. Without it, SEC.gov returns an error page that contains no table with class tableFile2, so find_all returns an empty list and doc_table[0] raises the IndexError you saw.

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"
}
endpoint = r"https://www.sec.gov/cgi-bin/browse-edgar"

# define our parameters dictionary
param_dict = {'action':'getcompany',
              'CIK':'1265107',
              'type':'10-k',
              'dateb':'20190101',
              'owner':'exclude',
              'start':'',
              'output':'',
              'count':'100'}

# request the url, and then parse the response.
response = requests.get(url=endpoint, params=param_dict, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# Let the user know it was successful.
print('Request Successful')
print(response.url)


doc_table = soup.find_all('table', class_='tableFile2')

# define a base url that will be used for link building.
base_url_sec = r"https://www.sec.gov"

master_list = []

# loop through each row in the table.
for row in doc_table[0].find_all('tr'):
    print(list(row.stripped_strings))

Output:

['Filings', 'Format', 'Description', 'Filing Date', 'File/Film Number']
['10-K', 'Documents', 'Interactive Data', 'Annual report [Section 13 and 15(d), not S-K Item 405]', 'Acc-no: 0001265107-18-000013\xa0(34 Act)\xa0 Size: 11 MB', '2018-03-07', '333-110025', '18671437']
['10-K', 'Documents', 'Interactive Data', 'Annual report [Section 13 and 15(d), not S-K Item 405]', 'Acc-no: 0001265107-17-000007\xa0(34 Act)\xa0 Size: 11 MB', '2017-03-13', '333-110025', '17683575']
['10-K', 'Documents', 'Interactive Data', 'Annual report [Section 13 and 15(d), not S-K Item 405]', 'Acc-no: 0001265107-16-000052\xa0(34 Act)\xa0 Size: 9 MB', '2016-03-08', '333-110025', '161489854']
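The original snippet also defines base_url_sec and master_list but the loop body was cut off. As a rough sketch of where that was likely headed (the column positions come from the output above; the dictionary keys are my own naming, not from the original post), you can collect each filing's type, date, and an absolute link to its documents page:

master_list = []

# skip the header row, then walk the data rows
for row in doc_table[0].find_all('tr')[1:]:
    cells = row.find_all('td')
    if len(cells) < 5:
        continue  # skip rows that don't match the expected layout

    # the "Documents" link sits in the second cell; prepend the base url
    link = cells[1].find('a', href=True)
    master_list.append({
        'filing_type': cells[0].get_text(strip=True),
        'filing_date': cells[3].get_text(strip=True),
        'documents_url': base_url_sec + link['href'] if link else None,
    })

for filing in master_list:
    print(filing)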