List index out of range for a SEC webscraper


Super noob here; a friend helped me make this web scraper for looking at hedge fund 13Fs. It was working fine previously, but recently I've been getting this error:

response_two = get_request(sec_url + tags[0]['href'])

IndexError: list index out of range

I don't understand why this index isn't working anymore. I've been trying to debug it using the browser console on the SEC site, but I'm having a hard time figuring it out.

Here is the full code:

import requests
import re
import csv
import lxml
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
sec_url = 'https://www.sec.gov'

def get_request(url):
    return requests.get(url)

def create_url(cik):
    return 'https://www.sec.gov/cgi-bin/browse-edgar?CIK={}&owner=exclude&action=getcompany&type=13F-HR'.format(cik)

def get_user_input():
    cik = input("Enter CIK number:")
    return cik

requested_cik = get_user_input()

# Find mutual fund by CIK number on EDGAR
response = get_request(create_url(requested_cik))
soup = BeautifulSoup(response.text, "html.parser")
tags = soup.findAll('a', id="documentsbutton")

# Find latest 13F report for mutual fund
response_two = get_request(sec_url + tags[0]['href'])
soup_two = BeautifulSoup(response_two.text, "html.parser")
tags_two = soup_two.findAll('a', attrs={'href': re.compile('xml')})
xml_url = tags_two[3].get('href')
response_xml = get_request(sec_url + xml_url)
soup_xml = BeautifulSoup(response_xml.content, "lxml")

# DataFrame
df = pd.DataFrame()
df['companies'] = soup_xml.body.findAll(re.compile('nameofissuer'))
df['value'] = soup_xml.body.findAll(re.compile('value'))

for row in df.index:
    df.loc[row, 'value'] = df.loc[row, 'value'].text
    df.loc[row, 'companies'] = df.loc[row, 'companies'].text
df['value'] = df['value'].astype(float)
df = df.groupby('companies').sum()
df = df.sort_values('value',ascending=False)
for row in df.index:
    df.loc[row, 'allocation'] = df.loc[row, 'value']/df['value'].sum()*100
df['allocation'] = df['allocation'].astype(int)
df = df.drop('value', axis=1)
df

Thank you so very much!

CodePudding user response:

Since tags[0] raises an IndexError, the problem seems to be that tags is an empty list ([]).

This means that soup.findAll is not finding any <a> tags with id="documentsbutton" in your soup.

This could be caused by a typo in the URL, the CIK number, or the element id you are searching for.

Since I can't access www.sec.gov, I can only help with the parts that don't depend on it.
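While you debug, a minimal guard (a sketch reusing the names from your script) will fail with a readable message instead of a bare IndexError:

tags = soup.findAll('a', id="documentsbutton")
# Bail out with a clear message when no matching links are found,
# instead of crashing on tags[0] below.
if not tags:
    raise SystemExit("No 'documentsbutton' links found - check the URL, the CIK, and the raw response text.")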

CodePudding user response:

What happens?

Always look at your soup first - therein lies the truth. The content can differ anywhere from slightly to drastically from what you see in the browser dev tools.
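For example, a quick diagnostic (not part of the fix) is to print the status code and the start of the raw response before parsing it:

# Dump the status code and the first part of the raw response to see
# what the server actually sent back.
response = get_request(create_url(requested_cik))
print(response.status_code)
print(response.text[:500])

Doing that here returns the following page instead of the EDGAR results: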

U.S. Securities and Exchange Commission

Your Request Originates from an Undeclared Automated Tool

To allow for equitable access to all users, SEC reserves the right to limit requests originating from undeclared automated tools. Your request has been identified as part of a network of automated tools outside of the acceptable policy and will be managed until action is taken to declare your traffic.

Please declare your traffic by updating your user agent to include company specific information.

For best practices on efficiently downloading information from SEC.gov, including the latest EDGAR filings, visit sec.gov/developer. You can also sign up for email updates on the SEC open data program, including best practices that make it more efficient to download data, and SEC.gov enhancements that may impact scripted downloading processes. For more information, contact [email protected].

For more information, please see the SEC's Web Site Privacy and Security Policy. Thank you for your interest in the U.S. Securities and Exchange Commission.

Reference ID: xxxxxxxxxxxxxxxxxxx.xxxxxxxxxx.xxxxxxxx

More Information

Internet Security Policy

By using this site, you are agreeing to security monitoring and auditing. For security purposes, and to ensure that the public service remains available to users, this government computer system employs programs to monitor network traffic to identify unauthorized attempts to upload or change information or to otherwise cause damage, including attempts to deny service to users.

Unauthorized attempts to upload information and/or change information on any portion of this site are strictly prohibited and are subject to prosecution under the Computer Fraud and Abuse Act of 1986 and the National Information Infrastructure Protection Act of 1996 (see Title 18 U.S.C. §§ 1001 and 1030).

To ensure our website performs well for all users, the SEC monitors the frequency of requests for SEC.gov content to ensure automated searches do not impact the ability of others to access SEC.gov content. We reserve the right to block IP addresses that submit excessive requests. Current guidelines limit users to a total of no more than 10 requests per second, regardless of the number of machines used to submit requests.

If a user or application submits more than 10 requests per second, further requests from the IP address(es) may be limited for a brief period. Once the rate of requests has dropped below the threshold for 10 minutes, the user may resume accessing content on SEC.gov. This SEC practice is designed to limit excessive automated searches on SEC.gov and is not intended or expected to impact individuals browsing the SEC.gov website.

Note that this policy may change as the SEC manages SEC.gov to ensure that the website performs efficiently and remains available to all users.

How to fix?

You can add a user-agent to your request - but you should respect the website's policies.

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

# Send the header with every request.
def get_request(url):
    return requests.get(url, headers=headers)
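Note that the message above asks you to declare your traffic with company-specific information, so a header along these lines (the name and address below are placeholders) follows the SEC's stated guidance more closely than a browser string:

# Placeholder values - substitute your own company name and contact email.
headers = {'user-agent': 'Sample Company Name admin@samplecompany.com'}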

CodePudding user response:

There are two issues with the script:

  1. The SEC added rate limiting to their website. You aren't alone in facing this issue. To resolve it, use the fix that HedgeHog described.

  2. The id of the button you're looking for is "documentsbutton" (with an "s"), rather than "documentbutton". So you need to change the id of the HTML element that you're looking for.

This:

tags = soup.findAll('a', id="documentbutton")

should be this:

tags = soup.findAll('a', id="documentsbutton")

The errors should be gone! (That being said, I can't verify that the dataframe code will work with these requests, since it is cut off in the original post.)
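For reference, here is a sketch combining both fixes against the code in the question (the user-agent value is a placeholder to replace with your own details):

# Declared user-agent per the SEC's policy (placeholder values).
headers = {'user-agent': 'Sample Company Name admin@samplecompany.com'}

def get_request(url):
    return requests.get(url, headers=headers)

# Corrected element id - note the "s" in "documentsbutton".
response = get_request(create_url(requested_cik))
soup = BeautifulSoup(response.text, "html.parser")
tags = soup.findAll('a', id="documentsbutton")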
