My question might have been asked earlier, but I'm not getting anywhere with the scenario I'm working on.
I've tried different methods but still have no luck; any help would be appreciated.
Question
I'm trying to load a text file from the URL https://www.sec.gov/Archives/edgar/cik-lookup-data.txt so I can modify the data and create a dataframe.
Example data from the link:
1188 BROADWAY LLC:0001372374:
119 BOISE, LLC:0001633290:
11900 EAST ARTESIA BOULEVARD, LLC:0001639215:
11900 HARLAN ROAD LLC:0001398414:
11:11 CAPITAL CORP.:0001463262:
I should get the output below:
Name | number
1188 BROADWAY LLC | 0001372374
119 BOISE, LLC | 0001633290
11900 EAST ARTESIA BOULEVARD, LLC | 0001639215
11900 HARLAN ROAD LLC | 0001398414
11:11 CAPITAL CORP. | 0001463262
I'm stuck at the first step, loading the text file: I keep getting HTTPError: HTTP Error 403: Forbidden.
References used:
- Given a URL to a text file, what is the simplest way to read the contents of the text file?
- Python requests. 403 Forbidden
My code:
import urllib.request  # the lib that handles the URL stuff

data = urllib.request.urlopen("https://www.sec.gov/Archives/edgar/cik-lookup-data.txt")  # a file-like object that works just like a file
for line in data:  # files are iterable
    print(line)
CodePudding user response:
The returned error message says:
Your request has been identified as part of a network of automated tools outside of the acceptable policy and will be managed until action is taken to declare your traffic. Please declare your traffic by updating your user agent to include company specific information.
You can resolve this as follows:
import urllib.request

url = "https://www.sec.gov/Archives/edgar/cik-lookup-data.txt"
hdr = {'User-Agent': 'Your Company Name [email protected]'}  # change as needed
req = urllib.request.Request(url, headers=hdr)
data = urllib.request.urlopen(req, timeout=60).read().splitlines()
>>> data[:10]
[b'!J INC:0001438823:',
b'#1 A LIFESAFER HOLDINGS, INC.:0001509607:',
b'#1 ARIZONA DISCOUNT PROPERTIES LLC:0001457512:',
b'#1 PAINTBALL CORP:0001433777:',
b'$ LLC:0001427189:',
b'$AVY, INC.:0001655250:',
b'& S MEDIA GROUP LLC:0001447162:',
b'&TV COMMUNICATIONS INC.:0001479357:',
b'&VEST DOMESTIC FUND II KPIV, L.P.:0001802417:',
b'&VEST DOMESTIC FUND II LP:0001800903:']
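For the second half of the question (building the dataframe), here is a minimal parsing sketch, assuming pandas is installed; the latin-1 decoding is an assumption about the file's encoding. Because a company name can itself contain a colon (e.g. 11:11 CAPITAL CORP.), it splits on the last separator rather than the first:

import pandas as pd

rows = []
for raw in data:  # `data` from the snippet above (a list of bytes)
    line = raw.decode("latin-1").rstrip(":")  # drop the trailing ':'
    if ":" not in line:  # skip blank or malformed lines
        continue
    name, number = line.rsplit(":", 1)  # split on the LAST ':' so names keep their colons
    rows.append((name, number))

df = pd.DataFrame(rows, columns=["Name", "number"])
print(df.head())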
CodePudding user response:
The request is disallowed, which is why you are getting a 403 response code. It is good practice to check the robots.txt file when scraping any web page: a robots.txt file tells search-engine crawlers which URLs they may access on a site. It is mainly used to avoid overloading a site with requests; it is not a mechanism for keeping a web page out of Google.
In your case it is https://www.sec.gov/robots.txt
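For completeness, the standard library's urllib.robotparser module can check a URL against robots.txt programmatically. A small sketch follows; note that read() issues an anonymous request itself, so a site that blocks undeclared clients may refuse it too:

from urllib import robotparser

rp = robotparser.RobotFileParser("https://www.sec.gov/robots.txt")
rp.read()  # fetch and parse the robots.txt rules

# True if the rules allow this user agent to fetch the URL
print(rp.can_fetch("*", "https://www.sec.gov/Archives/edgar/cik-lookup-data.txt"))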