My question might have been asked earlier, but I'm not getting anywhere with the scenario I'm working on.
I've tried different methods but still have no luck; any help would be appreciated.
Question
I'm trying to load a text file from the URL https://www.sec.gov/Archives/edgar/cik-lookup-data.txt so I can modify the data and create a dataframe.
Example data from the link:
1188 BROADWAY LLC:0001372374:
119 BOISE, LLC:0001633290:
11900 EAST ARTESIA BOULEVARD, LLC:0001639215:
11900 HARLAN ROAD LLC:0001398414:
11:11 CAPITAL CORP.:0001463262:
I should get the output below:
Name | number
1188 BROADWAY LLC | 0001372374
119 BOISE, LLC | 0001633290
11900 EAST ARTESIA BOULEVARD, LLC | 0001639215
11900 HARLAN ROAD LLC | 0001398414
11:11 CAPITAL CORP. | 0001463262
I'm stuck at the first step, loading the text file: I keep getting HTTPError: HTTP Error 403: Forbidden.
References used:
- Given a URL to a text file, what is the simplest way to read the contents of the text file?
- Python requests. 403 Forbidden
My code:
import urllib.request  # the lib that handles the URL stuff

data = urllib.request.urlopen("https://www.sec.gov/Archives/edgar/cik-lookup-data.txt")  # a file-like object that works just like a file
for line in data:  # files are iterable
    print(line)
CodePudding user response:
The returned error message says:
Your request has been identified as part of a network of automated tools outside of the acceptable policy and will be managed until action is taken to declare your traffic. Please declare your traffic by updating your user agent to include company specific information.
You can resolve this as follows:
import urllib.request

url = "https://www.sec.gov/Archives/edgar/cik-lookup-data.txt"
hdr = {'User-Agent': 'Your Company Name [email protected]'}  # change as needed
req = urllib.request.Request(url, headers=hdr)
data = urllib.request.urlopen(req, timeout=60).read().splitlines()
>>> data[:10]
[b'!J INC:0001438823:',
b'#1 A LIFESAFER HOLDINGS, INC.:0001509607:',
b'#1 ARIZONA DISCOUNT PROPERTIES LLC:0001457512:',
b'#1 PAINTBALL CORP:0001433777:',
b'$ LLC:0001427189:',
b'$AVY, INC.:0001655250:',
b'& S MEDIA GROUP LLC:0001447162:',
b'&TV COMMUNICATIONS INC.:0001479357:',
b'&VEST DOMESTIC FUND II KPIV, L.P.:0001802417:',
b'&VEST DOMESTIC FUND II LP:0001800903:']
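For the second half of the question (building the dataframe), here is a minimal parsing sketch, assuming pandas is installed; the latin-1 decoding is an assumption about the file's encoding. Because a company name can itself contain a colon (e.g. 11:11 CAPITAL CORP.), it splits on the last separator rather than the first:

import pandas as pd

rows = []
for raw in data:  # `data` from the snippet above (a list of bytes)
    line = raw.decode("latin-1").rstrip(":")  # drop the trailing ':'
    if ":" not in line:  # skip blank or malformed lines
        continue
    name, number = line.rsplit(":", 1)  # split on the LAST ':' so names keep their colons
    rows.append((name, number))

df = pd.DataFrame(rows, columns=["Name", "number"])
print(df.head())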
CodePudding user response:
The request is disallowed, which is why you are getting a 403 response code. It is good practice to check the robots.txt file when scraping any web page: a robots.txt file tells search-engine crawlers which URLs they may access on a site. It is mainly used to avoid overloading a site with requests; it is not a mechanism for keeping a web page out of Google.
In your case it is https://www.sec.gov/robots.txt
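For completeness, the standard library's urllib.robotparser module can check a URL against robots.txt programmatically. A small sketch follows; note that read() issues an anonymous request itself, so a site that blocks undeclared clients may refuse it too:

from urllib import robotparser

rp = robotparser.RobotFileParser("https://www.sec.gov/robots.txt")
rp.read()  # fetch and parse the robots.txt rules

# True if the rules allow this user agent to fetch the URL
print(rp.can_fetch("*", "https://www.sec.gov/Archives/edgar/cik-lookup-data.txt"))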