I am trying to return only the first URL that appears when scraping "https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&CIK=&type=8-k&company=&dateb=&owner=include&start=0&count=40&output=atom". The scrape itself works, but the result does not index the way I expect a list to: [0] returns "h", [1] returns "t", and so on.
For example, outputting print(link[0]) does not return the first link, but returns h h h h
How can I make it so that I only return the first URL listed in the XML file?
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Sample Company Name AdminContact@<sample company domain>.com'}
xml_text = requests.get('https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&CIK=&type=8-k&company=&dateb=&owner=include&start=0&count=40&output=atom', headers=headers).text.lower()
soup = BeautifulSoup(xml_text, 'xml')
for e in soup.select('entry'):
    link = e.link['href']
    print(link)
CodePudding user response:
For example, outputting print(link[0]) does not return the first link, but returns h h h h
This is expected because link is only ever a single URL string, so link[0] is the first character of that string, an "h". The loop rebinds link on every pass, so nothing is ever collected into a list.
If you want to collect all of the links in a list, change this code
for e in soup.select('entry'):
    link = e.link['href']
    print(link)
to something like
links = [e.link['href'] for e in soup.select('entry')]
Then you can access the first link with your index notation, e.g.
print(links[0])
# https://www.sec.gov/archives/edgar/data/1718405/000171840522000045/0001718405-22-000045-index.htm
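The difference between indexing the string and indexing the list can be seen without touching the network. This is only a minimal sketch: the example.com URLs are hypothetical placeholders standing in for the hrefs found in the feed.

```python
# Hypothetical URLs standing in for the hrefs found in the feed.
hrefs = [
    "https://example.com/first-index.htm",
    "https://example.com/second-index.htm",
]

# What the question's loop does: `link` is rebound to one string per pass,
# so link[0] is the first character of that string.
first_chars = []
for link in hrefs:
    first_chars.append(link[0])

# What the fix does: keep every href in a list, then index the list.
links = [href for href in hrefs]

print(first_chars)  # ['h', 'h']
print(links[0])     # https://example.com/first-index.htm
```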
Alternatively, you could do something like:
link = None
for e in soup.select('entry'):
    link = e.link['href']
    break
print(link)
which would begin to walk the parsed XML but break after the first entry.
CodePudding user response:
Have you considered using the select_one() method?
As you say, you only need the first match, so your code would look like this:
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Sample Company Name AdminContact@<sample company domain>.com'}
xml_text = requests.get('https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&CIK=&type=8-k&company=&dateb=&owner=include&start=0&count=40&output=atom', headers=headers).text.lower()
soup = BeautifulSoup(xml_text, 'xml')
url = soup.select_one('entry').link['href']
print(url)
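One thing to watch for: select_one() returns None when nothing matches, so chaining .link['href'] directly would raise an error on an empty feed. Below is a rough sketch of a guarded version, under a few assumptions: first_entry_href and SAMPLE_FEED are hypothetical names, the example.com URLs are placeholders, and 'html.parser' is swapped in for 'xml' only so that the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature feed; the example.com URLs are placeholders.
SAMPLE_FEED = """
<feed>
  <entry>
    <title>first filing</title>
    <link rel="alternate" href="https://example.com/first-index.htm"/>
  </entry>
  <entry>
    <title>second filing</title>
    <link rel="alternate" href="https://example.com/second-index.htm"/>
  </entry>
</feed>
"""


def first_entry_href(xml_text):
    """Return the href of the first <entry>'s <link>, or None if absent."""
    # 'html.parser' is used here only to avoid the lxml dependency;
    # the code in this thread passes 'xml' instead.
    soup = BeautifulSoup(xml_text, "html.parser")
    entry = soup.select_one("entry")
    if entry is None or entry.link is None:
        return None
    return entry.link.get("href")


print(first_entry_href(SAMPLE_FEED))      # https://example.com/first-index.htm
print(first_entry_href("<feed></feed>"))  # None
```

With the real feed you would pass the downloaded text in place of SAMPLE_FEED and check for None before using the result.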