How do I return the first link in a non-list output

Time:07-19

I am trying to return only the first URL that appears when scraping "https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&CIK=&type=8-k&company=&dateb=&owner=include&start=0&count=40&output=atom". However, although the scrape appears to produce a list, it is not stored the way I expect: [0] returns "h", [1] returns "t", and so on.

For example, outputting print(link[0]) does not return the first link, but returns h h h h

How can I make it so I only return the first URL that is listed in the xml file?

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Sample Company Name AdminContact@<sample company domain>.com'}

xml_text = requests.get('https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&CIK=&type=8-k&company=&dateb=&owner=include&start=0&count=40&output=atom', headers=headers).text.lower()

soup = BeautifulSoup(xml_text, 'xml')

for e in soup.select('entry'):
    link = e.link['href']
    print(link)

CodePudding user response:

Regarding "outputting print(link[0]) does not return the first link, but returns h h h h":

This is expected: inside the loop, link is only ever a single URL string, never a list, so link[0] is the first character of that string, an "h", and it is printed once per entry.
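To see the behaviour without hitting the SEC site, here is a minimal sketch. The hrefs list and its example.com URLs are made-up stand-ins for the values the scrape would produce, not real filings.

```python
# Hypothetical stand-ins for the href values found in the feed.
hrefs = [
    "https://example.com/first-index.htm",
    "https://example.com/second-index.htm",
]

printed = []
for link in hrefs:
    # link is rebound to ONE string on each pass, so link[0] is a character.
    printed.append(link[0])

print(" ".join(printed))    # one "h" per entry
print(type(link).__name__)  # str, not list
```

Indexing a str gives a one-character string, which is why every entry prints "h" instead of a URL.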

If you want to collect all of the links in a list, change this code

for e in soup.select('entry'):
    link = e.link['href']
    print(link)

to something like

links = [e.link['href'] for e in soup.select('entry')]

Then you can access the first link with your index notation, e.g.

print(links[0])
# https://www.sec.gov/archives/edgar/data/1718405/000171840522000045/0001718405-22-000045-index.htm

Alternatively, you could do something like:

link = None
for e in soup.select('entry'):
    link = e.link['href']
    break

print(link)

which would begin to walk the parsed XML but break after the first entry.
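The same "stop at the first entry" idea can also be written with next() and a default value, which avoids the explicit loop and break. The sketch below is only an offline illustration: it swaps BeautifulSoup for the standard library's xml.etree.ElementTree so that it runs without a network call or extra packages, and SAMPLE_FEED, the example.com URLs, and first_entry_link are hypothetical names invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-entry Atom feed standing in for the live EDGAR response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><link href="https://example.com/first-index.htm"/></entry>
  <entry><link href="https://example.com/second-index.htm"/></entry>
</feed>"""

# Atom elements live in this namespace, so lookups need a prefix mapping.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def first_entry_link(xml_text):
    """Return the href of the first entry's link, or None if there are no entries."""
    root = ET.fromstring(xml_text)
    # Lazy generator: only the first match is actually consumed by next().
    hrefs = (link.get("href") for link in root.iterfind("atom:entry/atom:link", ATOM))
    return next(hrefs, None)


first = first_entry_link(SAMPLE_FEED)
print(first)
```

Passing None as the second argument to next() means an empty feed yields None rather than raising StopIteration, so the caller can check for a missing link explicitly.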

CodePudding user response:

Have you considered using the select_one() method? As you say, you only need the first match, so your code would look like this:

from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Sample Company Name AdminContact@<sample company domain>.com'}
xml_text = requests.get('https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&CIK=&type=8-k&company=&dateb=&owner=include&start=0&count=40&output=atom', headers=headers).text.lower()
soup = BeautifulSoup(xml_text, 'xml')
url = soup.select_one('entry').link['href']
print(url)