I'm scraping data from this site, and I want to extract the hyperlinks (a tags) from the collapsible content under the Selected Filings section. In my code, I used find_all() on div elements with the id selected-filings-annualOrQuarterly to first select the div that contains them.
from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.sec.gov/edgar/browse/?CIK=0001084869",
                 headers={'user-agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.content, 'html.parser')
print(r.status_code)
print(r.url)

for div_tag in soup.find_all('div', {"id": "selected-filings-annualOrQuarterly"}):
    print(div_tag)
    for ul_tag in div_tag.find_all('ul'):
        print(ul_tag)
        for li_tag in ul_tag.find_all('li'):
            print(li_tag)
            for a_tag in li_tag.find_all('a', href=True):
                print(a_tag)
These are the results I get:
200
https://www.sec.gov/edgar/browse/?CIK=0001084869
<div id="selected-filings-annualOrQuarterly">
No 10-K/10-Q filings for this company in last
<span id="selected-filings-annualOrQuarterly-days-old"></span> days<p> </p>
</div>
Whenever I run the above code, I just get the span element inside the div and nothing more. What I really want is to obtain all the links (a tags) present in the specified div, inside the li/ul elements. When I view the page in the browser, everything is there, but the code returns not even the ul and li tags inside the div. It seems like they are hidden. This is the order and location of the a tag hyperlinks: id (specified above) > ul > li > a (links to be scraped)
Results I expected:
<a href='ix?doc=/Archives/edgar/data/1084869/000143774921025463/flws20210926_10q.htm'>
<a href='ix?doc=/Archives/edgar/data/1084869/000143774921025463/flws20210926_10q.htm'>
Depending on the number of links found, the number of links returned will vary. How can I obtain the hyperlinks at this location?
CodePudding user response:
When I view the page source of the said website
The inspector is not the source; that's the page after JavaScript has modified it. The actual source is Ctrl+U,
and there you'll see that your data is not present.
The data you want is in a JSON file that you can see in the network tab: https://data.sec.gov/submissions/CIK0001084869.json
In filings > recent > primaryDocument, look for all files that end with _10q.htm
or _10k.htm.
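That JSON stores the recent filings as parallel arrays, so the accession number for a document sits at the same index as its filename. A minimal sketch of that layout, using a made-up inline sample (the field names mirror the real feed; the values here are invented for illustration):

```python
# Hypothetical miniature of the submissions JSON; field names mirror
# the real data.sec.gov feed, values are invented for illustration.
sample = {
    "filings": {
        "recent": {
            "accessionNumber": ["0001437749-21-025463", "0001437749-21-000002"],
            "primaryDocument": ["flws20210926_10q.htm", "pressrelease.htm"],
        }
    }
}

recent = sample["filings"]["recent"]
for i, fname in enumerate(recent["primaryDocument"]):
    if fname.endswith("_10q.htm") or fname.endswith("_10k.htm"):
        # the arrays are index-aligned, so recent["accessionNumber"][i]
        # belongs to the same filing as fname
        print(recent["accessionNumber"][i], fname)
```

The real feed works the same way, just with many more entries per array.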
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
data = requests.get('https://data.sec.gov/submissions/CIK0001084869.json', headers=headers).json()
base_url = 'https://sec.gov/ix?doc=/Archives/edgar/data/1084869'

for i, fname in enumerate(data['filings']['recent']['primaryDocument']):
    if fname.endswith('_10q.htm') or fname.endswith('_10k.htm'):
        # accession numbers come as '0001437749-21-025463';
        # the URL needs them without the dashes
        access_number = data['filings']['recent']['accessionNumber'][i]
        access_number = ''.join(access_number.split('-'))
        print('/'.join([base_url, access_number, fname]))
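The same recent block also carries a form array, so instead of matching on filename suffixes you could filter on the form type itself, which also catches 10-K/10-Q documents whose filenames don't follow the _10q.htm pattern. A sketch under that assumption, again with an invented inline sample in place of the live feed:

```python
# Sketch: filter on the "form" array instead of the filename suffix.
# Field names mirror the real submissions feed; values are invented.
recent = {
    "form": ["10-Q", "8-K", "10-K"],
    "accessionNumber": ["0001437749-21-025463", "0001437749-21-000002", "0001437749-21-000003"],
    "primaryDocument": ["flws20210926_10q.htm", "pressrelease.htm", "flws_annual.htm"],
}

base_url = "https://sec.gov/ix?doc=/Archives/edgar/data/1084869"
urls = []
for i, form in enumerate(recent["form"]):
    if form in ("10-Q", "10-K"):
        # same dash-stripping as above, expressed with str.replace
        acc = recent["accessionNumber"][i].replace("-", "")
        urls.append("/".join([base_url, acc, recent["primaryDocument"][i]]))

for u in urls:
    print(u)
```

With the live JSON you would swap the inline recent dict for data['filings']['recent'].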