How to scrape data hidden inside a collapsible element?

Time:11-07

I'm scraping data from this site and want to extract the hyperlinks (a tags) from the collapsible content under the Selected Filings section. In my code, I use find_all() to first select the [div] with the [id] selected-filings-annualOrQuarterly, which is where the links are found.

from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.sec.gov/edgar/browse/?CIK=0001084869", headers={'user-agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.content, 'html.parser')
print(r.status_code) 
print(r.url)
for div_tag in soup.find_all('div', {"id" : "selected-filings-annualOrQuarterly"}):
    print(div_tag)
    for ul_tag in div_tag.find_all('ul'):
        print(ul_tag)
        for li_tag in ul_tag.find_all('li'):
            print(li_tag)
            for a_tag in li_tag.find_all('a', href=True):
                print(a_tag)

These are the results I get:

200
https://www.sec.gov/edgar/browse/?CIK=0001084869
<div id="selected-filings-annualOrQuarterly">
No 10-K/10-Q filings for this company in last
<span id="selected-filings-annualOrQuarterly-days-old"></span> days<p> </p>
</div>

Whenever I run the above code, I get only the [span] element inside the [div] and nothing more. What I want is all the links (a tags) in the specified [div], nested inside its [ul] and [li] elements. When I view the page source of the said website, everything is there, but the code doesn't return even the [ul] and [li] tags inside the [div]. It seems like they are hidden. The links are nested like this: div (id specified above) > ul > li > a (links to be scraped).

Results I expected:

<a href='ix?doc=/Archives/edgar/data/1084869/000143774921025463/flws20210926_10q.htm'>
<a href='ix?doc=/Archives/edgar/data/1084869/000143774921025463/flws20210926_10q.htm'>

The number of links returned will vary depending on how many are found. How can I obtain the hyperlinks inside that location?

CodePudding user response:

When I view the page source of the said website

The inspector does not show the source; it shows the page after scripts have modified it. The actual source, which is what requests downloads, is what you get with Ctrl+U, and you'll see that your data is not there.
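One way to confirm this without a browser is to list the a tags in the raw HTML that requests received. This is only a minimal sketch using the standard library: the class name LinkCollector is made up for illustration, and raw_html is the div that the question's code printed, standing in for r.text.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in the HTML fed to it."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


# Stand-in for r.text: the div exactly as it came back in the question
raw_html = """<div id="selected-filings-annualOrQuarterly">
No 10-K/10-Q filings for this company in last
<span id="selected-filings-annualOrQuarterly-days-old"></span> days<p> </p>
</div>"""

collector = LinkCollector()
collector.feed(raw_html)
print(collector.hrefs)  # → []
```

An empty list here means the links are not in the downloaded HTML at all, so no selector can find them; they have to come from wherever the page's scripts load them.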

The data you want is in a JSON file that you can see in the network tab: https://data.sec.gov/submissions/CIK0001084869.json

Look in filings > recent > primaryDocument for all file names that end with _10q.htm or _10k.htm:

import requests

# Browser-like User-Agent, as in the question's request
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
data = requests.get('https://data.sec.gov/submissions/CIK0001084869.json', headers=headers).json()
base_url = 'https://sec.gov/ix?doc=/Archives/edgar/data/1084869'
recent = data['filings']['recent']
# primaryDocument and accessionNumber are read as parallel lists, one entry per filing
for i, fname in enumerate(recent['primaryDocument']):
    if fname.endswith(('_10q.htm', '_10k.htm')):
        # Drop the dashes from the accession number to get the folder name
        access_number = recent['accessionNumber'][i].replace('-', '')
        print('/'.join([base_url, access_number, fname]))
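The same filtering can be wrapped in a function and tried on a small dict before hitting the network. This is a sketch under assumptions: build_filing_urls and its parameters are hypothetical names, and sample only imitates the two fields used above. Its first entry is taken from the link in the question; the second is invented to show a file that gets skipped.

```python
def build_filing_urls(submissions, cik="1084869", suffixes=("_10q.htm", "_10k.htm")):
    """Build viewer URLs for primary documents whose names end with one of `suffixes`."""
    base_url = f"https://sec.gov/ix?doc=/Archives/edgar/data/{cik}"
    recent = submissions["filings"]["recent"]
    urls = []
    # Walk the two parallel lists together instead of indexing by position
    for accession, fname in zip(recent["accessionNumber"], recent["primaryDocument"]):
        if fname.endswith(suffixes):
            # The folder name is the accession number without dashes
            urls.append("/".join([base_url, accession.replace("-", ""), fname]))
    return urls


# Illustrative data only; the second entry is made up
sample = {
    "filings": {
        "recent": {
            "accessionNumber": ["0001437749-21-025463", "0000000000-00-000000"],
            "primaryDocument": ["flws20210926_10q.htm", "example_8k.htm"],
        }
    }
}

for url in build_filing_urls(sample):
    print(url)
```

With real data, pass the dict returned by requests.get(...).json() in place of sample.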