Finding <p style class using BeautifulSoup

Time:12-16

I am trying to scrape MSFT's income statement using code I found in another question, "How to Web scraping SEC Edgar 10-K Dynamic data".

That answer searches for a <span> tag to narrow the search. I do not see a <span> in the Microsoft filing, so I tried searching for the <p> tag instead, with no luck.

Here is my code, largely unchanged from the answer given: I changed base_url and changed the soup.find call to look for 'p'. Is there a way to find the <p> element or, even better, a way to find the income statement table directly?

Here is the URL to the statement: https://www.sec.gov/Archives/edgar/data/789019/000156459018019062/msft-10k_20180630.htm

from bs4 import BeautifulSoup
import requests


headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
# Obtain HTML for search page
base_url = "https://www.sec.gov/Archives/edgar/data/789019/000156459018019062/msft-10k_20180630.htm"
edgar_resp = requests.get(base_url, headers=headers)
edgar_str = edgar_resp.text


soup = BeautifulSoup(edgar_str, 'html.parser')
s =  soup.find('p', recursive=True, string='INCOME STATEMENTS ')
t = s.find_next('table')
trs = t.find_all('tr')
for tr in trs:
    if tr.text:
        print(list(tr.stripped_strings))

Here is the code from the example:

from bs4 import BeautifulSoup
import requests


headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
# Obtain HTML for search page
base_url = "https://www.sec.gov/Archives/edgar/data/200406/000020040621000057/jnj-20210704.htm"
edgar_resp = requests.get(base_url, headers=headers)
edgar_str = edgar_resp.text


soup = BeautifulSoup(edgar_str, 'html.parser')
s =  soup.find('span', recursive=True, string='SALES BY SEGMENT OF BUSINESS ')
t = s.find_next('table')
trs = t.find_all('tr')
for tr in trs:
    if tr.text:
        print(list(tr.stripped_strings))

Thank you!

CodePudding user response:

I'm not sure why that's not working, but you can try this:

s = soup.find('a', attrs={'name':'INCOME_STATEMENTS'})

This should match the <a name="INCOME_STATEMENTS"></a> element inside that paragraph.
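A likely reason the original soup.find('p', string=...) call returns None: BeautifulSoup compares string= against a tag's .string, and .string is None whenever the tag has more than one child. If the paragraph holds both the <a> anchor and the heading text, it can never match. The search string in the question also ends with a trailing space, which would have to match exactly. Below is a minimal sketch of both points, run against a small hand-written HTML fragment. The fragment, its numbers, and the helper name table_rows_after_anchor are made up for illustration and only assume the filing is laid out roughly this way.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the filing's markup (placeholder values, not real data).
SAMPLE_HTML = """
<p style="font-weight:bold;"><a name="INCOME_STATEMENTS"></a>INCOME STATEMENTS</p>
<table>
  <tr><td>Revenue</td><td>100</td></tr>
  <tr><td></td></tr>
  <tr><td>Net income</td><td>25</td></tr>
</table>
"""


def table_rows_after_anchor(html, anchor_name):
    """Return the non-empty rows of the first table after <a name=anchor_name>."""
    soup = BeautifulSoup(html, "html.parser")
    anchor = soup.find("a", attrs={"name": anchor_name})
    if anchor is None:
        return []
    table = anchor.find_next("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = list(tr.stripped_strings)
        if cells:  # skip rows that contain no text
            rows.append(cells)
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The <p> has two children (the <a> and the text), so .string is None
# and a string= filter on the <p> finds nothing.
heading = soup.find("p")
print(heading.string)
print(soup.find("p", string="INCOME STATEMENTS"))

# Alternative to the anchor lookup: match on the paragraph's full text.
by_text = soup.find(
    lambda tag: tag.name == "p"
    and tag.get_text(strip=True) == "INCOME STATEMENTS"
)

rows = table_rows_after_anchor(SAMPLE_HTML, "INCOME_STATEMENTS")
for row in rows:
    print(row)
```

With the real filing, you would pass edgar_str in place of SAMPLE_HTML. Checking for None before calling find_next avoids the AttributeError the original code raises when the heading is not found.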
