I am trying to scrape MSFT's income statement using code I found here: How to Web scraping SEC Edgar 10-K Dynamic data
The answer there uses a 'span' tag to narrow the search. I do not see a comparable span in my filing, so I am trying the <p> tag instead, with no luck.
Here is my code; it is largely unchanged from the answer given. I changed the base_url and changed soup.find from 'span' to 'p'. Is there a way to find the <p> by its class or, even better, a way to locate the income statement table directly?
Here is the URL to the statement: https://www.sec.gov/Archives/edgar/data/789019/000156459018019062/msft-10k_20180630.htm
from bs4 import BeautifulSoup
import requests
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
# Obtain HTML for search page
base_url = "https://www.sec.gov/Archives/edgar/data/789019/000156459018019062/msft-10k_20180630.htm"
edgar_resp = requests.get(base_url, headers=headers)
edgar_str = edgar_resp.text
soup = BeautifulSoup(edgar_str, 'html.parser')
s = soup.find('p', recursive=True, string='INCOME STATEMENTS ')
t = s.find_next('table')
trs = t.find_all('tr')
for tr in trs:
    if tr.text:
        print(list(tr.stripped_strings))
Here is the code from the example:
from bs4 import BeautifulSoup
import requests
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
# Obtain HTML for search page
base_url = "https://www.sec.gov/Archives/edgar/data/200406/000020040621000057/jnj-20210704.htm"
edgar_resp = requests.get(base_url, headers=headers)
edgar_str = edgar_resp.text
soup = BeautifulSoup(edgar_str, 'html.parser')
s = soup.find('span', recursive=True, string='SALES BY SEGMENT OF BUSINESS ')
t = s.find_next('table')
trs = t.find_all('tr')
for tr in trs:
    if tr.text:
        print(list(tr.stripped_strings))
Thank you!
CodePudding user response:
I'm not sure why that's not working, but a likely reason (my assumption about the document structure, not something I verified against the live page) is that the <p> holds both the anchor element and the heading text. When a tag has more than one child, its .string is None, so a string= filter on that tag never matches. A minimal standalone sketch of that behavior, using toy HTML rather than the actual filing:
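from bs4 import BeautifulSoup

# A <p> whose only child is the text: .string is set, so string= matches.
simple = BeautifulSoup('<p>INCOME STATEMENTS </p>', 'html.parser')
print(simple.find('p', string='INCOME STATEMENTS '))  # <p>INCOME STATEMENTS </p>

# A <p> holding an anchor plus the text: .string is None, so string= finds nothing.
mixed = BeautifulSoup('<p><a name="INCOME_STATEMENTS"></a>INCOME STATEMENTS </p>', 'html.parser')
print(mixed.find('p', string='INCOME STATEMENTS '))  # None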
Instead, you can try targeting the named anchor directly:
s = soup.find('a', attrs={'name':'INCOME_STATEMENTS'})
This should match the <a name="INCOME_STATEMENTS"></a> element inside that paragraph.
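Putting it together, a minimal end-to-end sketch; the find call is the only change, and the request setup and table loop come from your original code:
from bs4 import BeautifulSoup
import requests

headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
base_url = "https://www.sec.gov/Archives/edgar/data/789019/000156459018019062/msft-10k_20180630.htm"
edgar_resp = requests.get(base_url, headers=headers)
soup = BeautifulSoup(edgar_resp.text, 'html.parser')

# Jump to the named anchor that marks the income statement section,
# then walk forward to the first table that follows it.
s = soup.find('a', attrs={'name': 'INCOME_STATEMENTS'})
t = s.find_next('table')
for tr in t.find_all('tr'):
    if tr.text:
        print(list(tr.stripped_strings))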