I'm just 3 months into learning Python and I ran into a little problem while building a Yahoo Finance web scraper.
import pandas as pd
from bs4 import BeautifulSoup
import lxml
import requests
import openpyxl
index = 'MSFT'
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
url = 'https://finance.yahoo.com/quote/MSFT/financials?p=MSFT'
read_data = requests.get(url,headers=headers, timeout=5)
content = read_data.content
soup_is = BeautifulSoup(content,'lxml')
ls = []
for l in soup_is.find_all('div') and soup_is.find_all('span'):
    ls.append(l.string)
new_ls = list(filter(None,ls))
new_ls = new_ls[45:]
is_data = list(zip(*[iter(new_ls)]*6))
Income_st = pd.DataFrame(is_data[0:])
print(Income_st)
Everything was going smoothly until I noticed that the contents of the rows "Diluted EPS" and "Basic EPS" weren't copied.
While inspecting the source code I noticed that the EPS values are stored directly in the div tag, if I can say it like that, instead of in a <span>value</span> underneath it:
<div data-test="fin-col"><span>39,240,000</span></div>
<div data-test="fin-col">9.70</div>
Any idea how I can fix the code to get those values out? Also, any idea how I can extract the data separately for the two different views, "Annually" and "Quarterly"?
I was trying to change the tags, attributes etc., but to no avail. :(
CodePudding user response:
To extract the EPS values, you can modify your code to search for the div tags with data-test="fin-col" that contain the values you're interested in (the long generated class string like "Ta(c) Py(6px) Bxz(bb) ... D(tbc)" isn't needed and tends to change between page builds). Since the EPS cells keep their text directly in the div rather than in a nested span, fall back to the div's own text when no span is present — otherwise div.find('span').string raises an AttributeError on those cells. Here's an example:
eps_values = []
eps_divs = soup_is.find_all('div', {'data-test': 'fin-col'})
for div in eps_divs:
    span = div.find('span')
    # EPS cells have no <span>, so take the div's text directly
    eps_values.append(span.get_text(strip=True) if span else div.get_text(strip=True))
print(eps_values)
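The span-or-bare-text fallback can be checked offline against the two sample cells from the question, without hitting the live page — a small sketch:

```python
from bs4 import BeautifulSoup

# The two cell shapes from the question: one value wrapped in a <span>,
# one stored as bare text directly in the <div>.
html = ('<div data-test="fin-col"><span>39,240,000</span></div>'
        '<div data-test="fin-col">9.70</div>')
soup = BeautifulSoup(html, 'html.parser')

values = []
for div in soup.find_all('div', {'data-test': 'fin-col'}):
    span = div.find('span')
    # Fall back to the div's own text when there is no nested <span>
    values.append(span.get_text(strip=True) if span else div.get_text(strip=True))

print(values)  # both cell shapes are captured
```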
Regarding extracting data from different pages, you can change the URL in your requests.get call to the URL of the desired page, then process the data as you did for the original page. Here's an example for the "Annually" page:
url = 'https://finance.yahoo.com/quote/MSFT/financials?p=MSFT&annual'
read_data = requests.get(url,headers=headers, timeout=5)
content = read_data.content
soup_is = BeautifulSoup(content,'lxml')
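If you end up fetching both views, one way to keep it tidy is a small mapping from period name to URL, reusing the same parsing code for each. The quarterly query string below is an assumption (Yahoo may toggle the view client-side rather than via the URL), so verify the real URLs in your browser first:

```python
# Hypothetical query strings -- verify in a browser; Yahoo may switch
# annual/quarterly client-side rather than via the URL.
base = 'https://finance.yahoo.com/quote/MSFT/financials?p=MSFT'
urls = {
    'annual': base + '&annual',
    'quarterly': base + '&quarterly',
}

for period, url in urls.items():
    # fetch and parse each page with the same code as above, e.g.:
    # soup_is = BeautifulSoup(requests.get(url, headers=headers, timeout=5).content, 'lxml')
    print(period, '->', url)
```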
CodePudding user response:
Try to select your elements more specifically and use stripped_strings in this case to extract the info from the data rows:
[e.stripped_strings for e in soup.select('[data-test="fin-row"]')]
and the columns:
soup.select_one('div:has(>[data-test="fin-row"])').previous_sibling.stripped_strings
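Both selectors can be tried offline on a minimal stand-in for the page structure (the markup below is hypothetical; it only mirrors the fin-row layout, including an EPS-style cell whose value is not wrapped in a span):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the fin-row layout; note the second row's
# value sits directly in the row div, not inside a <span>.
html = ('<div>'
        '<div><span>Breakdown</span><span>ttm</span></div>'
        '<div>'
        '<div data-test="fin-row"><span>Total Revenue</span><span>204,094,000</span></div>'
        '<div data-test="fin-row"><span>Basic EPS</span>9.70</div>'
        '</div>'
        '</div>')
soup = BeautifulSoup(html, 'html.parser')

# Data rows: stripped_strings collects both <span> text and bare text
rows = [list(e.stripped_strings) for e in soup.select('[data-test="fin-row"]')]

# Column headers: the sibling div directly before the row container
cols = list(soup.select_one('div:has(>[data-test="fin-row"])').previous_sibling.stripped_strings)

print(rows)
print(cols)
```

This is also why the second answer picks up the EPS rows: stripped_strings yields the bare "9.70" text even though it has no surrounding span.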
Example
import pandas as pd
import requests
from bs4 import BeautifulSoup
index = 'MSFT'
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
url = 'https://finance.yahoo.com/quote/MSFT/financials?p=MSFT'
soup = BeautifulSoup(requests.get(url, headers=headers, timeout=5).text, 'lxml')
pd.DataFrame(
    [e.stripped_strings for e in soup.select('[data-test="fin-row"]')],
    columns=soup.select_one('div:has(>[data-test="fin-row"])').previous_sibling.stripped_strings
)
Output
| | Breakdown | ttm | 6/30/2022 | 6/30/2021 | 6/30/2020 | 6/30/2019 |
|---|---|---|---|---|---|---|
0 | Total Revenue | 204,094,000 | 198,270,000 | 168,088,000 | 143,015,000 | 125,843,000 |
1 | Cost of Revenue | 64,984,000 | 62,650,000 | 52,232,000 | 46,078,000 | 42,910,000 |
2 | Gross Profit | 139,110,000 | 135,620,000 | 115,856,000 | 96,937,000 | 82,933,000 |
3 | Operating Expense | 56,295,000 | 52,237,000 | 45,940,000 | 43,978,000 | 39,974,000 |
4 | Operating Income | 82,815,000 | 83,383,000 | 69,916,000 | 52,959,000 | 42,959,000 |
5 | Net Non Operating Interest Income Expense | 423,000 | 31,000 | -215,000 | 89,000 | 76,000 |
6 | Other Income Expense | -650,000 | 302,000 | 1,401,000 | -12,000 | 653,000 |
7 | Pretax Income | 82,588,000 | 83,716,000 | 71,102,000 | 53,036,000 | 43,688,000 |
8 | Tax Provision | 15,139,000 | 10,978,000 | 9,831,000 | 8,755,000 | 4,448,000 |
9 | Net Income Common Stockholders | 67,449,000 | 72,738,000 | 61,271,000 | 44,281,000 | 39,240,000 |
10 | Diluted NI Available to Com Stockholders | 67,449,000 | 72,738,000 | 61,271,000 | 44,281,000 | 39,240,000 |
11 | Basic EPS | - | 9.70 | 8.12 | 5.82 | 5.11 |
12 | Diluted EPS | - | 9.65 | 8.05 | 5.76 | 5.06 |
13 | Basic Average Shares | - | 7,496,000 | 7,547,000 | 7,610,000 | 7,673,000 |
14 | Diluted Average Shares | - | 7,540,000 | 7,608,000 | 7,683,000 | 7,753,000 |
... | ||||||
26 | Net Income from Continuing Operation Net Minority Interest | 67,449,000 | 72,738,000 | 61,271,000 | 44,281,000 | 39,240,000 |
27 | Total Unusual Items Excluding Goodwill | -547,000 | 334,000 | 1,303,000 | 28,000 | 710,000 |
28 | Total Unusual Items | -547,000 | 334,000 | 1,303,000 | 28,000 | 710,000 |
29 | Normalized EBITDA | 99,314,000 | 99,905,000 | 83,831,000 | 68,395,000 | 57,346,000 |
30 | Tax Rate for Calcs | 0 | 0 | 0 | 0 | 0 |
31 | Tax Effect of Unusual Items | -100,269 | 43,420 | 182,420 | 4,620 | 72,420 |
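Note that everything scraped this way is a string: the amounts keep their thousands separators and missing cells show up as "-". A sketch of the numeric cleanup on a toy frame (column names are illustrative):

```python
import pandas as pd

# Toy frame with the string shapes the scraper produces
df = pd.DataFrame({'Breakdown': ['Total Revenue', 'Basic EPS'],
                   'ttm': ['204,094,000', '-']})

# Strip thousands separators; errors='coerce' turns '-' into NaN
df['ttm'] = pd.to_numeric(df['ttm'].str.replace(',', '', regex=False),
                          errors='coerce')
print(df)
```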