How to scrape all values from a table like HTML DIV structure without missing some of them?

Time:02-05

I'm just 3 months into learning Python and I ran into a little problem while building a Yahoo Finance web scraper.

import pandas as pd
from bs4 import BeautifulSoup
import lxml
import requests
import openpyxl



index = 'MSFT'

headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }

url = 'https://finance.yahoo.com/quote/MSFT/financials?p=MSFT'

read_data = requests.get(url,headers=headers, timeout=5)
content = read_data.content
soup_is = BeautifulSoup(content,'lxml')





ls = []
for l in soup_is.find_all('div') and soup_is.find_all('span'):
    ls.append(l.string)


new_ls = list(filter(None,ls))
new_ls = new_ls[45:]

is_data = list(zip(*[iter(new_ls)]*6))
Income_st = pd.DataFrame(is_data[0:])
print(Income_st)

Everything went smoothly until I noticed that the contents of the rows "Diluted EPS" and "Basic EPS" weren't copied. While inspecting the source code I noticed that the EPS values are stored directly in the div tag, if I can put it like that, instead of in a <span>"Value"</span> underneath it.

<div data-test="fin-col"><span>39,240,000</span></div>

<div data-test="fin-col">9.70</div>

Any idea how I can fix the code to get those values out? Also, any idea how I can extract the data separately for the two different pages, "Annually" and "Quarterly"?

I tried changing the tags, attributes etc. but to no avail. :(

CodePudding user response:

To extract the EPS values, select the div elements by their data-test="fin-col" attribute and read the text of the div itself rather than looking for a span inside it, because the EPS cells hold their value directly in the div. Matching on data-test is also less brittle than matching the full utility-class string of the cell. Here's an example:

cell_values = []
# Match on data-test only; the long class string is brittle
fin_cols = soup_is.find_all('div', {'data-test': 'fin-col'})
for div in fin_cols:
    # get_text() returns the value whether it sits in a nested span or directly in the div
    cell_values.append(div.get_text(strip=True))

print(cell_values)
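
To see why reading the div text matters, here is a minimal, self-contained sketch that runs on a hand-written fragment instead of the live page. The fragment reuses the two cell shapes shown in the question; the surrounding fin-row wrapper and the html.parser choice are assumptions made for illustration. It also shows that find_all('div') and find_all('span') evaluates to the span list alone, which would explain why cells without a span went missing.

```python
from bs4 import BeautifulSoup

# Hand-written fragment (assumption): one cell wraps its value in a span,
# the other holds the value directly in the div, as in the question.
html = """
<div data-test="fin-row">
  <div data-test="fin-col"><span>39,240,000</span></div>
  <div data-test="fin-col">9.70</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# `a and b` returns b whenever a is truthy (a non-empty result list is),
# so this expression is just the list of span tags.
combined = soup.find_all("div") and soup.find_all("span")
span_only = [tag.string for tag in combined]

# Reading the text of each value cell covers both cell shapes.
cells = [
    div.get_text(strip=True)
    for div in soup.find_all("div", {"data-test": "fin-col"})
]

print(span_only)
print(cells)
```

Here span_only holds only the value that was wrapped in a span, while cells holds both values.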

Regarding extracting data from different pages, you can change the URL in your requests.get call to the URL of the desired page, then process the data as you did for the original page. Here's an example for the "Annually" page:

url = 'https://finance.yahoo.com/quote/MSFT/financials?p=MSFT&annual'
read_data = requests.get(url,headers=headers, timeout=5)
content = read_data.content
soup_is = BeautifulSoup(content,'lxml')
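
If both views do turn out to be reachable by URL, the fetch step can be wrapped in a small helper and run once per page. The sketch below is hypothetical: fetch_soups, PAGES, HEADERS and the injectable get parameter are names invented for this example, the two URLs are the ones already used in this thread, and whether Yahoo actually serves different data for them is not verified here.

```python
import requests
from bs4 import BeautifulSoup

# Shortened User-Agent (assumption); reuse the full header from the question if needed.
HEADERS = {"User-Agent": "Mozilla/5.0"}

# Labels are arbitrary; the URLs are the two that appear in this thread.
PAGES = {
    "default": "https://finance.yahoo.com/quote/MSFT/financials?p=MSFT",
    "annual": "https://finance.yahoo.com/quote/MSFT/financials?p=MSFT&annual",
}


def fetch_soups(urls, get=requests.get):
    """Fetch every labelled URL and return {label: BeautifulSoup}.

    `get` is injectable so the helper can be exercised without network access.
    """
    soups = {}
    for label, url in urls.items():
        response = get(url, headers=HEADERS, timeout=5)
        soups[label] = BeautifulSoup(response.content, "html.parser")
    return soups


# soups = fetch_soups(PAGES)  # performs real HTTP requests
```

Each returned soup can then go through the same extraction code, so the parsing logic is written only once.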


CodePudding user response:

Try to select your elements more specifically and use stripped_strings to extract the information from the data rows:

[e.stripped_strings for e in soup.select('[data-test="fin-row"]')]

and the columns:

soup.select_one('div:has(>[data-test="fin-row"])').previous_sibling.stripped_strings
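
A minimal sketch of both selectors on a hand-written fragment, so it can be run without hitting the live page. The nesting (a header div directly before the div that holds the rows) is an assumption modelled on the selectors above, and the sample values are taken from the output further down. It uses find_previous_sibling() rather than previous_sibling so that whitespace between the two divs would not get in the way.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page structure (assumption).
html = (
    '<div>'
    '<div><span>Breakdown</span><span>ttm</span><span>6/30/2022</span></div>'
    '<div>'
    '<div data-test="fin-row"><span>Total Revenue</span>'
    '<div data-test="fin-col"><span>204,094,000</span></div>'
    '<div data-test="fin-col"><span>198,270,000</span></div></div>'
    '<div data-test="fin-row"><span>Basic EPS</span>'
    '<div data-test="fin-col">-</div>'
    '<div data-test="fin-col">9.70</div></div>'
    '</div>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# One list of stripped strings per data row, regardless of span nesting.
rows = [list(row.stripped_strings) for row in soup.select('[data-test="fin-row"]')]

# The div whose direct children are the rows; the header sits just before it.
body = soup.select_one('div:has(> [data-test="fin-row"])')
header = list(body.find_previous_sibling().stripped_strings)

print(header)
print(rows)
```

Because stripped_strings walks every text node under the row, the "Basic EPS" values are picked up even though they have no span around them.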

Example

import pandas as pd
import requests
from bs4 import BeautifulSoup

index = 'MSFT'
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }

url = 'https://finance.yahoo.com/quote/MSFT/financials?p=MSFT'

# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(requests.get(url, headers=headers, timeout=5).text, 'lxml')

# One list of cell strings per data row; the header row is the sibling just before the rows container
pd.DataFrame(
    [list(e.stripped_strings) for e in soup.select('[data-test="fin-row"]')],
    columns=list(soup.select_one('div:has(>[data-test="fin-row"])').previous_sibling.stripped_strings)
)

Output

    Breakdown                                                           ttm    6/30/2022    6/30/2021    6/30/2020    6/30/2019
 0  Total Revenue                                               204,094,000  198,270,000  168,088,000  143,015,000  125,843,000
 1  Cost of Revenue                                              64,984,000   62,650,000   52,232,000   46,078,000   42,910,000
 2  Gross Profit                                                139,110,000  135,620,000  115,856,000   96,937,000   82,933,000
 3  Operating Expense                                            56,295,000   52,237,000   45,940,000   43,978,000   39,974,000
 4  Operating Income                                             82,815,000   83,383,000   69,916,000   52,959,000   42,959,000
 5  Net Non Operating Interest Income Expense                       423,000       31,000     -215,000       89,000       76,000
 6  Other Income Expense                                           -650,000      302,000    1,401,000      -12,000      653,000
 7  Pretax Income                                                82,588,000   83,716,000   71,102,000   53,036,000   43,688,000
 8  Tax Provision                                                15,139,000   10,978,000    9,831,000    8,755,000    4,448,000
 9  Net Income Common Stockholders                               67,449,000   72,738,000   61,271,000   44,281,000   39,240,000
10  Diluted NI Available to Com Stockholders                     67,449,000   72,738,000   61,271,000   44,281,000   39,240,000
11  Basic EPS                                                             -         9.70         8.12         5.82         5.11
12  Diluted EPS                                                           -         9.65         8.05         5.76         5.06
13  Basic Average Shares                                                  -    7,496,000    7,547,000    7,610,000    7,673,000
14  Diluted Average Shares                                                -    7,540,000    7,608,000    7,683,000    7,753,000
...
26  Net Income from Continuing Operation Net Minority Interest   67,449,000   72,738,000   61,271,000   44,281,000   39,240,000
27  Total Unusual Items Excluding Goodwill                         -547,000      334,000    1,303,000       28,000      710,000
28  Total Unusual Items                                            -547,000      334,000    1,303,000       28,000      710,000
29  Normalized EBITDA                                            99,314,000   99,905,000   83,831,000   68,395,000   57,346,000
30  Tax Rate for Calcs                                                    0            0            0            0            0
31  Tax Effect of Unusual Items                                    -100,269       43,420      182,420        4,620       72,420