I'm working through some exercises to practice web scraping with Python, and I would like to get the values of the first row ("Total Revenue") of the table on this Yahoo page:
https://finance.yahoo.com/quote/BAC/financials?p=BAC
Looking at the page source, my idea is to find the first occurrence of <div data-test="fin-row"> and get the values, but I'm not sure how to navigate inside that first div. Below is the HTML for the first row:
<div data-test="fin-row">
  <div>
    <div>
      <div title="Total Revenue">
        <button aria-label="Total Revenue">
          <svg width="16" style="stroke-width:0;vertical-align:bottom" height="16" viewBox="0 0 48 48" data-icon="caret-right">
            <path d="M33.447 24.102L20.72 11.375c-.78-.78-2.048-.78-2.828 0-.78.78-.78 2.047 0 2.828l9.9 9.9-9.9 9.9c-.78.78-.78 2.047 0 2.827.78.78 2.047.78 2.828 0l12.727-12.728z"></path>
          </svg>
        </button>
        <span>Total Revenue</span>
      </div>
      <div></div>
    </div>
    <div data-test="fin-col"><span>90,742,000</span></div>
    <div data-test="fin-col"><span>89,113,000</span></div>
    <div data-test="fin-col"><span>85,528,000</span></div>
    <div data-test="fin-col"><span>91,244,000</span></div>
    <div data-test="fin-col"><span>91,247,000</span></div>
  </div>
</div>
<div></div>
In my code I'm using Selenium to process the page. I'm not sure if it's the best way, but with other approaches like urlopen I wasn't able to see the HTML content. I'm able to open the page and click the accept button, but after that I don't know how to navigate inside the first div. I'm currently getting an error: "AttributeError: 'NoneType' object has no attribute 'get_text'"
import requests
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
url="https://finance.yahoo.com/quote/BAC/financials?p=BAC"
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
# Click the accept button
aceitar = driver.find_element(By.NAME, "agree")
aceitar.click()

# Find the div of the revenue row: <div data-test="fin-row">
primeiraLinha = soup.find("div", {"class": ""})
print(primeiraLinha.get_text())
By the way, I think Selenium makes this process very slow.
CodePudding user response:
To get the total revenue, you can try the next example. The class value is empty, so you can select on the attribute data-test="fin-row" instead. The table data is static, which is why you can extract the desired data with just the requests and bs4 packages.
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'}

r = requests.get('https://finance.yahoo.com/quote/BAC/financials?p=BAC', headers=headers)
#print(r)
soup = BeautifulSoup(r.text, 'html.parser')

# the class attributes are empty/unstable, so select on the data-test attribute instead
for row in soup.select('[data-test="fin-row"]')[0:1]:
    total_revenue = row.select_one('div[data-test="fin-col"] > span').text
    print(total_revenue)
Output:
90,742,000
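If you also need the remaining columns of that row, you can walk the fin-col cells the same way. A small extension of the example above, based only on the HTML snippet shown in the question:

first_row = soup.select_one('div[data-test="fin-row"]')
label = first_row.select_one('div[title]')['title']  # "Total Revenue"
values = [span.text for span in first_row.select('div[data-test="fin-col"] > span')]
print(label, values)
# Total Revenue ['90,742,000', '89,113,000', '85,528,000', '91,244,000', '91,247,000']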
CodePudding user response:
Here is a Selenium solution that reads the entire table into a pandas DataFrame.
Imports required
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
Start web driver
# Replace your CHROME DRIVER path here
chrome_path = r"C:\Users\hpoddar\Desktop\Tools\chromedriver_win32\chromedriver.exe"
s = Service(chrome_path)
driver = webdriver.Chrome(service=s)
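Side note: if you are on Selenium 4.6 or newer, Selenium Manager can locate or download a matching driver for you, so the hard-coded path is optional:

# Selenium 4.6+ resolves chromedriver automatically via Selenium Manager
driver = webdriver.Chrome()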
Fetch the page
driver.get('https://finance.yahoo.com/quote/BAC/financials?p=BAC')
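Depending on your region, Yahoo may first show a consent page (the one the question dismisses via By.NAME, "agree"). An optional guard, reusing that locator from the question, clicks it if present and moves on otherwise:

from selenium.common.exceptions import TimeoutException

try:
    # reuse the consent-button locator from the question
    WebDriverWait(driver, 5).until(
        EC.element_to_be_clickable((By.NAME, "agree"))
    ).click()
except TimeoutException:
    pass  # no consent page was shown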
Wait for table to load
# wait until the first data row is visible
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//div[@data-test="fin-row"]')))
Get the header row
# the header cells sit in the table-header-group container; "D(tbhg)" is the
# atomic CSS class Yahoo used for it at the time of writing (the markup may have changed since)
headers_elem = driver.find_elements(By.XPATH, '//div[contains(@class, "D(tbhg)")]/div/div')
col_headers = [header.text for header in headers_elem]
df = pd.DataFrame(columns=col_headers)
df
Empty DataFrame
Columns: [Breakdown, TTM, 12/30/2021, 12/30/2020, 12/30/2019, 12/30/2018]
Index: []
Get the rows from the table
Each row of the table is stored as one element of rows:
rows = driver.find_elements(By.XPATH, '//div[@data-test="fin-row"]')
for row in rows:
    # each fin-row holds six cells: the label div plus the five value columns
    row_values = row.find_elements(By.XPATH, 'div/div')
    df.loc[len(df)] = [row_value.text for row_value in row_values]
Output, which gives us the expected table:

 | Breakdown | TTM | 12/30/2021 | 12/30/2020 | 12/30/2019 | 12/30/2018 |
---|---|---|---|---|---|---|
0 | Total Revenue | 90,742,000 | 89,113,000 | 85,528,000 | 91,244,000 | 91,247,000 |
1 | Credit Losses Provision | 560,000 | 4,594,000 | -11,320,000 | -3,590,000 | -3,282,000 |
2 | Non Interest Expense | 59,763,000 | 59,731,000 | 55,213,000 | 54,900,000 | 53,381,000 |
3 | Special Income Charges | - | - | - | - | 0 |
4 | Pretax Income | 31,539,000 | 33,976,000 | 18,995,000 | 32,754,000 | 34,584,000 |
5 | Tax Provision | 3,521,000 | 1,998,000 | 1,101,000 | 5,324,000 | 6,437,000 |
6 | Net Income Common Stockholders | 26,565,000 | 30,557,000 | 16,473,000 | 25,998,000 | 26,696,000 |
7 | Diluted NI Available to Com Stockholders | 26,565,000 | 30,557,000 | 16,473,000 | 25,998,000 | 26,696,000 |
8 | Basic EPS | - | 3.60 | 1.88 | 2.77 | 2.64 |
9 | Diluted EPS | - | 3.57 | 1.87 | 2.75 | 2.61 |
10 | Basic Average Shares | - | 8,493,300 | 8,753,200 | 9,390,500 | 10,096,500 |
11 | Diluted Average Shares | - | 8,558,400 | 8,796,900 | 9,442,900 | 10,236,900 |
12 | INTEREST_INCOME_AFTER_PROVISION_FOR_LOAN_LOSS | 47,080,000 | 47,528,000 | 32,040,000 | 45,301,000 | 44,150,000 |
13 | Net Income from Continuing & Discontinued Operation | 28,018,000 | 31,978,000 | 17,894,000 | 27,430,000 | 28,147,000 |
14 | Normalized Income | 28,018,000 | 31,978,000 | 17,894,000 | 27,430,000 | 28,147,000 |
15 | Total Money Market Investments | 348,000 | -90,000 | 903,000 | 4,843,000 | 3,176,000 |
16 | Reconciled Depreciation | 1,953,000 | 1,898,000 | 1,843,000 | 1,729,000 | 2,063,000 |
17 | Net Income from Continuing Operation Net Minority Interest | 28,018,000 | 31,978,000 | 17,894,000 | 27,430,000 | 28,147,000 |
18 | Total Unusual Items Excluding Goodwill | - | - | - | - | 0 |
19 | Total Unusual Items | - | - | - | - | 0 |
20 | Tax Rate for Calcs | 0 | 0 | 0 | 0 | 0 |
21 | Tax Effect of Unusual Items | 0 | 0 | 0 | 0 | 0 |
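If you would rather have numbers than strings, here is an optional cleanup with standard pandas (the "-" placeholders become NaN):

# skip the "Breakdown" label column, strip the thousands separators, coerce "-" to NaN
for col in df.columns[1:]:
    df[col] = pd.to_numeric(df[col].str.replace(",", "", regex=False), errors="coerce")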
TL;DR
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
chrome_path = r"C:\Users\hpoddar\Desktop\Tools\chromedriver_win32\chromedriver.exe"
s = Service(chrome_path)
driver = webdriver.Chrome(service=s)
driver.get('https://finance.yahoo.com/quote/BAC/financials?p=BAC')
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//div[@data-test="fin-row"]')))
headers_elem = driver.find_elements(By.XPATH, '//div[contains(@class, "D(tbhg)")]/div/div')
col_headers = [header.text for header in headers_elem]
df = pd.DataFrame(columns=col_headers)
rows = driver.find_elements(By.XPATH, '//div[@data-test="fin-row"]')
for row in rows:
    row_values = row.find_elements(By.XPATH, 'div/div')
    df.loc[len(df)] = [row_value.text for row_value in row_values]
The result is stored in df.
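When you are done, print df to inspect it and release the browser:

print(df)
driver.quit()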