I was using BeautifulSoup to do some web scraping on Yahoo Finance for fun. The goal is to take the HTML file, find the financial data, and place it into an array. I've managed to get the output into this format:
Total Revenue42,965,39136,483,93920,139,65822,588,85825,067,279
How would I split the numbers into millions? For example, we know 42,965,39136,483,939 is actually 42,965,391 and 36,483,939, but how do we code for this? I've tried using regex without success.
import re
from bs4 import BeautifulSoup

with open('Nucor Yahoo HTML.html', 'r') as html:
    content = html.read()
soup = BeautifulSoup(content, 'lxml')
tables = soup.find_all(class_='rw-expnded')
for table in tables:
    pattern = re.compile(r'\d\d\d?,[0-9]{3},[0-9]{3}')
    matches = pattern.finditer(table.text)
    for match in matches:
        print(match)
    print(table.text)
The HTML file is here: https://finance.yahoo.com/quote/NUE/financials?p=NUE
CodePudding user response:
I'd suggest changing the code you use to extract the data to the following, which I think is a lot safer than trying to find the right cut-offs in a string:
get data:
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://finance.yahoo.com/quote/NUE/financials?p=NUE",
                    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'})
parse data and select "Total Revenue" row
soup = BeautifulSoup(resp.text, "html.parser")
total_revenue = [row for row in soup.find("div", {"data-test": "fin-row"}) if "Total Revenue" in row.text]
now you can select your columns and work with them
columns = total_revenue[0].find_all("div", {"data-test": "fin-col"})
for col in columns:
    print(col.text)
output:
42,965,391
36,483,939
20,139,658
22,588,858
25,067,279
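If you need numbers rather than strings, the column texts can be converted once they are separated. This is a minimal sketch: to_int is a hypothetical helper name, the list simply repeats the output shown above, and it assumes every cell is a plain comma-grouped integer (anything else would make int() raise a ValueError).

```python
def to_int(text):
    """Convert a string such as '42,965,391' to the integer 42965391."""
    # Remove the thousands separators before handing the text to int().
    return int(text.replace(",", ""))


# Sample values copied from the output above (no network access needed).
column_texts = ["42,965,391", "36,483,939", "20,139,658", "22,588,858", "25,067,279"]
revenues = [to_int(t) for t in column_texts]
print(revenues)
```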
CodePudding user response:
If you are sure of the format you have, the simplest approach I found is:
s = "42,965,39136,483,93920,139,65822,588,85825,067,279"
end = prev_end = 0
while end < len(s):
    # Find the next comma position. The number ends 8 positions later
    # (comma, 3 digits, comma, 3 digits, plus 1 for the exclusive slice end).
    end = s.index(",", end) + 8
    my_million = s[prev_end:end]
    print(my_million)  # or int(my_million.replace(",", "")) if you want an integer
    prev_end = end
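If the numbers are not all in the millions, a regular expression can do the same job without the fixed offset of 8. This is only a sketch under one assumption: every value is at least 1,000 and written with comma grouping, so each number is 1 to 3 digits followed by one or more groups of a comma and exactly three digits. The name split_fused_numbers is made up for illustration. Values below 1,000, negative values, and decimals are not handled.

```python
import re

# One comma-grouped number: 1-3 leading digits, then one or more groups
# made of a comma followed by exactly three digits.
NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+")


def split_fused_numbers(text):
    """Return the comma-grouped numbers that were concatenated in `text`."""
    # The group is non-capturing, so findall returns the full matches.
    return NUMBER.findall(text)


s = "42,965,39136,483,93920,139,65822,588,85825,067,279"
values = split_fused_numbers(s)
print(values)
print([int(v.replace(",", "")) for v in values])
```

This works because a group after a comma is always exactly three digits, so any digit that follows a complete group must start the next number.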
CodePudding user response:
The string can be parsed by just using splits and join and some filtering:
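origString = 'Total Revenue42,965,39136,483,93920,139,65822,588,85825,067,279'
cols = origString.split(',')
rowName = cols[0][:]  # copy
# keep only the digits of the first chunk; the rest of it is the row name
cols[0] = ''.join(e for e in cols[0] if e.isdigit())
rowName = rowName.replace(cols[0], '')
# a chunk longer than 3 characters holds the end of one number and the start of the next
cols = ','.join([c if len(c) < 4 else (c[:3] + '|' + c[3:]) for c in cols])
colVals = [rowName] + cols.split('|')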
Printing colVals will give you the output:
['Total Revenue', '42,965,391', '36,483,939', '20,139,658', '22,588,858', '25,067,279']
The above method works as long as the commas are always supposed to separate blocks of 3 digits AND there is no more than one text section, which must always be at the beginning and must not contain any numbers or commas. (Not having a text section would be fine; it just ends up with rowName = '', and the first item of colVals will also be ''.)
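To make those conditions easier to check, the same steps can be wrapped in a function and run on a row with and without a text section. This is a sketch only; split_row is a hypothetical name, and the logic is the split/join approach above, unchanged.

```python
def split_row(orig_string):
    """Split a fused row such as 'Total Revenue42,965,39136,483,939'."""
    cols = orig_string.split(',')
    row_name = cols[0]
    # Keep only the digits of the first chunk; what is left is the row name.
    cols[0] = ''.join(ch for ch in cols[0] if ch.isdigit())
    row_name = row_name.replace(cols[0], '')
    # A chunk longer than 3 characters holds the last 3 digits of one
    # number followed by the leading digits of the next one.
    joined = ','.join(c if len(c) < 4 else c[:3] + '|' + c[3:] for c in cols)
    return [row_name] + joined.split('|')


print(split_row('Total Revenue42,965,39136,483,939'))
print(split_row('42,965,39136,483,939'))  # no text section
```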
HOWEVER, I think the best approach is to retrieve the text from the soup in a way that doesn't need further parsing at all:
tables = soup.find_all(class_ = 'fi-row')
print(len(tables), 'rows found:')
print('Breakdown || TTM | 12/31/2021 | 12/31/2020 | 12/31/2019 | 12/31/2018')
print('-------------------------------------------------------------------------------------------')
for table in tables:
    cols = [s.text for s in table.find_all('div', {'data-test': 'fin-col'})]
    rowName = table.find('span').text
    colText = ' | '.join([f'{c:>12}' for c in cols])
    nameTxt = rowName if len(rowName) < 15 else f'{rowName[:12]}...'
    print(f'{nameTxt:15} || {colText}')
(colText and nameTxt are defined just for printing; I just didn't like how the output looked without them.)
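The formatting part can be tried without any scraping. In this sketch, format_row is a hypothetical helper that applies the same f-string logic as the loop above, and the long row name is made up for illustration: values are right-aligned to 12 characters, and names of 15 or more characters are cut to 12 characters plus '...'.

```python
def format_row(row_name, cols):
    """Format one row the same way as the printing loop above."""
    # Right-align every value in a 12-character column.
    col_text = ' | '.join(f'{c:>12}' for c in cols)
    # Truncate long names so the name column stays 15 characters wide.
    name_txt = row_name if len(row_name) < 15 else f'{row_name[:12]}...'
    return f'{name_txt:15} || {col_text}'


print(format_row('Total Revenue', ['42,965,391', '36,483,939']))
# Illustrative long name, not taken from the page.
print(format_row('A Very Long Row Name', ['1,234', '5,678']))
```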
Also, I changed the find_all arguments for the rows to class_ = 'fi-row' because both class_ = 'rw-expnded' and "div", {"data-test": "fin-row"} include collapsible rows that are nested within, so when the outer row is being looped through, it ends up seeming to have extra columns; and class_ = 'rw-expnded' sometimes can't be found at all.