How do I pick out millions from a string of numbers and commas?

I was using BeautifulSoup to do some web scraping on Yahoo Finance for fun. The goal is to take the HTML file, find the financial data and place it into an array. I've managed to get the output into this format:

Total Revenue42,965,39136,483,93920,139,65822,588,85825,067,279

How would I split the numbers into millions? For example, we know that 42,965,39136,483,939 is actually 42,965,391 and 36,483,939, but how do we code for this? I've tried using regex without success.

import re
from bs4 import BeautifulSoup

with open('Nucor Yahoo HTML.html', 'r') as html:
    content = html.read()
soup = BeautifulSoup(content, 'lxml')
tables = soup.find_all(class_='rw-expnded')
for table in tables:
    pattern = re.compile(r'\d\d\d?,[0-9]{3},[0-9]{3}')
    matches = pattern.finditer(table.text)
    for match in matches:
        print(match)
    print(table.text)

The HTML file is here: https://finance.yahoo.com/quote/NUE/financials?p=NUE

CodePudding user response：

I'd suggest changing the code you use to extract the data to the following, which I think is a lot safer than trying to find the right cut-off points in a string:

get data:

import requests
from bs4 import BeautifulSoup
resp = requests.get("https://finance.yahoo.com/quote/NUE/financials?p=NUE",
                    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'})

parse data and select "Total Revenue" row

soup = BeautifulSoup(resp.text, "html.parser")
total_revenue = [row for row in soup.find("div", {"data-test": "fin-row"}) if "Total Revenue" in row.text]

now you can select your columns and work with them

columns = total_revenue[0].find_all("div", {"data-test": "fin-col"})
for col in columns:
    print(col.text)

output:

CodePudding user response：

If you are sure of the format you have, the simplest approach I found is:

s = "42,965,39136,483,93920,139,65822,588,85825,067,279"
end = prev_end = 0
while end < len(s):
    # Find the next comma; each value ends 8 characters after its first comma.
    end = s.index(",", end) + 8
    my_million = s[prev_end:end]
    print(my_million)  # or int(my_million.replace(",", "")) if you want an integer
    prev_end = end
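
Since the question mentions regex, here is a minimal sketch of the same idea as a pattern. It assumes every value contains at least one comma, that each comma is followed by exactly three digits, and that there are no signs or decimal parts. The names GROUPED and split_grouped are only illustrative.

```python
import re

# One grouped number: 1-3 leading digits, then one or more ",ddd" groups.
# The match stops as soon as three digits are not preceded by a comma,
# which is exactly where the next value begins.
GROUPED = re.compile(r"\d{1,3}(?:,\d{3})+")

def split_grouped(text):
    """Return every comma-grouped number found in text, in order."""
    return GROUPED.findall(text)

row = "Total Revenue42,965,39136,483,93920,139,65822,588,85825,067,279"
print(split_grouped(row))
# ['42,965,391', '36,483,939', '20,139,658', '22,588,858', '25,067,279']
```

Unlike the fixed 8-character offset, this does not require every value to have the same number of digits, but a value below 1,000 glued onto its neighbour would still be ambiguous and is skipped by this pattern.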

CodePudding user response：

The string can be parsed by just using splits and join and some filtering:

origString = 'Total Revenue42,965,39136,483,93920,139,65822,588,85825,067,279'

cols = origString.split(',')

rowName = cols[0][:] #copy
cols[0] = ''.join(e for e in cols[0] if e.isdigit())
rowName = rowName.replace(cols[0], '')

cols = ','.join([c if len(c) < 4 else (c[:3] + '|' + c[3:]) for c in cols])
colVals = [rowName] + cols.split('|')

and printing colVals will give you the output

['Total Revenue', '42,965,391', '36,483,939', '20,139,658', '22,588,858', '25,067,279']

The above method works as long as the commas are always supposed to separate blocks of 3 digits AND there is no more than one text section - which must always be at the beginning and not contain any numbers or commas. (Not having a text section would be fine - it just ends up with rowName='', and the first item of colVals will also be '').
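
If you need numbers rather than strings, the split values can be converted afterwards. This is only a sketch: to_int is a hypothetical helper, and it assumes each value is a plain comma-grouped integer like the ones in the list above.

```python
def to_int(grouped):
    """Convert a string such as '42,965,391' to the integer 42965391."""
    return int(grouped.replace(",", ""))

colVals = ['Total Revenue', '42,965,391', '36,483,939',
           '20,139,658', '22,588,858', '25,067,279']

# The first item is the row name; the rest are the column values.
row_name = colVals[0]
values = [to_int(v) for v in colVals[1:]]
print(row_name, values)
```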

HOWEVER, I think the best approach is to retrieve the text from the soup in a way that doesn't need further parsing at all:

tables = soup.find_all(class_ = 'fi-row')

print(len(tables), 'rows found:')
print('Breakdown       ||          TTM |   12/31/2021 |   12/31/2020 |   12/31/2019 |   12/31/2018')
print('-------------------------------------------------------------------------------------------')
for table in tables:
  cols = [s.text for s in table.find_all('div',{'data-test':'fin-col'})]
  rowName = table.find('span').text
  colText = ' | '.join([f'{c:>12}' for c in cols])
  nameTxt = rowName if len(rowName) < 15 else f'{rowName[:12]}...'
  print(f'{nameTxt:15} || {colText}')

(colText and nameTxt are defined just for printing, so that the columns line up.)

Also, I changed the find_all arguments for the rows to class_ = 'fi-row' because both class_ = 'rw-expnded' and "div", {"data-test": "fin-row"} include collapsible rows nested within them. When such an outer row is looped through, it appears to have extra columns. In addition, class_ = 'rw-expnded' sometimes can't be found at all.
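
To illustrate the nesting problem, here is a small sketch on made-up markup. The HTML below is a heavily simplified assumption about the page structure, not the real Yahoo Finance source, and the nested row's name and values are placeholders. col_texts is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an outer "fin-row" wrapper that holds its own
# "fi-row" plus a nested (collapsible) child row.
html = """
<div data-test="fin-row">
  <div class="fi-row">
    <span>Total Revenue</span>
    <div data-test="fin-col">42,965,391</div>
    <div data-test="fin-col">36,483,939</div>
  </div>
  <div data-test="fin-row">
    <div class="fi-row">
      <span>Nested Row</span>
      <div data-test="fin-col">1,000</div>
      <div data-test="fin-col">2,000</div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

def col_texts(tag):
    """Text of every fin-col cell anywhere below tag."""
    return [c.text for c in tag.find_all("div", {"data-test": "fin-col"})]

# The outer wrapper also picks up the nested row's cells.
outer_cols = col_texts(soup.find("div", {"data-test": "fin-row"}))
print(len(outer_cols))  # 4 cells instead of the expected 2

# Selecting by class 'fi-row' gives one entry per actual row.
fi_rows = {r.find("span").text: col_texts(r) for r in soup.find_all(class_="fi-row")}
print(fi_rows)
```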