How can I edit web scraped text data using python?

I'm trying to build my first web scraper to print out how the stock market is doing on Yahoo Finance. I have found out how to isolate the information I want, but the output comes back really sloppy. How can I manipulate this data to present it in a more readable way?

import requests 
from bs4 import BeautifulSoup


#Import your website here
html_text = requests.get('https://finance.yahoo.com/').text

soup = BeautifulSoup(html_text, 'lxml')

#Find the part of the webpage where your information is in
sp_market = soup.find('h3', class_ = 'Maw(160px)').text
print(sp_market)

The output here is: S&P 5004,587.18+65.64(+1.45%)

I want to grab elements such as the label and the percentage and isolate them so I can print them the way I want. Does anyone know how? Thanks so much!

Edit: the printed output is actually on two lines:
S&P 500
4,587.18+65.64(+1.45%)

CodePudding user response：

For simple splitting you could use the built-in .split(separator) method (for example, first split by 'x', then by 'y', then by 'z', with x, y, z being separators). This is not very efficient, so if the text follows a more complex pattern that looks the same for different elements (here: stocks), take a look at Python's re (regular expression) module.

import re
string = "Stock +45%"
# A word, a space, an optional sign, then two digits
pattern = r'[A-Za-z]+ [+-]?[0-9][0-9]'

Then consider using a function like re.findall or re.search.
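
As a rough sketch of the regex route, assuming the scraped text looks like the two-line sample from the question and carries explicit +/- signs, re.search with named groups can pull out each field in one pass. The pattern and the group names (label, price, change, percent) are my own choices for illustration, not something the page guarantees.

```python
import re

# Sample text in the shape shown in the question (assumed, not fetched live)
text = "S&P 500\n4,587.18+65.64(+1.45%)"

# Hypothetical pattern: label on line one, then price, signed change, signed percent
pattern = re.compile(
    r"(?P<label>.+)\n"
    r"(?P<price>[\d,]+\.\d+)\s*"
    r"(?P<change>[+-]?[\d,]+\.\d+)\s*"
    r"\(\s*(?P<percent>[+-]?[\d.]+)%\)"
)

match = pattern.search(text)
# groupdict() maps each group name to the matched string
fields = match.groupdict() if match else {}
print(fields)
```

The values are still strings at this point, so they can be printed as they are or converted with float() after removing the comma.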

CodePudding user response：

I assume that the format is always S&P 500\n[number][+/-][number]([+/-][number]%).

If that is the case, we could do the following.

import re

# [your existing code]

# e.g.
# sp_market = 'S&P 500\n4,587.18+65.64(+1.45%)'

label, line2 = sp_market.split('\n')
# Signs in order of appearance: first for the change, second for the percent
pm = re.findall(r"[+-]", line2)
# Split on runs of sign, parenthesis and percent characters
total, change, percent, _ = re.split(r"[+\-()%]+", line2)
total = float(total.replace(',', ''))
change = float(change)
if pm[0] == '-':
    change = -change
percent = float(percent)
if pm[1] == '-':
    percent = -percent

print(label, total, change, percent)
# S&P 500 4587.18 65.64 1.45
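
Once the values are plain floats, Python's format specifiers can put the sign and the thousands separator back for display. This is only a sketch: the values are hard-coded from the sample output above so the snippet runs on its own, and the layout of the summary line is an arbitrary choice.

```python
# Values as printed in the example above (hard-coded so the snippet runs alone)
label, total, change, percent = "S&P 500", 4587.18, 65.64, 1.45

# ',' adds thousands separators, '+' forces an explicit sign, '.2f' keeps two decimals
summary = f"{label}: {total:,.2f} ({change:+,.2f} / {percent:+.2f}%)"
print(summary)
# S&P 500: 4,587.18 (+65.64 / +1.45%)
```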

CodePudding user response：

I'm not sure, since the question does not provide an expected result, but you can "isolate" the information with stripped_strings.

This will give you a list of "isolated" values you can process:

list(soup.find('h3', class_ = 'Maw(160px)').stripped_strings)

#Output
['S&P 500', '4,587.18', '+65.64', '(+1.45%)']

For example, stripping the characters "()%":

[x.strip('()%') for x in soup.find('h3', class_ = 'Maw(160px)').stripped_strings]

#Output
['S&P 500', '4,587.18', '+65.64', '+1.45']

The simplest way to print the data less sloppily is to join() the values with a space:

' '.join([x.strip('()%') for x in soup.find('h3', class_ = 'Maw(160px)').stripped_strings])

#Output
S&P 500 4,587.18 +65.64 +1.45

You can also create a dict() and print the key/value pairs:

fields = [x.strip('()%') for x in soup.find('h3', class_ = 'Maw(160px)').stripped_strings]
for k, v in dict(zip(['Symbol','Last Price','Change','% Change'], fields)).items():
    print(f'{k}: {v}')

#Output
Symbol: S&P 500
Last Price: 4,587.18
Change: +65.64
% Change: +1.45
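
If the values are needed as numbers rather than strings, a small follow-up step can convert them. The sketch below starts from a hard-coded list in the shape that stripped_strings is shown returning above, so it runs without a network request. The helper name to_number and the variable names are made up for this example.

```python
# Hard-coded stand-in for list(h3.stripped_strings); assumed shape, not fetched live
raw = ['S&P 500', '4,587.18', '+65.64', '(+1.45%)']

def to_number(text):
    """Drop parentheses, percent signs and thousands separators, then parse as float."""
    return float(text.strip('()%').replace(',', ''))

keys = ['Symbol', 'Last Price', 'Change', '% Change']
# Keep the label as a string and convert the remaining three fields
quote = dict(zip(keys, [raw[0]] + [to_number(x) for x in raw[1:]]))

for k, v in quote.items():
    print(f'{k}: {v}')
```

float() accepts a leading + or - sign, so negative changes come through as negative numbers without extra handling.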