Basic question about parsing HTML using bs4 in Python

Time:02-13

I have a probably simple question about bs4 that I can't seem to figure out.

For reference, I am self-taught and am troubleshooting my way through learning Python.

Part of a bigger project I'm working on requires me to scrape a website for the most up-to-date rate on a 1-month T-bill. I have most of it working, but I'm stuck on one aspect.

The data only updates Monday through Friday. Running this code at, say, 8 am before the site has been updated for the day, or on a weekend, returns an error. When I use a date that has already been posted, I get exactly the data I need.

So I have set the variables d1, d2 and d3 to today, yesterday, and two days ago. I want soup.find to search for today's date and, if nothing is found, fall back to yesterday and then to two days ago.

In my code, if I use text=d3 on its own, for example, I get a value returned.

Here's what I have right now. I would really appreciate some help!

from bs4 import BeautifulSoup
import requests
from datetime import date
import datetime

today = date.today()
d1 = today.strftime("%B %d, %Y")
ndays1 = datetime.timedelta(days = 1)
d2 = (today-ndays1).strftime("%B %d, %Y")
ndays2 = datetime.timedelta(days = 2)
d3 = (today-ndays2).strftime("%B %d, %Y")
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'DNT': '1',  # Do Not Track Request Header
    'Connection': 'close'
}

url_rfr = "https://ycharts.com/indicators/1_month_treasury_rate"

response = requests.get(url_rfr, headers=headers, timeout=5).text
soup = BeautifulSoup(response, 'html.parser')

div = soup.find("td", text=d1 or d2 or d3).find_next_sibling("td").text.strip()

r = (float(div[:-1]))

print(r)
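
A note on the expression d1 or d2 or d3 in the snippet above: Python evaluates it before find is ever called, and because d1 is a non-empty string the result is always d1. The other two dates are never searched, so when today's row is missing find returns None and the chained .find_next_sibling call fails. The sketch below shows that behaviour and one possible fallback pattern. The helper names candidate_dates and first_found are hypothetical (not part of bs4 or the original code), and a plain dict stands in for the page lookup.

```python
from datetime import date, timedelta


def candidate_dates(today, days_back=2, fmt="%B %d, %Y"):
    """Return formatted date strings for today, yesterday, ... (newest first)."""
    return [(today - timedelta(days=n)).strftime(fmt) for n in range(days_back + 1)]


def first_found(candidates, lookup):
    """Try each candidate in order; return (candidate, result) for the first hit."""
    for candidate in candidates:
        result = lookup(candidate)
        if result is not None:
            return candidate, result
    return None, None


# A fixed date keeps the example reproducible; real code would use date.today().
d1, d2, d3 = candidate_dates(date(2022, 2, 13))

# `or` returns its first truthy operand, so this is just d1.
chosen = d1 or d2 or d3
print(chosen == d1)

# Stand-in for the scraped table: only the oldest candidate date has a row.
fake_table = {d3: "0.03%"}
matched_date, matched_value = first_found([d1, d2, d3], fake_table.get)
print(matched_date, matched_value)
```

Note also that %d zero-pads the day (for example "05"), so the formatted string only matches if the page writes its dates the same way.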

Answer:

I changed the text argument in find(...) to "Last Value", so the lookup no longer depends on the date, and also scraped latest_period for completeness.

import datetime
from datetime import date

import requests
from bs4 import BeautifulSoup

today = date.today()
d1 = today.strftime("%B %d, %Y")
ndays1 = datetime.timedelta(days=1)
d2 = (today - ndays1).strftime("%B %d, %Y")
ndays2 = datetime.timedelta(days=2)
d3 = (today - ndays2).strftime("%B %d, %Y")

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'DNT': '1',  # Do Not Track Request Header
    'Connection': 'close'
}

url_rfr = "https://ycharts.com/indicators/1_month_treasury_rate"

response = requests.get(url_rfr, headers=headers, timeout=5).text

soup = BeautifulSoup(response, 'html.parser')
latest_period = soup.find("td", text="Latest Period").find_next_sibling("td").text.strip()
value = soup.find("td", text="Last Value").find_next_sibling("td").text.strip()

val = (float(value[:-1]))

print(latest_period, val)  # Feb 11 2022 0.03
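
If you do want the date-based fallback described in the question, one option is to loop over the candidate dates and stop at the first one that is present. This is only a sketch: the function name rate_for_first_date is hypothetical, and SAMPLE_HTML is a made-up fragment standing in for the real page, whose actual markup may differ. It uses string=, the current name for the text= argument in bs4.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration only; the real page layout is an assumption.
SAMPLE_HTML = """
<table>
  <tr><td>February 11, 2022</td><td>0.03%</td></tr>
  <tr><td>February 10, 2022</td><td>0.04%</td></tr>
</table>
"""


def rate_for_first_date(soup, date_strings):
    """Return (date, rate) for the first date that has a row, else None."""
    for d in date_strings:
        cell = soup.find("td", string=d)
        if cell is None:
            continue  # no row for this date yet, try the next one
        sibling = cell.find_next_sibling("td")
        if sibling is not None:
            # Drop the trailing percent sign before converting to float.
            return d, float(sibling.get_text(strip=True).rstrip("%"))
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
result = rate_for_first_date(
    soup,
    ["February 13, 2022", "February 12, 2022", "February 11, 2022"],
)
print(result)
```

Checking for None before calling find_next_sibling is what avoids the error on weekends and early mornings. If no candidate matches, the function returns None and the caller can decide what to do.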