BeautifulSoup saying tag has no attributes while looking for sibling or parent tags

Time: 01-22

I'm attempting to extract the tax bill for the following web page (http://www.sarasotataxcollector.com/ecomm/proc.php?r=eBillingInvitation&a=0429120051).

The tax bill is the $8,084.54 value directly following the Taxes & Assessments string.

I need to anchor on some static element, because the code will be run over multiple pages.

The "Taxes & Assessments" string is a constant between all pages and always precedes the full tax bill, while the tax bill changes between pages.

My thought was that I could find the "Taxes & Assessments" string, then traverse the BeautifulSoup tree to find the tax bill. This is my code:

soup = BeautifulSoup(html_content,'html.parser') #Soupify the HTML content

tagTandA = soup.body.find(text = "Taxes & Assessments")

taxBill = tagTandA.find_next_sibling.text

This returns an error of:

AttributeError: 'NoneType' object has no attribute 'find_next_sibling'

In fact, calling parent, next_sibling, find_next_sibling, or anything of the sort raises this same "object has no attribute" error.

I have tried looking for other explicit text, just to test that it's not this specific text that is giving me an issue, and the no attribute error is still thrown.

When running just the following code, it returns None:

tagTandA = soup.body.find(text = "Taxes & Assessments")

How can I find the "Taxes & Assessments" tag in order to navigate the tree to find and return the Tax Bill?

CodePudding user response:

If I'm not mistaken, you're trying to use a requests- and bs4-based solution to scrape a (very) JS-heavy website, with a redirect and some iframes.

I don't think it will work.
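The AttributeError itself is just a symptom: `find()` returned `None` because the target text isn't in the HTML that requests received, and calling `.find_next_sibling` on `None` fails. A minimal sketch of that failure mode, using made-up HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical page where the text we search for is simply absent,
# just as it is absent from the HTML requests gets back for this site
html = "<body><p>Some other text</p></body>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.body.find(string="Taxes & Assessments")
print(tag)  # None

# Navigating from a missing match raises the AttributeError from the
# question; guard against None instead
result = tag.find_next_sibling() if tag is not None else "not found"
print(result)  # not found
```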

Here is one way of getting that information using Selenium (you can improve that hardcoded wait if you want, I just didn't have time to fiddle):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")

webdriver_service = Service("chromedriver_linux64/chromedriver") ## path to where you saved chromedriver binary
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(driver, 25)

url = 'http://www.sarasotataxcollector.com/ecomm/proc.php?r=eBillingInvitation&a=0429120051'
driver.get(url)
t.sleep(15)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@name="body"]')))
total_taxes = wait.until(EC.element_to_be_clickable((By.XPATH, "//font[contains(text(), 'Taxes & Assessments')]/ancestor::td/following-sibling::td")))
print('Tax bill: ', total_taxes.text)

Result in terminal:

Tax bill:  $8,084.54 

See Selenium documentation for more details.

CodePudding user response:

A very kind person (u/commandlineuser) answered the question in BS code here: https://www.reddit.com/r/learnpython/comments/10hywbs/beautifulsoup_saying_tag_has_no_attributes_while/

Here's said code:

import re
import requests
from bs4 import BeautifulSoup

url = "http://www.sarasotataxcollector.com/ecomm/proc.php?r=eBillingInvitation&a=0429120051"

r1 = requests.get(url)
soup1 = BeautifulSoup(r1.content, "html.parser")

base = r1.url[:r1.url.rfind("/") + 1]
href1 = soup1.find("frame").get("src")

r2 = requests.get(base + href1)
soup2 = BeautifulSoup(
    r2.content
      .replace(b"<!--", b"")  # basic attempt at stripping comments
      .replace(b"-->", b""),
    "html.parser"
)

href2 = soup2.find("voicemax").get_text(strip=True)
r3 = requests.get(base + href2)
soup3 = BeautifulSoup(r3.content, "html.parser")

total = (
    soup3.find(text=re.compile("Taxes & Assessments"))
         .find_next()
         .get_text(strip=True)
)

print(total)
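One detail worth noting about that code: `find(string=...)` (or the older `text=...`) with a plain string only matches a text node whose entire contents equal that string, so stray whitespace around the text makes it return None; the `re.compile` search matches on a substring instead. A small illustration with made-up HTML:

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment: note the trailing space inside the first cell
html = "<tr><td>Taxes &amp; Assessments </td><td>$8,084.54</td></tr>"
soup = BeautifulSoup(html, "html.parser")

# Exact-match search fails because of the trailing space
print(soup.find(string="Taxes & Assessments"))  # None

# Substring match via a compiled regex succeeds
label = soup.find(string=re.compile("Taxes & Assessments"))
print(label.find_next("td").get_text(strip=True))  # $8,084.54
```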