Why beautifulsoup4 doesn't find strings inside nested tags?


I want to grab the string "AKA" from a listing site, but the find_all function fails to return any values.

import requests
from bs4 import BeautifulSoup

# Set the URL you want to scrape
url = 'https://classified.azcentral.com/azcentral-marketplace/category/Legals/Maricopa County'

# Use requests to get the contents
r = requests.get(url)

# Get the text of the contents
html_content = r.text

# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, "html.parser")

# Find all the sections containing the string "SHERIFF'S NOTICE OF SALE OF REAL PROPERTY"
sections = soup.find_all(string="NOTICE OF SALE")
print(sections)

I searched through previous answers and spent about an hour trying their solutions, but none have worked so far. I've also read the documentation for the string argument, but perhaps I don't understand it.

I expect there to be 15 "AKA" strings, but zero show up no matter what I do. Python 3 on Ubuntu 18.04.

CodePudding user response:

The following is one way of getting that information:

import requests
from bs4 import BeautifulSoup as bs
import re

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
          }

r = requests.get('https://classified.azcentral.com/azcentral-marketplace/category/Legals/Maricopa County', headers=headers)
soup = bs(r.text, 'html.parser')
infos = soup.find_all(string=re.compile('AKA: '))  # "string" is the newer name for the old "text" argument
print('total listings', len(infos))
for i in infos:
    if 'NOTICE TO JUDGMENT' not in i:
        print(i.split('AKA: ')[1].split('NOW, THEREFORE, PUBLIC NOTICE')[0])
    else:
        print(i.split('AKA: ')[1].split('NOTICE TO JUDGMENT')[0])

Result in terminal:

total listings 16
12217 W Chase Ln, Avondale, Arizona.  
1905 North 77th Avenue, Phoenix, Arizona 85035.  
15785 W Calavar Road, Surprise, Arizona.  
2911 East Michigan Avenue, Phoenix, AZ 85032  
4438 West Saint Kateri Drive, Laveen, Arizona 85339.  
9694 East Ironwood Drive, Scottsdale, AZ 85258  
4055 East Blanche Drive, Phoenix, AZ 85032  
11145 E Sombra Avenue, Mesa, Arizona.  
18427 West Vogel Avenue, Waddell, Arizona 85355. 
18399 N. 59th Drive, Glendale, AZ. 
23922 West Desert Bloom Street Buckeye, AZ 85326. 
384 East Nunneley Road, Gilbert, Arizona 85296. 
10609 West Monte Vista Road, Avondale, Arizona 85392. 
8520 West Palm Lane, Unit #1108, Phoenix, Arizona 85037. 
13603 W. Catalina Drive, Avondale, AZ 85392. 
3300 South Nash Way Chandler, Arizona 85249. 

The BeautifulSoup documentation is quite comprehensive.

CodePudding user response:

As shown in the other answers, there are different approaches to reach your goal; let's take a look:

Beautiful Soup will find all strings that exactly match your value for string.

Because it is an exact match, it won't work in your case, and you have to use a regex:

import re
soup.find(string=re.compile("AKA:"))

Alternatively, use CSS selectors with the pseudo-class :-soup-contains() (in both cases, be as specific as possible); here the selector focuses on the <span> elements with class description:

soup.select('.description:-soup-contains("AKA:")') 
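To see the difference between the exact match, the regex, and the CSS pseudo-class, here is a minimal sketch on a toy document (the class name description mirrors the selector above, not necessarily the live page markup):

import re
from bs4 import BeautifulSoup

html = '<div><span class="description">... AKA: 123 Main St, Phoenix ...</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# Exact match: the whole string would have to equal "AKA:", so nothing is found.
print(soup.find_all(string="AKA:"))              # []

# Regex: matches any string that contains "AKA:".
print(soup.find_all(string=re.compile("AKA:")))  # ['... AKA: 123 Main St, Phoenix ...']

# CSS selector: matches the <span> whose text contains "AKA:".
print(soup.select('.description:-soup-contains("AKA:")'))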

To reach your goal without importing re, split on the pattern in the text:

import requests
from bs4 import BeautifulSoup
url = 'https://classified.azcentral.com/azcentral-marketplace/category/Legals/Maricopa County'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

[
    e.text.split('AKA: ')[1].split('NO')[0].strip()
    for e in soup.select('.description:-soup-contains("AKA:")')
]

To reach your goal with re, match the pattern directly against the page text:

import requests, re
from bs4 import BeautifulSoup
url = 'https://classified.azcentral.com/azcentral-marketplace/category/Legals/Maricopa County'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
re.findall(r'AKA: (.*?)(?=\s*\b[A-Z]{2}|$)', soup.text)

Both will give you a list:

['12217 W Chase Ln, Avondale, Arizona.',
 '1905 North 77th Avenue, Phoenix, Arizona 85035.',
 '15785 W Calavar Road, Surprise, Arizona.',
 '2911 East Michigan Avenue, Phoenix, AZ 85032',
 '4438 West Saint Kateri Drive, Laveen, Arizona 85339.',
 '9694 East Ironwood Drive, Scottsdale, AZ 85258',
 '4055 East Blanche Drive, Phoenix, AZ 85032',
 '11145 E Sombra Avenue, Mesa, Arizona.',
 '18427 West Vogel Avenue, Waddell, Arizona 85355.',
 '18399 N. 59th Drive, Glendale, AZ.',
 '23922 West Desert Bloom Street Buckeye, AZ 85326.',
 '384 East Nunneley Road, Gilbert, Arizona 85296.',
 '10609 West Monte Vista Road, Avondale, Arizona 85392.',
 '8520 West Palm Lane, Unit #1108, Phoenix, Arizona 85037.',
 '13603 W. Catalina Drive, Avondale, AZ 85392.',
 '3300 South Nash Way Chandler, Arizona 85249.']

CodePudding user response:

Using find_all() with string only matches strings whose entire text equals the search value, so it misses strings where the phrase appears inside a longer block of text. You could broaden the search to find <div> tags whose text mentions the phrase of interest, but the problem with that is that it will also match the <div> which contains the entire page.
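To illustrate both problems on a toy document (just a sketch, not the real page markup):

from bs4 import BeautifulSoup

html = """
<div id="page">
  <div class="ad"><p>SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ...</p></div>
  <div class="ad"><p>NOTICE OF CALL FOR BIDS ...</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The exact match finds nothing, because no single string equals the phrase.
print(soup.find_all(string="NOTICE OF SALE"))   # []

# Broadening to "any <div> whose text mentions the phrase" also catches the outer page <div>.
matches = [d for d in soup.find_all("div") if "NOTICE OF SALE" in d.get_text()]
print([d.get("id") or d.get("class") for d in matches])   # ['page', ['ad']]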

Instead, I would suggest using CSS classes. Looking at the HTML of that page, the .panel-body class shows up on each ad. This code searches for all matches for .panel-body:

for section in soup.find_all("div", class_="panel-body"):
    print(section.text.strip()[:80])  # print just the first 80 characters of each match

Output:

MarketPlace is where you can find anything you need! Simply choose a category fo
MARICOPA COUNTY NOTICE OF CALL FOR BIDS   NOTICE IS HEREBY GIVEN that sealed bid
CV2021-051400 C22011672 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
NO. PB2016-051918 NOTICE OF INITIAL HEARING  REGARDING: PETITION FOR  APPROVAL O
CV2022-003436 C22011714 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
CV2021011535 C22011653 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXEC
CV2022-091920 C22011708 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
CV2020-055896 C22011668 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
CV2022-050418 C22011669 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
CV2020-009284 C22011711 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
CV2021-014484 C22011666 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
NO. PB2022-050058 NOTICE TO CREDITORS (PUBLICATION) (Assigned to Honorable Vanes
CV2021015245 C22011660 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXEC
Case No. PB1992-004227 NOTICE OF INITIAL HEARING  REGARDING: PETITION TO  TERMIN
Case No. PB2020-005222 NOTICE OF INITIAL HEARING  REGARDING: PETITION TO  TERMIN
Case No. PB2020-000142 NOTICE OF INITIAL HEARING  REGARDING:PETITION TO  TERMINA
Case No. PB2021-005139 NOTICE OF INITIAL HEARING  REGARDING: PETITION TO  TERMIN
CV2022-010475 C22011118 SHERIFF'S NOTICE OF SALE OF REAL ESTATE ON EXECUTION  IN
Case No. PB2022-005749 NOTICE OF INITIAL HEARING  REGARDING: PETITION FOR  APPOI
CV2022-001756 C22010874 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
CV2022-001946 C22010896 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
Case No. PB2015-003466 NOTICE OF INITIAL HEARING  REGARDING: PETITION TO  TERMIN
Case No. PB2016-001049 NOTICE OF INITIAL HEARING  REGARDING: PETITION FOR APPROV
CV2021-093163 C22010867 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
CV2022-051687 C22010863 SHERIFF'S NOTICE OF SALE OF REAL PROPERTY ON SPECIAL EXE
Case No. PB2022-005813 NOTICE OF INITIAL HEARING  REGARDING: PETITION FOR APPOIN

Hmmm, that looks mostly right, except for the first one. There's a piece of text up top which also uses the same CSS class. You can filter that out by always dropping the first match:

for section in soup.find_all("div", class_="panel-body")[1:]:
    print(section.text.strip()[:80])

Or you can leave it. The next step will get rid of it anyway.

Next, you only care about the ones that contain "NOTICE OF SALE".

for section in soup.find_all("div", class_="panel-body"):
    if "NOTICE OF SALE" in section.text:
        print(section.text.strip()[:80])

Next, you probably want to save the full ad as a string.

notice_of_sale_ads = []
for section in soup.find_all("div", class_="panel-body"):
    if "NOTICE OF SALE" in section.text:
        notice_of_sale_ads.append(section.text.strip())

When I run this, I get 14 matches. (Slightly different from the 15 you expected, but I get the same number in a browser.)
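Putting the steps together, a complete sketch of this approach might look like the following (assuming the page still serves the same markup and needs no extra request headers):

import requests
from bs4 import BeautifulSoup

url = 'https://classified.azcentral.com/azcentral-marketplace/category/Legals/Maricopa County'
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Collect the full text of every ad that mentions "NOTICE OF SALE".
notice_of_sale_ads = []
for section in soup.find_all("div", class_="panel-body"):
    text = section.text.strip()
    if "NOTICE OF SALE" in text:
        notice_of_sale_ads.append(text)

print(len(notice_of_sale_ads), "matching ads")
for ad in notice_of_sale_ads:
    print(ad[:80])  # first 80 characters of each ad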
