Hello everyone, I'm trying to pull specific text from a website. Not all of the text is needed, and I'm confused about how to do this when the text is spread across multiple divs with multiple rows inside. Here is the HTML I'm looking at. I need to pull the "Number" title and its text (837270), and the "Location" title and its text (Ohio).
<br>
<br>
</p>
</div>
</div>
<div >
<div >
<p>
<span >Number</span>
<br>
"837270"
</p>
</div>
<div >
<p>
<span >Location</span>
<br>
"Ohio"
</p>
</div>
<div >
<p>
<span >Office</span>
<br>
"Joanna"
</p>
</div>
</div>
<div >
<div >
<p>
<span >Date</span>
<br>
"07/01/2022"
</p>
</div>
<div >
<p>
<span >Type</span>
<br>
"Business"
</p>
</div>
<div >
<p>
<span >Status</span>
<br>
"Open"
</p>
</div>
</div>
</div>
</div>
</div>
I've tried this, and it prints None.
soup = BeautifulSoup(driver.page_source,'html.parser')
df = soup.find('div', id = "Location")
print(df.string)
I want to pull these values and save them. Any help would be appreciated, thank you.
CodePudding user response:
Sometimes HTML won't have IDs or other patterns that are easy to follow. You can get pretty clever with this, though; you don't have to rely on HTML pages using table structures.
In this case, for example, it appears each section is titled by a <span> tag, and its value is the last sibling of that span tag.
To scrape each of these titles and their values, we can do something like this:
import bs4
from bs4 import BeautifulSoup

soup = BeautifulSoup(..., 'lxml')  # pass the page HTML in place of ...

for title_tag in soup.find_all('span', class_='text-muted'):
    # get the last sibling
    *_, value_tag = title_tag.next_siblings
    title = title_tag.text.strip()
    if isinstance(value_tag, bs4.element.Tag):
        value = value_tag.text.strip()
    else:  # it's a navigable string element
        value = value_tag.strip()
    print(title, value)
Output:
Number "837270"
Location "Ohio"
Office "Joanna"
Date "07/01/2022"
Type "Business"
Status "Open"
There are of course other patterns you could identify here to reliably get the values. This is just one example.
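As one example of another pattern: each title/value pair sits in its own <p>, so you could walk the paragraphs and read their text fragments instead. This is only a minimal sketch, assuming the markup has the same shape as the snippet in the question. The `html` string is a trimmed stand-in for the real page source, `pairs` is just an illustrative name, and `html.parser` is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the real page source
# (assumption: same shape as the HTML in the question).
html = """
<div>
  <div><p><span>Number</span><br>"837270"</p></div>
  <div><p><span>Location</span><br>"Ohio"</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = {}
for p in soup.find_all("p"):
    # stripped_strings yields every non-empty text fragment inside the <p>,
    # so the first fragment is the title and the last one is the value.
    parts = list(p.stripped_strings)
    if len(parts) >= 2:
        # Drop the literal quote characters around the value, if present.
        pairs[parts[0]] = parts[-1].strip('"')

print(pairs)
```

Building a dict this way makes it easy to pick out only the fields you care about afterwards, for example `pairs["Number"]` and `pairs["Location"]`.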
If you wanted to get just the Location, you could locate it by its text.
location_tag = soup.find('span', class_='text-muted', string='Location')
Then getting its value works the same way as above.
*_, location_value_element = location_tag.next_siblings
print(location_value_element.strip()) # "Ohio"
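Since the question also asks about saving the result, here is a small sketch using only the standard library. It assumes you have already collected the title/value pairs into a dict; the `scraped` dict, the `wanted` list, and the `output.csv` filename mentioned in the comment are all illustrative and not from the original post. It keeps just Number and Location and writes them as one CSV row.

```python
import csv
import io

# Hypothetical result of the scraping loop above
# (the values still carry the literal quote characters from the page).
scraped = {"Number": '"837270"', "Location": '"Ohio"', "Office": '"Joanna"'}
wanted = ["Number", "Location"]

# Keep only the wanted fields and drop the surrounding quote characters.
row = {key: scraped[key].strip('"') for key in wanted}

# Written to an in-memory buffer here; for a real file you could use
# open("output.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=wanted)
writer.writeheader()
writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```

If you scrape several pages, you could call `writer.writerow` once per page so that each page becomes one row in the same file.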