Why is this text attribute breaking my BeautifulSoup function?

Time:06-24

I'm new to BeautifulSoup, so I'm practicing web scraping on this website, and the .text attribute keeps breaking the .find() call. This is the code:

from bs4 import BeautifulSoup
import requests

url = 'https://montanahistoriclandscape.com/tag/glasgow-montana/'
page = requests.get(url)

soup = BeautifulSoup(page.text, 'lxml')

article = soup.find('article')

first_p = article.find('div', class_='entry-content').p.text
print(first_p)

The code runs fine if I remove .text from the end of the first_p line, but then it prints the paragraph as raw HTML. When I add .text back, the output is empty.

Does anyone know what's going on here? I feel like I'm looking right at it but can't figure it out. Any help would be appreciated!

CodePudding user response:

This is the HTML that first_p holds when you leave off .text:

<p><img alt="Valley Co Glasgow courthouse"  data-attachment-id="18200" data-comments-opened="1" data-image-caption="" data-image-description="" data-image-meta='{"aperture":"10","credit":"","camera":"Canon EOS REBEL T2i","caption":"","created_timestamp":"946684800","copyright":"","focal_length":"24","iso":"100","shutter_speed":"0.005","title":"Valley Co Glasgow courthouse","orientation":"1"}' data-image-title="Valley Co Glasgow courthouse" data-large-file="https://carrollvanwest.files.wordpress.com/2019/03/img_7957.jpg?w=584" data-medium-file="https://carrollvanwest.files.wordpress.com/2019/03/img_7957.jpg?w=300" data-orig-file="https://carrollvanwest.files.wordpress.com/2019/03/img_7957.jpg" data-orig-size="5083,2348" data-permalink="https://montanahistoriclandscape.com/2019/03/03/eastern-montana-county-seats-glasgow/valley-co-glasgow-courthouse/" sizes="(max-width: 584px) 100vw, 584px" src="https://carrollvanwest.files.wordpress.com/2019/03/img_7957.jpg?w=584" srcset="https://carrollvanwest.files.wordpress.com/2019/03/img_7957.jpg?w=584 584w, https://carrollvanwest.files.wordpress.com/2019/03/img_7957.jpg?w=1168 1168w, https://carrollvanwest.files.wordpress.com/2019/03/img_7957.jpg?w=150 150w, https://carrollvanwest.files.wordpress.com/2019/03/img_7957.jpg?w=300 300w, https://carrollvanwest.files.wordpress.com/2019/03/img_7957.jpg?w=768 768w, https://carrollvanwest.files.wordpress.com/2019/03/img_7957.jpg?w=1024 1024w"/></p>

There is no text in that <p> tag, only an <img> tag, so .text returns an empty string and print() shows a blank line. Nothing is wrong with .find().
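
You can check this without hitting the site. Below is a minimal sketch that uses a shortened, made-up stand-in for the page's markup (the html string and the abbreviated src value are assumptions for illustration, not the real page), and the built-in 'html.parser' instead of 'lxml' so it runs without extra installs:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page: the first <p>
# holds only an image, the second holds actual text.
html = (
    '<div class="entry-content">'
    '<p><img alt="Valley Co Glasgow courthouse" src="img_7957.jpg"/></p>'
    '<p>Glasgow is the seat of Valley County.</p>'
    '</div>'
)

soup = BeautifulSoup(html, 'html.parser')
first_p = soup.find('div', class_='entry-content').p

# repr() makes the empty string visible instead of printing a blank line
print(repr(first_p.text))

# The image's alt attribute is still reachable if that is what you want
print(first_p.img['alt'])
```

Wrapping the value in repr() is a handy habit while debugging scrapers, since an empty string and "no output at all" look the same with a plain print().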

CodePudding user response:

There are multiple <p> tags inside that <div>, and not all of them contain text. You could get all the text as follows:

from bs4 import BeautifulSoup
import requests

url = 'https://montanahistoriclandscape.com/tag/glasgow-montana/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
article = soup.find('article')
div_entry = article.find('div', class_='entry-content')

for p in div_entry.find_all('p'):
    text = p.get_text(strip=True)

    if text:    # skip paragraphs with no text (e.g. image-only ones)
        print(text)

Giving you:

It has been five years since I revisited the historic built environment of northeast Montana.  My last posting took a second look at Wolf Point, the seat of Roosevelt County.  I thought a perfect follow-up would be second looks at the different county seats of the region–a part of the Treasure State that I have always enjoyed visiting, and would strongly encourage you to do the same.
Grain elevators along the Glasgow railroad corridor.
Like Wolf Point, Glasgow is another of the county seats created in the wake of the Manitoba Road/Great Northern Railway building through the state in the late 1880s.  Glasgow is the seat of Valley County.  The courthouse grounds include not only the modernist building above from 1973 but a WPA-constructed courthouse annex/ public building from 1939-1940 behind the courthouse.
The understated WPA classic look of this building fits into the architectural legacies of Glasgow.  My first post about the town looked at its National Register buildings and the blending of classicism and modernism.  Here I want to highlight other impressive properties that I left out of the original Glasgow entry.  St. Michael’s Episcopal Church is an excellent late 19th century of Gothic Revival style in Montana.
The town has other architecturally distinctive commercial buildings that document its transition from late Victorian era railroad town to am early 20th century homesteading boom town.
The fact that these buildings are well-kept and in use speaks to the local commitment to stewardship and effective adaptive reuse projects.  As part of Glasgow’s architectural legacy I should have said more about its Craftsman-style buildings, beyond the National
Register-listed Rundle Building.  The Rundle is truly eye-catching but Glasgow also has a Mission-styled apartment row and then its historic Masonic Lodge.
I have always been impressed with the public landscapes of Glasgow, from the courthouse grounds to the city-county library (and its excellent local history collection)
and on to Valley County Fairgrounds which are located on the boundaries of town.
Another key public institution is the Valley County Pioneer Museum, which proudly emphasizes the theme of from dinosaur bones to moon walk–just see its entrance.
The museum was a fairly new institution when I first visited in 1984 and local leaders proudly took me through the collection as a way of emphasizing what themes and what places they wanted to be considered in the state historic preservation plan.  Then I spoke with the community that evening at the museum.  Not surprisingly then, the museum has ever since been a favorite place.  Its has grown substantially in 35 years to include buildings and other large items on a lot adjacent to the museum collections.  I have earlier discussed its collection of Thomas Moleworth furniture–a very important bit of western material culture from the previous town library.  In the images below, I want to suggest its range–from the deep Native American past to the railroad era to the county’s huge veteran story and even its high school band and sports history.
A new installation, dating to the Lewis and Clark Bicentennial of 2003, is a mural depicting the Corps of Discovery along the Missouri River in Valley County.  The mural is signed by artist Jesse W. Henderson, who also identifies himself as a Chippewa-Cree.  The mural is huge, and to adequately convey its details I have divided my images into the different groups of people Henderson interprets in the mural.
The Henderson mural, together with the New Deal mural of the post office/courthouse discussed in my first Glasgow posting (below is a single image of that work by Forrest
Hill), are just two of the reasons to stop in Glasgow–it is one of those county seats where I discover something new every time I travel along U.S. Highway 2.
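
If you only want the first paragraph that actually has text (which is what the question seemed to be after), one possible approach is to stop at the first non-empty result. This is just a sketch: first_paragraph_text is a hypothetical helper name, and the html string is a shortened, made-up stand-in for the page so the example runs offline with the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

def first_paragraph_text(container):
    """Return the text of the first <p> in container that is not empty, else None.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    for p in container.find_all('p'):
        text = p.get_text(strip=True)
        if text:
            return text
    return None

# Made-up stand-in markup: an image-only paragraph followed by a text one
html = (
    '<article><div class="entry-content">'
    '<p><img alt="Valley Co Glasgow courthouse" src="img_7957.jpg"/></p>'
    '<p>Glasgow is the seat of Valley County.</p>'
    '</div></article>'
)

soup = BeautifulSoup(html, 'html.parser')
div_entry = soup.find('article').find('div', class_='entry-content')

first_text = first_paragraph_text(div_entry)
print(first_text)
```

On the real page you would pass in the div_entry from the code above instead of the stand-in markup.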