Extracting the last text in <p> tag

Time:10-07

I want to extract the last piece of text from each entry of a list on a webpage. In every entry, that last piece of text is an address.

For example:

import requests
from bs4 import BeautifulSoup

url = 'https://www.housebeautiful.com/lifestyle/g26859396/movie-homes-you-can-visit/'

soup = BeautifulSoup(requests.get(url).content, 'lxml')
for i in soup.select('p'):
    print(i.text.strip())

This prints all the text inside the <p> tags, for example:

Every item on this page was hand-picked by a House Beautiful editor. We may earn commission on some of the items you choose to buy.
BRB planning my summer road trip now.
Movie and TV show fanatics know the immediate thrill of seeing the house or apartment building that appears in their favorite flick or series. It takes that emotional connection to the fantasy world and brings it into a physical space, which is why no one should pass up the opportunity to visit their favorite movie or TV destination if they can. From spending a night in Ralphie's room from A Christmas Story to visiting the museum that is Bruce Wayne's mansion from The Dark Knight Rises, you’ll want to visit more than one of these iconic buildings from movies and TV shows.
Nestled near downtown L.A., the building used for the exterior shots of this Fox sitcom is actually a real-life apartment building you could rent, although I can't guarantee the roommates will be as fun.
836 Traction Avenue, Los Angeles, CA 90013

However, I want only the addresses:

'836 Traction Avenue, Los Angeles, CA 90013',  
'320 Jefferson St, Natchitoches, LA 71457',  
'1709 Broderick St., San Francisco, CA 94115' ...

Could this be done by selecting the last text in each <p> tag, for each list entry?

CodePudding user response:

It looks like the last line, the address, is always preceded by a newline character (\n), so you should be able to split each paragraph's text on newlines and keep the last piece:

url = 'https://www.housebeautiful.com/lifestyle/g26859396/movie-homes-you-can-visit/'

soup = BeautifulSoup((requests.get(url)).content, 'lxml')
for i in soup.select('p'):
    print(i.text.strip().split('\n')[-1])
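
To try the split-on-newline idea without fetching the page, here is a minimal offline sketch. The markup in `html` is hypothetical (the real page may be structured differently), `last_line` is a helper name invented for this example, and `html.parser` is used only so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup


def last_line(text):
    """Return the last non-empty line of text, stripped, or '' if there is none."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""


# Hypothetical markup: a description and an address separated by a newline,
# followed by an empty paragraph to show the fallback.
html = (
    "<p>Nestled near downtown L.A.\n"
    "836 Traction Avenue, Los Angeles, CA 90013\n</p>"
    "<p></p>"
)

soup = BeautifulSoup(html, "html.parser")
results = [last_line(p.get_text()) for p in soup.select("p")]
print(results)
```

Filtering out blank lines before indexing avoids returning an empty string when a paragraph ends with extra whitespace. Note that this still returns the last line of every paragraph, including paragraphs that contain no address at all.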

CodePudding user response:

If you literally just want the last <p> tag, perform find_all and then index into the result to get the last item.

For example:

from bs4 import BeautifulSoup

html = '''<div>
<p>first</p>
<p>second</p>
<p>last</p>
</div>'''

soup = BeautifulSoup(html, 'lxml')
paragraphs = soup.find_all('p')
print(paragraphs[-1].text.strip())

Output:

last
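
When there are several containers and the last paragraph of each one is needed, indexing a single find_all result is not enough. One possible alternative, not part of the answer above, is the CSS pseudo-class `:last-of-type`, which Beautiful Soup's `select` supports. The markup below is made up for illustration, and `html.parser` is used so the sketch does not need lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup with two containers of different lengths.
html = """
<div><p>first</p><p>last of div one</p></div>
<div><p>only paragraph of div two</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# p:last-of-type matches a <p> that is the last <p> among its siblings,
# so this yields one paragraph per <div>.
last_paragraphs = [p.get_text(strip=True) for p in soup.select("div > p:last-of-type")]
print(last_paragraphs)
```

Each `<div>` contributes exactly one match here, and a container without any `<p>` simply contributes nothing instead of raising an IndexError.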

CodePudding user response:

Each address sits in the last <p> tag of a <div> with the class slideshow-slide-dek. You can iterate over these <div> elements and take the last paragraph of each like this:

import requests
from bs4 import BeautifulSoup

url = 'https://www.housebeautiful.com/lifestyle/g26859396/movie-homes-you-can-visit/'

soup = BeautifulSoup((requests.get(url)).content, 'lxml')
slide_deks = soup.find_all('div', attrs={'class': 'slideshow-slide-dek'})
addresses = []
for slide_dek in slide_deks:
    # Append the text of the last <p>; extend() on a Tag would add its child nodes instead
    addresses.append(slide_dek.find_all('p')[-1].get_text())

print(*addresses, sep="\n")

Output:

836 Traction Avenue, Los Angeles, CA 90013
320 Jefferson St, Natchitoches, LA 71457
1709 Broderick St., San Francisco, CA 94115 
3159 W 11th St, Cleveland, OH 44109
2715 N Junett St., Tacoma, WA 98407
304 N Canyon Blvd, Monrovia, CA 91016 
112 Ocean Ave, Amityville, NY 11701
4196 Colfax Ave, Studio City, CA 91604
318 Essex St, Salem, MA 01970 
3828 Piermont Dr. NE, Albuquerque, NM 87111
1155 103rd St (at Bay Harbor Club), Miami Beach, FL 33154
12 Picket Post Close, Winkfield Row, Bracknell RG12 9FG
8 Circle St. Layton, Perryopolis, Pennsylvania 15473
64 E Main St., Freehold, NJ 07728
1883 Orlando Rd., San Marino, CA 91108
1000 Mission St., South Pasadena, CA 91030
251 North Bristol Avenue, Los Angeles, CA 90049
 4160 Country Club Dr., Long Beach, CA 90807
344 Fremont St, Woodstock, IL 60098
204 Martins Point Rd., Wadmalaw Island, SC 29487
671 Lincoln Ave, Winnetka, IL 60093
481 Cold Canyon Rd, Calabasas, CA 91302
2640 Steiner St., San Francisco, CA 94115
843 S El Molino Ave, Pasadena, CA 91106
3333 NW Quimby St, Portland, OR 97210
3 Ocean Ave, Salem, MA 01970
102 N Pacific St, Oceanside, CA 92054
90 Bedford St, New York, NY 10014
199 Feeks Lane, Locust Valley, NY 11560
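
A few lines of the output above carry stray leading or trailing spaces, and the loop would raise an IndexError on a slide without any <p>. Below is a slightly more defensive sketch under some assumptions: `extract_addresses` is a name invented here, the `sample` markup is a hypothetical stand-in for the real page, and `html.parser` replaces lxml only to keep the snippet dependency-light.

```python
from bs4 import BeautifulSoup


def extract_addresses(html):
    """Collect the stripped text of the last <p> in each slideshow-slide-dek <div>."""
    soup = BeautifulSoup(html, "html.parser")
    addresses = []
    for dek in soup.find_all("div", class_="slideshow-slide-dek"):
        paragraphs = dek.find_all("p")
        if not paragraphs:
            continue  # skip slides that have no paragraph at all
        text = paragraphs[-1].get_text(strip=True)
        if text:
            addresses.append(text)
    return addresses


# Hypothetical stand-in for the page: two normal slides and one empty slide.
sample = """
<div class="slideshow-slide-dek"><p>Description.</p><p> 4160 Country Club Dr., Long Beach, CA 90807</p></div>
<div class="slideshow-slide-dek"></div>
<div class="slideshow-slide-dek"><p>Another description.</p><p>318 Essex St, Salem, MA 01970 </p></div>
"""

print(*extract_addresses(sample), sep="\n")
```

To run it against the live page, pass `requests.get(url).content` to `extract_addresses`. Whether every slide on that page really ends with an address is something to check by eye.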