Extract text from a specific <p> tag

Time:11-26

I want to extract only the address from the p tags, for example: Santa Barbara, CA 93101. This is the output I currently get:

[<p class="hide" id="phoneDiv_80863"><i aria-hidden="true" class="fa fa-phone-square"></i> (805) 636-9890</p>, <p>

Santa Barbara, CA 93101



</p>, <p style="margin-top:2em;"><a class="btn btn-default" href="/profile/id/80863/NicoleABotaitis93101" target="_top">View</a> <a class="btn btn-default" href="mailto:[email protected]" id="eml80863" target="_top">Email</a></p>]
[]
[<p class="hide" id="phoneDiv_26092"><i aria-hidden="true" class="fa fa-phone-square"></i> 8058956960</p>, <p>

Santa Barbara, CA 93111

Code

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

limit = 25

url = f'https://www.counselingcalifornia.com/cc/cgi-bin/utilities.dll/customlist?FIRSTNAME=~&LASTNAME=~&ZIP=&DONORCLASSSTT=&_MULTIPLE_INSURANCE=&HASPHOTOFLG=&_MULTIPLE_EMPHASIS=&ETHNIC=&_MULTIPLE_LANGUAGE=ENG&QNAME=THERAPISTLIST&WMT=NONE&WNR=NONE&WHP=therapistHeader.htm&WBP=therapistList.htm&RANGE=1/{limit}&SORT=LASTNAME'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Mobile Safari/537.36'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
rows = soup.find_all('div', {'class':'row'})
temp=[]
for row in rows:
    t=row.find_all('div',class_='col-sm-3')
    for i in t:
        u=i.find_all('p')
        print(u)

CodePudding user response:

The accepted answer uses three nested loops, which adds unnecessary traversal work.

Here is a simpler solution using a CSS selector:

import requests
from bs4 import BeautifulSoup
import pandas as pd

limit = 25

url = f'https://www.counselingcalifornia.com/cc/cgi-bin/utilities.dll/customlist?FIRSTNAME=~&LASTNAME=~&ZIP=&DONORCLASSSTT=&_MULTIPLE_INSURANCE=&HASPHOTOFLG=&_MULTIPLE_EMPHASIS=&ETHNIC=&_MULTIPLE_LANGUAGE=ENG&QNAME=THERAPISTLIST&WMT=NONE&WNR=NONE&WHP=therapistHeader.htm&WBP=therapistList.htm&RANGE=1/{limit}&SORT=LASTNAME'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Mobile Safari/537.36'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

for address in soup.select('.col-sm-3>p:nth-child(3)'):
    print(address.text.strip())

Sample Output:

Santa Barbara, CA 93101
Santa Barbara, CA 93111
Santa Barbara, CA 93101
Tustin, CA 92780
Valencia, CA 91355
Pasadena, CA 91105
United States
Walnut Creek, CA 94596
Woodland Hills, CA 91365-0644
Monterey, CA 93940
Granada Hills, CA 91344
United States
Studio City, CA 91604
Santa Rosa, CA 95404
Sonoma
San Dimas, CA 91773
United States
San Francisco, CA 94116
Rancho Mirage, CA 92270
Berkeley, CA 94705-1808
Anderson, CA 96007
Shasta
Mission Viejo, CA 92691
United States
Claremont, CA 91711
Seal Beach, CA 90740
USA
West Covina, CA 91790
Los Angeles
Mission Viejo, CA 92692
Laguna Niguel, CA 92677
Camarillo, CA 93010
West Hills, CA 91308
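
The positional selector above depends on where the address paragraph sits among its siblings. As an alternative, here is a minimal sketch that can be run offline against the markup printed in the question. It assumes the address is the only `<p>` without attributes, which holds for the question's sample (the phone paragraph has `class`/`id`, the button paragraph has `style`) but is not verified against the whole site. The wrapping `<div class="col-sm-3">` is reconstructed from the question's code, and `extract_addresses` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Markup abridged from the question's printed output. The wrapping <div> is an
# assumption based on the question's find_all('div', class_='col-sm-3') call.
SAMPLE_HTML = """
<div class="col-sm-3">
<p class="hide" id="phoneDiv_80863"><i aria-hidden="true" class="fa fa-phone-square"></i> (805) 636-9890</p>
<p>

Santa Barbara, CA 93101



</p>
<p style="margin-top:2em;"><a class="btn btn-default" href="/profile/id/80863/NicoleABotaitis93101" target="_top">View</a></p>
</div>
"""


def extract_addresses(html):
    """Return the text of every attribute-free <p> directly inside .col-sm-3."""
    soup = BeautifulSoup(html, 'html.parser')
    addresses = []
    for p in soup.select('.col-sm-3 > p'):
        if p.attrs:
            # Skip the phone and button paragraphs, which carry attributes.
            continue
        # Collapse newlines and repeated spaces into single spaces.
        addresses.append(' '.join(p.get_text().split()))
    return addresses


print(extract_addresses(SAMPLE_HTML))
```

Because the whitespace is collapsed, a paragraph that spans several lines would come back as one line rather than several.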

CodePudding user response:

Is this what you are looking for?

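# Reuses `response` (and the imports) from the question's code.
soup = BeautifulSoup(response.text, 'html.parser')
rows = soup.find_all('div', {'class': 'row'})
for row in rows:
    cols = row.find_all('div', class_='col-sm-3')
    for col in cols:
        # The address is the second <p> in each column.
        for p in col.find_all('p')[1:2]:
            # Assumes the address is on the second line of the paragraph text.
            address = p.text.split('\n')[1]
            print(address)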
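
Indexing with `split('\n')[1]` only works if the address is always on the second line of the paragraph text. If the number of leading blank lines varies, a more tolerant approach is to take the first line that is not blank. A minimal sketch using only the standard library; `first_nonempty_line` is a hypothetical helper name, and the sample string is modelled on the question's printed output rather than fetched from the site.

```python
def first_nonempty_line(text):
    """Return the first line of `text` that has non-whitespace content, or ''."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    return ''


# Paragraph text shaped like the address <p> in the question's output.
raw = "\n\nSanta Barbara, CA 93101\n\n\n\n"
print(first_nonempty_line(raw))
```

In the loop above this would replace the indexing step, for example `address = first_nonempty_line(p.text)`.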