How to handle an arbitrary number of returns from scraping requests in Python, BS4

Time:08-19

I am trying to scrape a Realtor webpage, and I can do so using Requests and BS4. The main problem is that each listing sometimes returns one item and sometimes two, depending on whether the item is present in the listing. Both items have the same tag (div) and class name, so I can't differentiate them.

My code is below:

import requests
from bs4 import BeautifulSoup
import pandas as pd

html = requests.get('https://www.realtor.com/realestateagents/84664/pg-1')
doc = BeautifulSoup(html.text,'html.parser')

names = []
contacts = []
for_sale = []
sold = []
price_range = []
last_listing_date = []

for box in doc.find_all('div', class_='jsx-3970352998 agent-list-card clearfix'):
    names.append(box.find('div', class_='jsx-3970352998 agent-name text-bold').text)

    # find() returns None instead of raising when the phone div is missing
    phone = box.find('div', class_='jsx-3970352998 agent-phone hidden-xs hidden-xxs')
    contacts.append(phone.text if phone else 'No contact number found')

    property_data = box.find_all('div', class_='jsx-3970352998 agent-detail-item ellipsis')

    try:
        for_sale.append(property_data[0].span.text)
    except (IndexError, AttributeError):
        for_sale.append('None')
    try:
        sold.append(property_data[1].span.text)
    except (IndexError, AttributeError):
        sold.append('0')

    price_activity = box.find_all('div', class_='jsx-3970352998 second-column col-lg-6 no-padding')
    a = price_activity[0].find_all('div', class_='jsx-3970352998 agent-detail-item')
    print(len(a))  # 1 or 2, depending on the listing

    # Positional lookup: a[0] is treated as the price range and a[1] as the
    # last listing date, even when only one item is present
    try:
        price_range.append(a[0].span.text)
        print(a[0].span.text)
    except IndexError:
        print('No activity range found')
        price_range.append('No activity range found')
    try:
        print(a[1].span.text)
        last_listing_date.append(a[1].span.text)
    except IndexError:
        print('No listing data found')
        last_listing_date.append('No listing data found')

df = pd.DataFrame(data={'Name':names, 'Contact':contacts, 'Active Listings':for_sale, 'Properties Sold':sold,
                   'Price Range':price_range, 'Last Listing Date':last_listing_date})
df

And this is my output. I have highlighted in yellow the values that end up in the wrong column: because some listings don't have an Activity Range, they return only one item, the Last Listing Date, and my current code cannot handle that. I am not sure how to tackle this problem. In the desired output, those values should be where I marked the red dots.

My output

CodePudding user response:

You should be able to get the data you're looking for like this:

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
s = requests.Session()
s.headers.update(headers)
big_list = []
for x in tqdm(range(1, 12)):
    # The session already carries the User-Agent header set above
    r = s.get(f'https://www.realtor.com/realestateagents/84664/pg-{x}')
    soup = BeautifulSoup(r.text, 'html.parser')
    agent_cards = soup.select('div[data-testid="component-agentCard"]')
    for a in agent_cards:
        name = a.select_one('div.agent-name').get_text(strip=True)
        company = a.select_one('div.agent-group').get_text(strip=True)
        # select_one() returns None for a missing element, so .get_text()
        # raises AttributeError; fall back to a default in that case
        try:
            phone = a.select_one('div.agent-phone').get_text(strip=True)
        except AttributeError:
            phone = 'Phoneless'
        try:
            experience = a.select_one('div#agentExperience').get_text(strip=True)
        except AttributeError:
            experience = 'Quite inexperienced'
        try:
            h_for_sale = a.select_one('span.sale-sold-count').get_text(strip=True)
        except AttributeError:
            h_for_sale = 0
        big_list.append((name, company, phone, experience, h_for_sale))

df = pd.DataFrame(big_list, columns = ['Name', 'Company', 'Phone', 'Experience', 'For sale'])
    
print(df)

Result:

   Name               Company                       Phone           Experience                    For sale
0  Martha McMullin    The Group Real Estate, LLC    (303) 638-1033  Experience:8 years            2
1  Aren Bybee         R and R Realty, LLC           (801) 210-1461  Experience:22 years 2 months  31
2  Kenny ParcellTeam  Equity Real Estate - Utah     (801) 794-7777  Experience:26 years 7 months  24
3  Eric MossTeam      Equity Real Estate - Utah     (801) 669-0383  Experience:10 years 5 months  10
4  Chantelle Rees     Equity Real Estate - Results  (801) 636-2515  Quite inexperienced           4

[...]

Using the same approach, you can extract other fields and add them to the DataFrame. BeautifulSoup docs: https://beautiful-soup-4.readthedocs.io/en/latest/index.html TQDM docs: https://pypi.org/project/tqdm/
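
The repeated try/except blocks above could also be folded into one small helper. This is only a minimal sketch: it assumes `node` is whatever `select_one()` returned (a Tag, or None when nothing matched), and the name `text_or_default` is hypothetical, not part of BeautifulSoup.

```python
def text_or_default(node, default):
    """Return the stripped text of a bs4 Tag, or `default` if node is None.

    `node` is expected to be the result of select_one() or find(), both of
    which return None when nothing matches.
    """
    if node is None:
        return default
    return node.get_text(strip=True)


# Intended use inside the card loop (`a` is one agent card):
#   phone = text_or_default(a.select_one('div.agent-phone'), 'Phoneless')
#   experience = text_or_default(a.select_one('div#agentExperience'),
#                                'Quite inexperienced')
```

Checking for None explicitly keeps real errors visible, whereas a broad `except` would also hide typos or unrelated failures.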

CodePudding user response:

The problem seems to be the element locator strategy: selecting by position breaks when an item is missing, so select each optional field by its label text instead.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url='https://www.realtor.com/realestateagents/84664/pg-{page}'
data =[]
for page in range(1,6):
    req = requests.get(url.format(page=page))
    soup = BeautifulSoup(req.text,'html.parser')

    for card in soup.select('div.cardWrapper > ul > div'):
        names = card.select_one('div[]').text
        contacts = card.select_one('div[]').get_text(strip=True)
        for_sale = card.select_one('div[]:nth-child(1) > span').text
        sold = card.select_one('div[]:nth-child(1) > span').text
        price = card.select_one('div:-soup-contains("Activity range") > span')
        price_range = price.text if price else None
        date = card.select_one('div:-soup-contains("Listed a house") > span')
        last_listing_date = date.text if date else None

        data.append({
            'names':names,
            'contacts':contacts,
            'for_sale':for_sale,
            'sold':sold,
            'price_range':price_range,
            'last_listing_date':last_listing_date
        })
   

df = pd.DataFrame(data)

print(df)

Output:

               names                                       contacts  ...     price_range last_listing_date
0        Clint Allred                Kw South Valley Keller Williams  ...  $370K - $1.08M        2022-08-18
1     Martha McMullin                     The Group Real Estate, LLC  ...   $495K - $995K        2022-08-18
2          Aren Bybee                            R and R Realty, LLC  ...  $115K - $2.49M        2022-08-18
3   Kenny ParcellTeam                      Equity Real Estate - Utah  ...   $125K - $1.2M        2022-08-17
4       Eric MossTeam                      Equity Real Estate - Utah  ...   $125K - $600K        2022-08-17
..                ...                                            ...  ...             ...               ...
95      Marny Schlopy                         Coldwell Banker Realty  ...   $410K - $756K              None
96  Amy Laster-Haynes  Better Homes and Gardens Real Estate Momentum  ...  $364K - $2.62M              None       
97         Raquel Jex                   Presidio Real Estate Company  ...   $442K - $442K              None       
98   Kelly Ercanbrack                              Unite Real Estate  ...   $400K - $400K              None       
99    Camie Jefferies                    Equity Real Estate - Tooele  ...   None Reported              None       

[100 rows x 6 columns]
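
The label-based idea can also be sketched in plain Python, independent of CSS selectors. This is a rough illustration under assumptions: the label strings "Activity range" and "Listed a house" come from the selectors above, but the "label: value" text layout, the sample strings, and the names `LABEL_TO_FIELD` and `split_details` are hypothetical.

```python
# Labels taken from the selectors above; the "<label>: <value>" layout of
# each item's text is an assumption about the page.
LABEL_TO_FIELD = {
    "Activity range": "price_range",
    "Listed a house": "last_listing_date",
}


def split_details(texts):
    """Assign each detail string to a field by its label, not its position."""
    fields = {field: None for field in LABEL_TO_FIELD.values()}
    for text in texts:
        text = text.strip()
        for label, field in LABEL_TO_FIELD.items():
            if text.startswith(label):
                # Drop the label and the separator that follows it
                fields[field] = text[len(label):].lstrip(": ").strip()
                break
    return fields


# A card with only a last listing date leaves price_range as None
print(split_details(["Listed a house: 2022-08-18"]))
# {'price_range': None, 'last_listing_date': '2022-08-18'}
```

With BeautifulSoup, the input list could be built from the detail divs of one card, for example `[d.get_text(" ", strip=True) for d in detail_divs]`, where `detail_divs` is whatever `find_all()` returned for that card.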