I am trying to scrape a Realtor webpage, and I can do so successfully with Requests and BS4. The main problem is that each listing sometimes returns one item and sometimes two, depending on whether the item is present in the listing. Both items have the same div tag and class name, so I can't differentiate them.
My code is below:
import requests
from bs4 import BeautifulSoup
import pandas as pd
html = requests.get('https://www.realtor.com/realestateagents/84664/pg-1')
doc = BeautifulSoup(html.text,'html.parser')
names = []
contacts = []
for_sale = []
sold = []
price_range = []
last_listing_date = []
for box in doc.find_all('div', class_='jsx-3970352998 agent-list-card clearfix'):
    names.append(box.find('div', class_='jsx-3970352998 agent-name text-bold').text)
    try:
        contacts.append(box.find('div', class_='jsx-3970352998 agent-phone hidden-xs hidden-xxs'))
    except IndexError:
        contacts.append('No contact number found')
    property_data = box.find_all('div', class_='jsx-3970352998 agent-detail-item ellipsis')
    try:
        for_sale.append(property_data[0].span.text)
    except:
        for_sale.append('None')
    try:
        sold.append(property_data[1].span.text)
    except:
        sold.append('0')
    price_activity = box.find_all('div', class_='jsx-3970352998 second-column col-lg-6 no-padding')
    a = price_activity[0].find_all('div', class_='jsx-3970352998 agent-detail-item')
    print(len(a))
    try:
        price_range.append(a[0].span.text)
        print(a[0].span.text)
    except IndexError:
        print('No activity range found')
        price_range.append('No activity range found')
    try:
        print(a[1].span.text)
        last_listing_date.append(a[1].span.text)
    except IndexError:
        print('No listing data found')
        last_listing_date.append('No listing data found')
df = pd.DataFrame(data={'Name':names, 'Contact':contacts, 'Active Listings':for_sale, 'Properties Sold':sold,
                        'Price Range':price_range, 'Last Listing Date':last_listing_date})
df
And this is my output. I have highlighted in yellow the values that end up in the wrong column: because some listings don't have an Activity Range, they return only one item, the Last Listing Date, and my current code cannot handle that. I am not sure how to tackle this problem. In the desired output, those values should be where I marked the red dots.
CodePudding user response:
You should be able to get the data you're looking for like this:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
# Reuse one session so every request carries the same headers.
s = requests.Session()
s.headers.update(headers)
big_list = []
for x in tqdm(range(1, 12)):
    r = s.get(f'https://www.realtor.com/realestateagents/84664/pg-{x}', headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    agent_cards = soup.select('div[data-testid="component-agentCard"]')
    for a in agent_cards:
        name = a.select_one('div.agent-name').get_text(strip=True)
        company = a.select_one('div.agent-group').get_text(strip=True)
        # select_one() returns None when nothing matches, so .get_text() raises AttributeError.
        try:
            phone = a.select_one('div.agent-phone').get_text(strip=True)
        except AttributeError:
            phone = 'Phoneless'
        try:
            experience = a.select_one('div#agentExperience').get_text(strip=True)
        except AttributeError:
            experience = 'Quite inexperienced'
        try:
            h_for_sale = a.select_one('span.sale-sold-count').get_text(strip=True)
        except AttributeError:
            h_for_sale = 0
        big_list.append((name, company, phone, experience, h_for_sale))
df = pd.DataFrame(big_list, columns = ['Name', 'Company', 'Phone', 'Experience', 'For sale'])
print(df)
Result:
| | Name | Company | Phone | Experience | For sale |
|---|---|---|---|---|---|
| 0 | Martha McMullin | The Group Real Estate, LLC | (303) 638-1033 | Experience:8 years | 2 |
| 1 | Aren Bybee | R and R Realty, LLC | (801) 210-1461 | Experience:22 years 2 months | 31 |
| 2 | Kenny ParcellTeam | Equity Real Estate - Utah | (801) 794-7777 | Experience:26 years 7 months | 24 |
| 3 | Eric MossTeam | Equity Real Estate - Utah | (801) 669-0383 | Experience:10 years 5 months | 10 |
| 4 | Chantelle Rees | Equity Real Estate - Results | (801) 636-2515 | Quite inexperienced | 4 |
[...]
Using the logic above, you can obtain other info as well and include it in the dataframe. BeautifulSoup docs: https://beautiful-soup-4.readthedocs.io/en/latest/index.html Also, TQDM: https://pypi.org/project/tqdm/
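If you add more fields this way, the repeated try/except blocks can get noisy. One possible way to trim them is a small helper that checks for None before reading the text. This is only a sketch: `text_or_default` is a hypothetical name, not something from BeautifulSoup, and it assumes the element comes from a call like `select_one()` or `find()`, which return None when nothing matches.

```python
def text_or_default(element, default):
    """Return the stripped text of a BeautifulSoup element.

    `element` is whatever select_one()/find() returned; when the lookup
    found nothing it is None, and `default` is returned instead.
    """
    if element is None:
        return default
    return element.get_text(strip=True)
```

Inside the card loop this would be used as, for example, `phone = text_or_default(a.select_one('div.agent-phone'), 'Phoneless')`, so each optional field takes one line.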
CodePudding user response:
It seems that the element locator strategy was not quite right.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url='https://www.realtor.com/realestateagents/84664/pg-{page}'
data =[]
for page in range(1,6):
    req = requests.get(url.format(page=page))
    soup = BeautifulSoup(req.text,'html.parser')
    for card in soup.select('div.cardWrapper > ul > div'):
        # The attribute filters inside the [] of the next four selectors are missing;
        # as written, 'div[]' is not a valid CSS selector.
        names = card.select_one('div[]').text
        contacts = card.select_one('div[]').get_text(strip=True)
        for_sale = card.select_one('div[]:nth-child(1) > span').text
        sold = card.select_one('div[]:nth-child(1) > span').text
        # Match the optional items by their label text, so a missing one becomes None
        # instead of shifting the other value into the wrong column.
        price = card.select_one('div:-soup-contains("Activity range") > span')
        price_range = price.text if price else None
        date = card.select_one('div:-soup-contains("Listed a house") > span')
        last_listing_date = date.text if date else None
        data.append({
            'names':names,
            'contacts':contacts,
            'for_sale':for_sale,
            'sold':sold,
            'price_range':price_range,
            'last_listing_date':last_listing_date
        })
df = pd.DataFrame(data)
print(df)
Output:
names contacts ... price_range last_listing_date
0 Clint Allred Kw South Valley Keller Williams ... $370K - $1.08M 2022-08-18
1 Martha McMullin The Group Real Estate, LLC ... $495K - $995K 2022-08-18
2 Aren Bybee R and R Realty, LLC ... $115K - $2.49M 2022-08-18
3 Kenny ParcellTeam Equity Real Estate - Utah ... $125K - $1.2M 2022-08-17
4 Eric MossTeam Equity Real Estate - Utah ... $125K - $600K 2022-08-17
.. ... ... ... ... ...
95 Marny Schlopy Coldwell Banker Realty ... $410K - $756K None
96 Amy Laster-Haynes Better Homes and Gardens Real Estate Momentum ... $364K - $2.62M None
97 Raquel Jex Presidio Real Estate Company ... $442K - $442K None
98 Kelly Ercanbrack Unite Real Estate ... $400K - $400K None
99 Camie Jefferies Equity Real Estate - Tooele ... None Reported None
[100 rows x 6 columns]
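The key idea in the code above is to pick each optional item by its label ("Activity range", "Listed a house") rather than by its position in the list. The same idea can be sketched in plain Python, without any selectors. The names `LABEL_TO_COLUMN` and `details_to_row` are made up for this illustration, and the "label: value" layout of each item's text is an assumption about the page, not something confirmed here.

```python
# Labels come from the selectors above; the "label: value" text layout is an assumption.
LABEL_TO_COLUMN = {
    "Activity range": "price_range",
    "Listed a house": "last_listing_date",
}


def details_to_row(detail_texts):
    """Map each detail string to its column by label, not by position."""
    # Start with every column empty so a missing item stays None.
    row = {column: None for column in LABEL_TO_COLUMN.values()}
    for text in detail_texts:
        text = text.strip()
        for label, column in LABEL_TO_COLUMN.items():
            if text.startswith(label):
                # Drop the label and any separating colon or space.
                row[column] = text[len(label):].lstrip(": ").strip()
                break
    return row


# A card with both details, then a card that only has the listing date.
print(details_to_row(["Activity range: $370K - $1.08M", "Listed a house: 2022-08-18"]))
print(details_to_row(["Listed a house: 2022-08-18"]))
```

In the second call the date still ends up under `last_listing_date` and `price_range` stays None, which is the behaviour the question asks for. In the question's code, the input list could be built from the text of each element in `a`.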