I'm trying to scrape company profile pages from Capterra using Selenium. Capterra loads profiles in batches of 25. My code gets the first 5 links, but then prints "None" for the other 20 on the page.
Code:
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.firefox import GeckoDriverManager
driver = webdriver.Firefox()
driver.get("https://www.capterra.com/waste-management-software/")
page = bs(driver.page_source, 'html.parser')
# Hits "Show More" button
driver.find_element(By.XPATH, "//*[contains(text(), 'Show More')]").click()
# Grabs Company portfolio page links
plinks = [div.a for div in page.findAll("div", attrs={"class" : "nb-mb-0"})]
for link in plinks:
    print(link)
driver.close()
Output:
<a href="/p/81310/AMCS/"><img alt="" loading="lazy" src="https://gdm-catalog-fmapi-prod.imgix.net/ProductLogo/946474e4-bd54-451d-bbaf-9c5602b2f399.png?auto=compress,format&w=180&h=180"/></a>
<a href="/p/103755/HazMat-T-T/"><img alt="" loading="lazy" src="https://gdm-catalog-fmapi-prod.imgix.net/ProductLogo/838db9d8-c251-4d78-aa69-a9cd745ef6b9.png?auto=compress,format&w=180&h=180"/></a>
<a href="/p/79230/WAM-Hauler-Easy-Bill-Route/"><img alt="" loading="lazy" src="https://gdm-catalog-fmapi-prod.imgix.net/ProductLogo/0820f6ea-9d9d-4062-987b-a3fcf25f2813.png?auto=compress,format&w=180&h=180"/></a>
<a href="/p/152697/Waste-Management-Software/"><img alt="" loading="lazy" src="https://gdm-catalog-fmapi-prod.imgix.net/ProductLogo/64597b5d-84e5-464c-ae60-84a1c5ad4976.png?auto=compress,format&w=180&h=180"/></a>
<a href="/p/177472/Via-Analytics/"><img alt="" loading="lazy" src="https://gdm-catalog-fmapi-prod.imgix.net/ProductLogo/c20bf8d6-88cc-49d5-8424-b724ba734d4a.png?auto=compress,format&w=180&h=180"/></a>
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
What I really need from the output are the hrefs that contain "/p/". The script should click the "Show More" button on the page, collect the next 25 links, click the button again, and so on.
Thanks!
CodePudding user response:
You don't need Selenium. The page is backed by an API that you can query directly: a single request returns all 125 products you need.
import json
import requests

headers = {
    'accept': '*/*',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8,es;q=0.7,ru;q=0.6',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}
params = {'htmlName': 'waste-management-software', 'countryCode': 'ES'}
base_url = "https://www.capterra.com/p/"

response = requests.get('https://www.capterra.com/directoryPage/rest/v1/category', params=params, headers=headers)
# Parse the JSON body; the result is named "data" so it does not shadow the json module
data = json.loads(response.content)
products = data["pageData"]["categoryData"]["products"]

print("Total elements: " + str(len(products)))
for product in products:
    print("Name: " + product["product_name"])
    # Profile URL is built from the product id and slug
    print("URL: " + base_url + str(product["product_id"]) + "/" + product["product_slug"] + "/")
    print("Product url: " + product["product_url"])
    print("Image: " + product["logo_filepath"])
    print("Rating: " + str(product["rating"]))
    print()
OUTPUT:
Total elements: 125
Name: FAMA
URL: https://www.capterra.com/p/86768/FAMA/
Product url: https://info.gartnerdigitalmarkets.com/fama-es-gdm-lp
Image: https://gdm-catalog-fmapi-prod.imgix.net/ProductLogo/7a7a8467-9a2c-40d9-8488-7d6c3c0dec52.jpeg
Rating: 3.6
Name: Quentic
URL: https://www.capterra.com/p/127188/Quentic/
Product url: https://go.quentic.com/hazardous-materials-management-software
Image: https://gdm-catalog-fmapi-prod.imgix.net/ProductLogo/ba5e26a7-375d-4423-a1f2-68a27d5318c5.png
Rating: 4.8
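If you would rather stay with Selenium, two things in the question's code explain the "None" lines. First, `div.a` evaluates to None for every "nb-mb-0" div that has no `<a>` inside it. Second, `page` is parsed before the "Show More" click, so anything loaded afterwards is never seen. Below is a minimal sketch of the filtering step only, using the standard library; `ProfileLinkParser`, `profile_links`, and the `sample` markup are hypothetical names and data made up for illustration, not part of Capterra or Selenium.

```python
from html.parser import HTMLParser


class ProfileLinkParser(HTMLParser):
    """Collects href values of <a> tags whose href contains "/p/"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only profile links and skip duplicates
        if href and "/p/" in href and href not in self.links:
            self.links.append(href)


def profile_links(html):
    """Return the profile hrefs found in an HTML string, in page order."""
    parser = ProfileLinkParser()
    parser.feed(html)
    return parser.links


# Made-up markup modelled on the question's output: the middle div has no link,
# which is the case that produced None in the original list comprehension.
sample = (
    '<div class="nb-mb-0"><a href="/p/81310/AMCS/"><img alt=""/></a></div>'
    '<div class="nb-mb-0"><span>no link here</span></div>'
    '<div class="nb-mb-0"><a href="/p/103755/HazMat-T-T/"></a></div>'
)
print(profile_links(sample))
```

In the Selenium script, the idea would be to call `profile_links(driver.page_source)` after each "Show More" click rather than before it. You will probably need to wait for the new batch to render before reading the page source again.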