I am trying to scrape the ul and li tags on Capterra product pages. I want to extract three things and store them in separate variables: the country (the "Located in ..." line), the URL, and the product features.
Currently, I only know how to print the text of everything in the ul and li tags, not a specific item.
Code:
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.firefox import GeckoDriverManager
import requests

driver = webdriver.Firefox()
driver.get("https://www.capterra.com/p/81310/AMCS/")
# Parse the rendered page source with BeautifulSoup
companyProfile = bs(driver.page_source, 'html.parser')
# .text concatenates the text of every child element with no separator
url = companyProfile.find("ul", class_="nb-type-md nb-list-undecorated undefined").text
features = companyProfile.find("div", class_="nb-col-count-1 sm:nb-col-count-2 md:nb-col-count-3 nb-col-gap-xl nb-my-0 nb-mx-auto").text
print(url)
print(features)
driver.close()
Output:
AMCSLocated in United StatesFounded in 2004http://www.amcsgroup.com/
Billing & InvoicingBrokerage ManagementBuy / Sell TicketingContainer ManagementCustomer AccountsCustomer DatabaseDispatch ManagementElectronics RecyclingEquipment TrackingFingerprint ScanningID ScanningIntegrated CamerasInventory ManagementInventory TrackingLogistics Management
How do I get only the url and the country, and how do I get the features neatly?
I was able to get the URL and the location by:
url = driver.find_element(By.XPATH, "//*[starts-with(., 'http')]").text
location = driver.find_element(By.XPATH, "//*[starts-with(., 'Located in')]").text
Still looking for a solution for the features.
CodePudding user response:
The following code pulls the text of each li inside the ul separately:
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')
# find_all returns the matching ul elements; select_one then searches inside each one
for ul in soup.find_all('ul', class_="nb-type-md nb-list-undecorated undefined"):
    name = ul.select_one('li:nth-child(1) > span').get_text()
    location = ul.select_one('li:nth-child(2) > span').get_text()
    year = ul.select_one('li:nth-child(3) > span').get_text()
    link = ul.select_one('li:nth-child(4) > span').get_text()
    print(name)
    print(location)
    print(year)
    print(link)
Output:
AMCS
Located in United States
Founded in 2004
http://www.amcsgroup.com/
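Since the question asks for the country on its own, the "Located in" prefix can be stripped from the location string. This is a minimal sketch, assuming the text always starts with that prefix; the helper name extract_country is hypothetical and not part of any library:

```python
def extract_country(location_text: str) -> str:
    """Return the country from text such as 'Located in United States'."""
    prefix = "Located in "
    text = location_text.strip()
    # Drop the prefix only when it is present; otherwise return the text unchanged
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


country = extract_country("Located in United States")
print(country)
```

In the loop above, this would be called as extract_country(location).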
Update:
li = [x.get_text() for x in soup.select('ul[class="nb-type-md nb-list-undecorated undefined"] li span')]
print(li)
Output:
['AMCS', 'Located in United States', 'Founded in 2004', 'http://www.amcsgroup.com/']
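For the features, one possible approach is to iterate over the text nodes of the features div instead of calling .text, which joins everything without a separator. The sketch below runs against a hypothetical stand-in for the markup, since the real nesting inside that div is not shown in the question; it assumes each feature name sits in its own element as a single text node:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the features markup; the real page may nest elements differently
sample_html = """
<div class="nb-col-count-1 nb-my-0 nb-mx-auto">
  <div><span>Billing &amp; Invoicing</span></div>
  <div><span>Brokerage Management</span></div>
  <div><span>Buy / Sell Ticketing</span></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
features_div = soup.select_one("div.nb-col-count-1")
# stripped_strings yields each non-empty text node with surrounding whitespace removed
features = list(features_div.stripped_strings)
print(features)
```

Against the live page, sample_html would be replaced by driver.page_source. If a single feature name turns out to be split across several nested tags, stripped_strings would return it in pieces, so the result is worth checking against the rendered page.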