Web scraping ul li tags

Time:07-01

I am trying to scrape the ul and li tags on Capterra product pages. I want to store three things in separate variables: the country (from the "Located in ..." item), the URL, and the product features.

Currently, I only know how to print the text for everything in the ul & li, not something specific.

Code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs  # bs(...) is used below to parse the page source

driver = webdriver.Firefox()
driver.get("https://www.capterra.com/p/81310/AMCS/")

companyProfile = bs(driver.page_source, 'html.parser')

url = companyProfile.find("ul", class_="nb-type-md nb-list-undecorated undefined").text

features = companyProfile.find("div", class_="nb-col-count-1 sm:nb-col-count-2 md:nb-col-count-3 nb-col-gap-xl nb-my-0 nb-mx-auto").text 

print(url)
print(features)

driver.close()

Output:

AMCSLocated in United StatesFounded in 2004http://www.amcsgroup.com/
Billing & InvoicingBrokerage ManagementBuy / Sell TicketingContainer ManagementCustomer AccountsCustomer DatabaseDispatch ManagementElectronics RecyclingEquipment TrackingFingerprint ScanningID ScanningIntegrated CamerasInventory ManagementInventory TrackingLogistics Management

How do I get only the url and the country, and how do I get the features neatly?


I was able to get the URL and the location by:

url = driver.find_element(By.XPATH, "//*[starts-with(., 'http')]").text

location = driver.find_element(By.XPATH, "//*[starts-with(., 'Located in')]").text

Still looking for a solution for the features.

Answer:

The following code pulls the text of each li item in the ul into its own variable:

from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')

for ul in soup.find_all('ul', class_="nb-type-md nb-list-undecorated undefined"):
    # select_one is scoped to this ul, so each selector only needs the li position
    name = ul.select_one('li:nth-child(1) > span').get_text()
    location = ul.select_one('li:nth-child(2) > span').get_text()
    year = ul.select_one('li:nth-child(3) > span').get_text()
    link = ul.select_one('li:nth-child(4) > span').get_text()

    print(name)
    print(location)
    print(year)
    print(link)

Output:

AMCS
Located in United States 
Founded in 2004
http://www.amcsgroup.com/
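
The question asks for the country on its own rather than the full "Located in ..." text. A minimal sketch of one way to trim the labels, assuming the strings always start with the prefixes shown in the output above; `strip_label` is a hypothetical helper, not part of Selenium or BeautifulSoup:

```python
def strip_label(text, label):
    """Return text without a leading label, trimmed of surrounding whitespace."""
    text = text.strip()
    if text.startswith(label):
        return text[len(label):].strip()
    # Label not present: return the cleaned text unchanged
    return text


# Values copied from the output above
location = "Located in United States "
year = "Founded in 2004"

country = strip_label(location, "Located in")
founded = strip_label(year, "Founded in")

print(country)  # United States
print(founded)  # 2004
```

If the page ever changes the wording of these labels, the helper returns the original text instead of raising an error, so it is worth checking the result before storing it.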

Update:

# Collect the text of every span inside the list's li items
li = [x.get_text() for x in soup.select('ul.nb-list-undecorated li span')]
print(li)

Output:

['AMCS', 'Located in United States', 'Founded in 2004', 'http://www.amcsgroup.com/']
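
The features were not covered above. A possible approach, sketched here against a simplified, hypothetical stand-in for the features block: it assumes each feature name sits in its own element inside the div, which may not match the real page. BeautifulSoup's `stripped_strings` then yields one string per feature instead of one run-together string. `extract_features` and `SAMPLE_HTML` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page may nest feature names differently.
SAMPLE_HTML = """
<div class="nb-col-count-1 nb-my-0 nb-mx-auto">
  <div><span>Brokerage Management</span></div>
  <div><span>Buy / Sell Ticketing</span></div>
  <div><span>Container Management</span></div>
</div>
"""


def extract_features(html):
    """Return the feature names found in the features container as a list."""
    soup = BeautifulSoup(html, "html.parser")
    # Match on a single class token; the full class string contains ':' characters
    # that would need escaping in a CSS selector.
    container = soup.select_one("div.nb-col-count-1")
    if container is None:
        return []
    # stripped_strings yields each text node with whitespace trimmed,
    # skipping nodes that are whitespace only.
    return list(container.stripped_strings)


features = extract_features(SAMPLE_HTML)
for feature in features:
    print(feature)
```

With Selenium, the same function could be called as `extract_features(driver.page_source)`. If a single feature turns out to be split across several nested tags, iterating over the child elements and calling `get_text(strip=True)` on each may work better.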