How to extract List items from website into DataFrame? (Clear example given)

I feel at the outset I should mention that this is a purely personal project.

I am looking to scrape car data from a well-known car website. The markup for each car's "product card" is structured as follows:

<section >
    <h3 >
Mercedes-Benz A-Class
    </h3>

    <p >
1.3 A 200 AMG LINE 5d 161 BHP | 14-DAYS MONEY BACK GUARANTEE*
    </p>

        <p >
***FREE 3 MONTHS WARRANTY***
        </p>

    <ul >

            <li >2018 (68 reg)</li>
        
            <li >Hatchback</li>
        
            <li >39,009 miles</li>
        
            <li >1.3L</li>
        
            <li >161BHP</li>
        
            <li >Automatic</li>
        
            <li >Petrol</li>
        
            <li >1 owner</li>
        
            <li >ULEZ</li>
        

    </ul>
</section>

I am able to extract the title and the subtitle in a loop quite easily as follows:

# Find elements by class name to build a list of all cards
car_list = driver.find_elements(By.CLASS_NAME, "product-card-details")

titles = []
subtitles = []

for car in car_list:
    title = car.find_element(By.CLASS_NAME, "product-card-details__title").text
    subtitle = car.find_element(By.CLASS_NAME, "product-card-details__subtitle").text
    # Store the values so they are still available after the loop
    titles.append(title)
    subtitles.append(subtitle)

However, I am having real difficulty accessing the list elements, which I call the "specs" for each vehicle. I have attempted the following:

specs = car.find_elements(By.XPATH,"//li[contains(@class, 'atc-type-picanto--medium')]")
for spec in specs:
    print(spec.get_attribute('innerHTML'))

However, this outputs the specs for every car on each iteration of the loop. (Why?)

I have also tried the following:

specs = car.find_element(By.CLASS_NAME, "listing-key-specs").get_attribute('innerHTML')
print(specs)

Which outputs:

        <li >2018 (68 reg)</li>
    
        <li >Hatchback</li>
    
        <li >39,009 miles</li>
    
        <li >1.3L</li>
    
        <li >161BHP</li>
    
        <li >Automatic</li>
    
        <li >Petrol</li>
    
        <li >1 owner</li>
    
        <li >ULEZ</li>

But I cannot seem to extract each element individually; it only comes out as a single block.

Ideally I'd like to create a list of lists:

all_specs = [[car1spec1, car1spec2, ...], [car2spec1, car2spec2, ...]]

And so on. Any help would be much appreciated as I have spent a few days trying to figure this out.

CodePudding user response：

I created an html page with the code you pasted:

<html>
<body>
<section >
    <h3 >
Mercedes-Benz A-Class
    </h3>

    <p >
1.3 A 200 AMG LINE 5d 161 BHP | 14-DAYS MONEY BACK GUARANTEE*
    </p>

        <p >
***FREE 3 MONTHS WARRANTY***
        </p>

    <ul >

            <li >2018 (68 reg)</li>

            <li >Hatchback</li>

            <li >39,009 miles</li>

            <li >1.3L</li>

            <li >161BHP</li>

            <li >Automatic</li>

            <li >Petrol</li>

            <li >1 owner</li>

            <li >ULEZ</li>


    </ul>
</section>
</body>
</html>

Then I took your code and ran it. It worked well. This is the code I used:

from selenium import webdriver
from selenium.webdriver.common.by import By


driver = webdriver.Chrome()
driver.get('file:///home/eugene/cars_example.html')
car_list = driver.find_elements(By.CLASS_NAME, "product-card-details")

titles = []
subtitles = []

for car in car_list:
    title = car.find_element(By.CLASS_NAME, "product-card-details__title").text
    subtitle = car.find_element(By.CLASS_NAME, "product-card-details__subtitle").text
    specs = car.find_elements(By.TAG_NAME, "li")
    for spec in specs:
        print(spec.get_attribute('innerHTML'))
driver.quit()

and this is the result:

2018 (68 reg)
Hatchback
39,009 miles
1.3L
161BHP
Automatic
Petrol
1 owner
ULEZ

So it looks like everything works as expected. Note that I located the specs with By.TAG_NAME rather than your XPath. This isn't a direct solution to your problem, but comparing your code with my example may help you find the mistake.
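To get from per-car spec lists to the table the question title asks for, one option is to build one record per car. The following is only a sketch with made-up sample data: the second car, the to_records helper, and the spec_1, spec_2, ... column names are all hypothetical, and it assumes the specs have no labels other than their position in the list.

```python
# Sample data in the "list of lists" shape from the question.
# The second car is invented purely to show uneven list lengths.
titles = ["Mercedes-Benz A-Class", "Example Car B"]
all_specs = [
    ["2018 (68 reg)", "Hatchback", "39,009 miles", "1.3L"],
    ["2020 (70 reg)", "Estate", "12,500 miles"],
]


def to_records(titles, all_specs):
    """Build one dict per car, padding short spec lists with None."""
    width = max((len(specs) for specs in all_specs), default=0)
    records = []
    for title, specs in zip(titles, all_specs):
        padded = list(specs) + [None] * (width - len(specs))
        record = {"title": title}
        # Generic positional column names (hypothetical): the cards do not
        # label their specs, so position is all we have to go on.
        for i, value in enumerate(padded, start=1):
            record[f"spec_{i}"] = value
        records.append(record)
    return records


records = to_records(titles, all_specs)
for record in records:
    print(record)
```

A list of dicts like this can be passed straight to pandas.DataFrame(records) to get one row per car. Positional columns are fragile if some cards omit a spec, so the result is worth checking against the page.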

CodePudding user response：

specs = car.find_elements(By.XPATH,".//li[contains(@class, 'atc-type-picanto--medium')]")

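As for what went wrong: when you search by XPath from an element, the expression needs a leading . to make it relative to that element. An XPath that starts with // is evaluated from the document root, even when it is called on car, which is why you got the specs for every car on each loop. This is specific to XPath; the other locator types (class name, tag name and so on) are already scoped to the element you call them on.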
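The difference between searching from the root and searching relative to a card can be seen without a browser. This sketch swaps Selenium for the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so it illustrates the scoping idea rather than Selenium's exact behaviour. The two-card markup is a simplified stand-in, and placing the class names from the question's locators on these tags is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified two-card page. The class names come from the
# locators used in the question; the second car is invented.
PAGE = """
<html><body>
  <section class="product-card-details">
    <h3 class="product-card-details__title">Mercedes-Benz A-Class</h3>
    <ul class="listing-key-specs">
      <li>2018 (68 reg)</li>
      <li>Hatchback</li>
    </ul>
  </section>
  <section class="product-card-details">
    <h3 class="product-card-details__title">Example Car B</h3>
    <ul class="listing-key-specs">
      <li>2020 (70 reg)</li>
      <li>Estate</li>
    </ul>
  </section>
</body></html>
"""

root = ET.fromstring(PAGE)
cards = root.findall(".//section[@class='product-card-details']")

# Searching from the document root ignores card boundaries, so every <li>
# on the page matches. This is the effect "//li[...]" has in Selenium.
from_root = [li.text for li in root.findall(".//li")]

# Searching relative to each card keeps the results separated per card,
# which is the list-of-lists shape the question asks for.
all_specs = [[li.text for li in card.findall(".//li")] for card in cards]

print(from_root)
print(all_specs)
```

In the Selenium loop, the same per-card result comes from the corrected call with the leading dot shown in this answer.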