Why am I getting just one item (instead of multiple items) in a pandas column?

Here is my code:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import pandas as pd

driver = webdriver.Chrome(service=Service(executable_path=ChromeDriverManager().install()))
driver.maximize_window()
driver.get('https://quotes.toscrape.com/')

df = pd.DataFrame(
    {        
        'Quote': [''],        
        'Author': [''],
        'Tags': [''],
    }
)

quotes = driver.find_elements(By.CSS_SELECTOR, '.quote')
for quote in quotes:
    text = quote.find_element(By.CSS_SELECTOR, '.text')
    author = quote.find_element(By.CSS_SELECTOR, '.author')
    
    tags = quote.find_elements(By.CSS_SELECTOR, '.tag')
    for tag in tags:
        quote_tag = tag

    df = df.append(
        {            
            'Quote': text.text,
            'Author': author.text,            
            'Tags': quote_tag.text,
        },        
        ignore_index = True
    )

df.to_csv('C:/Users/Jay/Downloads/Python/!Learn/practice/scraping/selenium/quotes.csv', index=False)

I should be getting this result:

Quote	Author	Tags
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”	Albert Einstein	change deep-thoughts thinking world

Instead I'm getting this:

Quote	Author	Tags
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”	Albert Einstein	world

I'm getting just the last item in the Tags column instead of all four items.

If I run:

quotes = driver.find_elements(By.CSS_SELECTOR, '.quote')
for quote in quotes:        
    tags = quote.find_elements(By.CSS_SELECTOR, '.tag')
    for tag in tags:
        quote_tag = tag
        print(quote_tag.text)

I get:

change
deep-thoughts
thinking
world
etc

So that piece of code works.

Why isn't the Tags column being populated appropriately?

CodePudding user response：

With your code

for tag in tags:
    quote_tag = tag

you reassign quote_tag on every iteration of the for loop, overwriting the value stored by the previous iteration. After the last iteration, quote_tag holds only the last tag.

You need to do something like

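quote_tag = ''
for tag in tags:
    quote_tag += ' ' + tag.text

if you want to concatenate all tags together. Because quote_tag is now a string rather than an element, use quote_tag.strip() instead of quote_tag.text when you build the row; strip() also removes the leading space added on the first iteration.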
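
The difference between overwriting and accumulating can be reproduced without a browser. This is a minimal sketch, assuming a hypothetical FakeTag class that stands in for a Selenium element and exposes only a .text attribute; the tag values are the ones from the question.

```python
from dataclasses import dataclass


@dataclass
class FakeTag:
    """Hypothetical stand-in for a Selenium WebElement; only .text is modelled."""
    text: str


tags = [
    FakeTag("change"),
    FakeTag("deep-thoughts"),
    FakeTag("thinking"),
    FakeTag("world"),
]

# Overwriting: each iteration replaces the previous value.
last_only = ""
for tag in tags:
    last_only = tag.text

# Accumulating: each iteration adds to what is already stored.
all_tags = ""
for tag in tags:
    all_tags += " " + tag.text
all_tags = all_tags.strip()  # drop the leading space added on the first pass

print(last_only)  # world
print(all_tags)   # change deep-thoughts thinking world
```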

CodePudding user response：

For your loop, use this code:

quote_tags = []
for tag in tags:
    quote_tags.append(tag.text)

df = df.append(
    {            
        'Quote': text.text,
        'Author': author.text,            
        'Tags': ' '.join(quote_tags),
    },        
    ignore_index = True
)

It is no coincidence that the only tag being added (world) is the very last one. Your loop assigns each tag to the quote_tag variable but does nothing with it, so the next iteration overwrites the value set by the previous one. When the loop finishes, quote_tag holds the last tag.
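
One further point: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent pandas the df.append call raises an AttributeError. A possible alternative, sketched here with hypothetical sample data in place of the scraped elements, is to collect one dict per quote in a plain list and build the DataFrame once at the end.

```python
import pandas as pd

# Hypothetical sample data standing in for what Selenium would return:
# (quote text, author, list of tag texts)
scraped = [
    ("Quote one", "Author A", ["change", "deep-thoughts", "thinking", "world"]),
    ("Quote two", "Author B", ["abilities", "choices"]),
]

rows = []
for text, author, tag_texts in scraped:
    rows.append(
        {
            "Quote": text,
            "Author": author,
            "Tags": " ".join(tag_texts),  # all tags in one space-separated cell
        }
    )

# Build the DataFrame once, after the loop, instead of appending row by row.
df = pd.DataFrame(rows, columns=["Quote", "Author", "Tags"])
print(df)
```

Starting from an empty list also avoids the blank first row that the question's initial pd.DataFrame({'Quote': [''], ...}) creates.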