How to scrape the next span tag? - CodePudding

HTML code I want to scrape:

<div  data-spm="breadcrumb" id="J_breadcrumb_list">
<ul  id="J_breadcrumb">
<li >
<span >
<a  href="https://www.lazada.vn/may-vi-tinh-laptop/" title="Computers &amp; Laptops">
<span>Computers &amp; Laptops</span>
<div ></div>
</a>
</span>
</li>
<li >
<span >
<a  href="https://www.lazada.vn/laptop/" title="Laptops">
<span>Laptops</span>
<div ></div>
</a>
</span>
</li>
<li >
<span >
<a  href="https://www.lazada.vn/laptop-co-ban/" title="Traditional Laptops">
<span>Traditional Laptops</span>
<div ></div>
</a>
</span>
</li>

Below is my code to scrape the first span tag (Computers & Laptops) using BeautifulSoup in Python 3. How can I access and scrape the second span tag (Laptops)?

def get_url1(search_term1):
    template1 = 'https://www.lazada.vn//www.lazada.vn/products/{}.html'
    search_term1 = search_term1.replace(' ', ' ')
    return template1.format(search_term1)

url1 = get_url1(tag1.get('href'))
driver.get(url1)
soup1 = BeautifulSoup(driver.page_source, 'html.parser')
results1 = soup1.find_all('div', {'id': 'J_breadcrumb_list'})
item1 = results1[0]
tag2 = item1.span.a
tag2.text.strip()

CodePudding user response：

What you need is the text of a span tag, so search for the <span> elements inside the "soup" you have already obtained.

def get_url1(search_term1):
    template1 = 'https://www.lazada.vn//www.lazada.vn/products/{}.html'
    search_term1 = search_term1.replace(' ', ' ')
    return template1.format(search_term1)

url1 = get_url1(tag1.get('href'))
driver.get(url1)
soup1 = BeautifulSoup(driver.page_source, 'html.parser')

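# New code: find() returns a single Tag; find_all() would return a
# ResultSet, which has no find_all() method of its own
results1 = soup1.find('div', {'id': 'J_breadcrumb_list'})
results2 = results1.find_all('span')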

.find_all() returns a list-like ResultSet, so you can iterate over it and process each element. To get the content of a span tag, read its .text attribute.

def get_url1(search_term1):
    template1 = 'https://www.lazada.vn//www.lazada.vn/products/{}.html'
    search_term1 = search_term1.replace(' ', ' ')
    return template1.format(search_term1)

url1 = get_url1(tag1.get('href'))
driver.get(url1)
soup1 = BeautifulSoup(driver.page_source, 'html.parser')

# New code: use find() to get the single breadcrumb <div> as a Tag
results1 = soup1.find('div', {'id': 'J_breadcrumb_list'})
results2 = results1.find_all('span')

for span in results2:
    print(span.text)
    # do stuff
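
One caveat with the loop above: in the breadcrumb markup from the question, each <li> holds an outer <span> that wraps an <a>, which in turn wraps an inner <span>, so find_all('span') matches both levels and each label comes out twice. Here is a minimal sketch of one way around that, assuming a trimmed copy of the question's HTML pasted into a string as a stand-in for driver.page_source. The names html, breadcrumb, all_spans and labels are illustrative only. It narrows the search to the inner spans with a CSS selector:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the breadcrumb markup from the question (a static stand-in
# for driver.page_source; the empty <div> inside each link is omitted).
html = """
<div data-spm="breadcrumb" id="J_breadcrumb_list">
<ul id="J_breadcrumb">
<li><span><a href="https://www.lazada.vn/may-vi-tinh-laptop/" title="Computers &amp; Laptops">
<span>Computers &amp; Laptops</span></a></span></li>
<li><span><a href="https://www.lazada.vn/laptop/" title="Laptops">
<span>Laptops</span></a></span></li>
<li><span><a href="https://www.lazada.vn/laptop-co-ban/" title="Traditional Laptops">
<span>Traditional Laptops</span></a></span></li>
</ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
breadcrumb = soup.find("div", id="J_breadcrumb_list")

# find_all("span") matches the outer and the inner span of every <li>.
all_spans = breadcrumb.find_all("span")

# "a > span" keeps only the spans that are direct children of a link.
labels = [span.get_text(strip=True) for span in breadcrumb.select("a > span")]

for label in labels:
    print(label)
```

If the real page nests its spans differently, the selector would need adjusting; reading each link's title attribute with a.get('title') is another option that avoids the nested spans altogether.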

Edit: If I may, here is one tip to make the code clearer: use meaningful variable names, so that it is much easier to read.

def get_search_url(keywords):
    keywords = keywords.replace(' ', ' ')
    url = f'https://www.lazada.vn//www.lazada.vn/products/{keywords}.html'
    return url

url = get_search_url(tag.get('href'))
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')

# find() returns the single breadcrumb <div> as a Tag, which supports find_all()
breadcrumb = soup.find('div', {'id': 'J_breadcrumb_list'})
spans = breadcrumb.find_all('span')

for span in spans:
    print(span.text)
    # do stuff

CodePudding user response：

Many thanks for all the responses. You can change the index in the brackets of 'item1 = results1[0]' to get the result you want; for example, results1[1] selects the second match.

My solution is:

def get_url1(search_term1):
    template1 = 'https://www.lazada.vn//www.lazada.vn/products/{}.html'
    search_term1 = search_term1.replace(' ', ' ')
    return template1.format(search_term1)

url1 = get_url1(tag1.get('href'))
driver.get(url1)

soup1 = BeautifulSoup(driver.page_source, 'html.parser')
results1 = soup1.find_all('span', class_='breadcrumb_item_text')

item1 = results1[0]
item1.text.strip()
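
Indexing into the find_all() result, as described above, can be sketched as follows. This assumes, based on the class name used in the solution, that each outer breadcrumb span carries the class breadcrumb_item_text. The class attributes are missing from the HTML shown in the question, so the sample markup here is a reconstruction rather than the real page, and the names html, results, first and second are illustrative.

```python
from bs4 import BeautifulSoup

# Reconstructed sample: the class name is taken from the solution above and
# is an assumption about the real page, not copied from it.
html = """
<ul id="J_breadcrumb">
<li><span class="breadcrumb_item_text"><a title="Computers &amp; Laptops"><span>Computers &amp; Laptops</span></a></span></li>
<li><span class="breadcrumb_item_text"><a title="Laptops"><span>Laptops</span></a></span></li>
<li><span class="breadcrumb_item_text"><a title="Traditional Laptops"><span>Traditional Laptops</span></a></span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Only the outer spans carry the class, so each breadcrumb matches once.
results = soup.find_all("span", class_="breadcrumb_item_text")

first = results[0].get_text(strip=True)   # index 0: first breadcrumb
second = results[1].get_text(strip=True)  # index 1: second breadcrumb
print(first)
print(second)
```

Because list indices start at 0, the second breadcrumb sits at index 1; checking len(results) first avoids an IndexError on pages with fewer breadcrumb levels.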