HTML code I want to scrape:
<div data-spm="breadcrumb" id="J_breadcrumb_list">
  <ul id="J_breadcrumb">
    <li>
      <span>
        <a href="https://www.lazada.vn/may-vi-tinh-laptop/" title="Computers & Laptops">
          <span>Computers & Laptops</span>
          <div></div>
        </a>
      </span>
    </li>
    <li>
      <span>
        <a href="https://www.lazada.vn/laptop/" title="Laptops">
          <span>Laptops</span>
          <div></div>
        </a>
      </span>
    </li>
    <li>
      <span>
        <a href="https://www.lazada.vn/laptop-co-ban/" title="Traditional Laptops">
          <span>Traditional Laptops</span>
          <div></div>
        </a>
      </span>
    </li>
This is my code to scrape the first span tag (Computers & Laptops) using BeautifulSoup in Python 3. How can I access and scrape the second span tag (Laptops)?
def get_url1(search_term1):
    template1 = 'https://www.lazada.vn//www.lazada.vn/products/{}.html'
    search_term1 = search_term1.replace(' ', ' ')
    return template1.format(search_term1)

url1 = get_url1(tag1.get('href'))
driver.get(url1)
soup1 = BeautifulSoup(driver.page_source, 'html.parser')
results1 = soup1.find_all('div', {'id': 'J_breadcrumb_list'})
item1 = results1[0]
tag2 = item1.span.a
tag2.text.strip()
CodePudding user response:
What you need is the text of each span tag, so look for the <span> elements in the "soup" you have already obtained.
def get_url1(search_term1):
    template1 = 'https://www.lazada.vn//www.lazada.vn/products/{}.html'
    search_term1 = search_term1.replace(' ', ' ')
    return template1.format(search_term1)

url1 = get_url1(tag1.get('href'))
driver.get(url1)
soup1 = BeautifulSoup(driver.page_source, 'html.parser')

# New code
# find() returns a single Tag; the list returned by find_all() has no find_all() method
results1 = soup1.find('div', {'id': 'J_breadcrumb_list'})
results2 = results1.find_all('span')
.find_all() returns a list, so you can iterate over it and process each element. To get the content of a span tag, just access its .text attribute.
def get_url1(search_term1):
    template1 = 'https://www.lazada.vn//www.lazada.vn/products/{}.html'
    search_term1 = search_term1.replace(' ', ' ')
    return template1.format(search_term1)

url1 = get_url1(tag1.get('href'))
driver.get(url1)
soup1 = BeautifulSoup(driver.page_source, 'html.parser')

# New code
# find() returns a single Tag; the list returned by find_all() has no find_all() method
results1 = soup1.find('div', {'id': 'J_breadcrumb_list'})
results2 = results1.find_all('span')
for span in results2:
    print(span.text)
    # do stuff
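As a rough, self-contained check of the approach above, the sketch below parses a trimmed copy of the breadcrumb markup from the question instead of a live page. In that markup each label <span> sits inside an outer <span>, so find_all('span') returns both levels. The CSS selector "a > span" and the variable names are choices made for this illustration, not part of the original code.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the breadcrumb markup from the question (no live page needed).
HTML = """
<div data-spm="breadcrumb" id="J_breadcrumb_list">
  <ul id="J_breadcrumb">
    <li><span><a href="https://www.lazada.vn/may-vi-tinh-laptop/" title="Computers &amp; Laptops"><span>Computers &amp; Laptops</span><div></div></a></span></li>
    <li><span><a href="https://www.lazada.vn/laptop/" title="Laptops"><span>Laptops</span><div></div></a></span></li>
    <li><span><a href="https://www.lazada.vn/laptop-co-ban/" title="Traditional Laptops"><span>Traditional Laptops</span><div></div></a></span></li>
  </ul>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
breadcrumb = soup.find("div", id="J_breadcrumb_list")

# Every <span> under the div, outer wrappers included.
all_spans = breadcrumb.find_all("span")

# Only the <span> elements that are direct children of an <a>.
label_spans = breadcrumb.select("a > span")
labels = [span.get_text(strip=True) for span in label_spans]

print(len(all_spans))  # 6: three outer spans plus three label spans
print(labels)
print(labels[1])       # the second breadcrumb
```

With this markup, looping over all_spans would print each label twice (once through the outer span, once through the inner one), which is why the sketch narrows the match before reading the text.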
Edit: If I may, here is a tip to make the code clearer: use meaningful names for the variables, so that it is much easier to read.
def get_search_url(keywords):
    # As posted, this replaces a space with a space, so it has no effect
    keywords = keywords.replace(' ', ' ')
    url = f'https://www.lazada.vn//www.lazada.vn/products/{keywords}.html'
    return url

url = get_search_url(tag.get('href'))
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
# find() returns a single Tag, which supports find_all()
breadcrumb = soup.find('div', {'id': 'J_breadcrumb_list'})
spans = breadcrumb.find_all('span')
for span in spans:
    print(span.text)
    # do stuff
CodePudding user response:
Many thanks for all the help. You can change the index in the brackets of 'item1 = results1[0]' to get the other results.
My solution is:
def get_url1(search_term1):
    template1 = 'https://www.lazada.vn//www.lazada.vn/products/{}.html'
    search_term1 = search_term1.replace(' ', ' ')
    return template1.format(search_term1)

url1 = get_url1(tag1.get('href'))
driver.get(url1)
soup1 = BeautifulSoup(driver.page_source, 'html.parser')
results1 = soup1.find_all('span', class_='breadcrumb_item_text')
# Index 0 is the first match; change the index to pick another one
item1 = results1[0]
item1.text.strip()
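The index-based lookup above could be wrapped as in the sketch below, which returns None instead of raising IndexError when the page has fewer breadcrumbs than expected. The helper name nth_breadcrumb is hypothetical, and the sample markup assumes that the label spans carry the class breadcrumb_item_text, as the solution implies; the HTML posted in the question does not show that class.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the question's breadcrumb with the class name used in
# the solution added to the label spans (an assumption, not the real page).
HTML = """
<ul id="J_breadcrumb">
  <li><a title="Computers &amp; Laptops"><span class="breadcrumb_item_text">Computers &amp; Laptops</span></a></li>
  <li><a title="Laptops"><span class="breadcrumb_item_text">Laptops</span></a></li>
  <li><a title="Traditional Laptops"><span class="breadcrumb_item_text">Traditional Laptops</span></a></li>
</ul>
"""


def nth_breadcrumb(soup, index):
    """Return the text of the breadcrumb at `index`, or None if it is missing."""
    items = soup.find_all("span", class_="breadcrumb_item_text")
    if 0 <= index < len(items):
        return items[index].get_text(strip=True)
    return None


soup = BeautifulSoup(HTML, "html.parser")
first = nth_breadcrumb(soup, 0)     # first breadcrumb
second = nth_breadcrumb(soup, 1)    # second breadcrumb
missing = nth_breadcrumb(soup, 10)  # out of range, so None
print(first, second, missing, sep=" | ")
```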