Extracting text from a nested tag

Time:01-17

The following is the nested tag I would like to process:

<h5><span class="price">31.8 萬</span>2014 NISSAN MARCH</h5>

And here is my successful attempt to extract the price:

price = i.find("span", attrs = {"class" : "price"})

However, when I tried

name = i.find("h5").span.find_next_sibling(text=True)

I got the error 'NoneType' object has no attribute 'find_next_sibling'. I am hoping for a solution that is similar to my successful attempt. Thank you. ; )

Edit: The following is my complete code.

def get_basic_info(content_list):
    basic_info = []
    for item in content_list:
        basic_info.append(item.find_all('h5'))
    return basic_info

names = []
def get_names(basic_info):
    for item in basic_info:
        for i in item:
            name = i.find("span", attrs = {"class" : "price"}).find_next_sibling()
        if name:
            names.append(name.text)
          
    return(names)

for page in range(1,18):
    base_url = "https://www.easycar.tw/carList.php?Action=search&show=col&lifting=desc&year=&year1=&page=" + str(page)
    response = get(base_url, headers=headers)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    content_list = html_soup.find_all('div', attrs={'class': 'caption'})
    basic_info = get_basic_info(content_list)
    names = get_names(basic_info)

CodePudding user response:

i.find("h5").span.find_next_sibling(text=True) should give you '2014 NISSAN MARCH', so most likely i.find("h5") is not returning the heading you expect. Try printing i.find("h5") to check whether it is the correct heading. Alternatively, you can get the text you want with

soup.find("span", attrs = {"class" : "price"}).find_next_sibling(text=True)


Edit after question was updated:

Since i in the loop is already an h5 Tag object, there is no need to call find("h5") on it. find only searches a tag's descendants, so that call returns None here. This is the updated code that works:

i.span.find_next_sibling(text=True)
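
To make the cause of the error concrete, here is a minimal sketch run against the snippet from the question rather than the live site. It assumes the span carries class="price" (as the question's own find call implies), and the wrapping div is only there to mimic the page structure. It uses string=True, which is the newer spelling of the text=True argument shown above.

```python
from bs4 import BeautifulSoup

# Inline copy of the question's markup; class="price" is an assumption
# taken from the question's own find("span", attrs={"class": "price"}).
html = (
    '<div class="caption">'
    '<h5><span class="price">31.8 萬</span>2014 NISSAN MARCH</h5>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

h5 = soup.find("h5")

# find() only looks at descendants, so an <h5> never finds itself.
# Chaining .span onto this result is what raised the 'NoneType' error.
nested = h5.find("h5")

# Starting from the span, take the next sibling that is a text node.
name = h5.span.find_next_sibling(string=True)

print(nested)  # None
print(name)    # 2014 NISSAN MARCH
```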

CodePudding user response:

There are different ways to reach your goal. For selecting the correct <h5>, you could dig a bit deeper with this answer: How to scrape last string of <p> tag element?

Following your example, iterate over all <span> elements in your selection. CSS selectors are used here to be more specific:

for e in soup.select('h5 span.price'):
    print(e.next_sibling.strip())

Example

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.easycar.tw/carList.php?Action=search&show=col&lifting=desc&year=&year1=&page=1'

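# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(requests.get(base_url).text, 'html.parser')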

for e in soup.select('h5 span.price'):
    print(e.next_sibling.strip())
    ### or getting only car names
    print(e.next_sibling.strip().split(' ', 1)[-1])

Output

2020 TOYOTA ALTIS
2019 NISSAN LIVINA
2019 TOYOTA VIOS
2016 HONDA CITY
2017 HONDA HR-V
2018 TOYOTA YARIS
...

or

TOYOTA ALTIS
NISSAN LIVINA
TOYOTA VIOS
HONDA CITY
HONDA HR-V
...