Finding the correct HTML tag for data scraping in Python

Time:02-23

I am new to data scraping in Python and would like help working out which tag and class I should use in my code to extract the information.

<tr><td > Color </td>
    <td>Iron Grey </td></tr>

This is the HTML. I need to get "Iron Grey", but since that cell has no class associated with it, I am not able to scrape it. If I use the class specs_heading in my code, I get "Color" instead of "Iron Grey".

It would be great if someone could help. Thanks!

CodePudding user response:

You can use find() to locate the row and then call findChildren() on the result, indexing the child you want returned. In this example, index 1 returns the second child node, which is the second <td>, the one that wraps the "Iron Grey" text.

Example Code:

from bs4 import BeautifulSoup

sample_html = """<html><tr><td > Color </td>
    <td>Iron Grey </td></tr></html>"""

soup = BeautifulSoup(sample_html, "lxml")

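# Locate the row, then take its second child tag (index 1), i.e. the second <td>.
parent_node = soup.find("tr")
# strip() removes the trailing space that follows "Iron Grey" in the markup.
matched_child_node = parent_node.findChildren()[1].text.strip()

print(matched_child_node) # Iron Grey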

I can't say whether this is the best way to get the data you want, but it is one way if you know exactly what you want and where it is positioned in the DOM.
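
If the position of the cell is not guaranteed, another option is to find the label cell by its text and then take the <td> that follows it. The following is only a minimal sketch: it assumes the label reads exactly "Color" once surrounding whitespace is stripped, the helper name value_for_label is hypothetical, the row is wrapped in a <table> for the example, and "html.parser" is used instead of "lxml" only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

sample_html = """<table><tr><td> Color </td>
    <td>Iron Grey </td></tr></table>"""

soup = BeautifulSoup(sample_html, "html.parser")


def value_for_label(soup, label):
    """Return the text of the <td> that follows the <td> whose text is `label`.

    Hypothetical helper for illustration; returns None when nothing matches.
    """
    for cell in soup.find_all("td"):
        # Compare on stripped text so " Color " matches "Color".
        if cell.get_text(strip=True) == label:
            # find_next_sibling("td") skips whitespace text nodes.
            value_cell = cell.find_next_sibling("td")
            if value_cell is not None:
                return value_cell.get_text(strip=True)
    return None  # label not found, or no value cell after it


print(value_for_label(soup, "Color"))
```

Because the lookup is keyed on the label rather than on an index, it should keep working if more rows or cells are added around the one you need, as long as the label text itself does not change.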

CodePudding user response:

The tag you are looking for is the second next sibling of the "specs-heading" cell (the first sibling is the whitespace text node that contains the line break):

soup.find(class_="specs-heading").next_sibling.next_sibling.string
#'Iron Grey '

If you want a more robust solution that does not depend on the presence or absence of line breaks, ask for the next <td> siblings of the cell that has your class and take the first one:

soup.find(class_="specs-heading")\
    .find_next_siblings('td')[0].string
#'Iron Grey '
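
To try both variants end to end, here is a self-contained sketch. It assumes that the label cell carries class="specs-heading" (the attribute does not appear in the HTML as posted in the question), wraps the row in a <table>, and uses "html.parser" so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Assumption: the label cell has class="specs-heading"; the attribute is
# missing from the HTML shown in the question.
sample_html = """<table><tr><td class="specs-heading"> Color </td>
    <td>Iron Grey </td></tr></table>"""

soup = BeautifulSoup(sample_html, "html.parser")
heading = soup.find(class_="specs-heading")

# The immediate sibling is the whitespace text node between the two cells,
# which is why the first variant needs .next_sibling twice.
whitespace = heading.next_sibling
print(repr(whitespace))

# find_next_sibling("td") skips text nodes and returns the first matching tag.
value = heading.find_next_sibling("td").get_text(strip=True)
print(value)
```

If the markup were minified so that no whitespace sits between the cells, the double .next_sibling chain would step past the value cell, while find_next_sibling("td") would still land on it.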