How to scrape data from specific child of an element?

Time:09-18

Hello, this is dummy code, but I want to get the text Doodle from the second <a>:

<div class="test">
    <a href="www.google.com"></a>
    <a href="https://www.google.com/doodles"> Doodle </a>
</div>

These are my failed attempts:

soup.find('div', {'class' : 'test'}) #1
soup.find('div', {'class' : 'test'}).next_sibling #2

CodePudding user response:

doodletext = soup.find('div', {'class' : 'test'})
print(doodletext.text)

This works only for this example, because Doodle is the only text inside the div. If you need to find Doodle and there is other text around it, you may need to use split() to isolate the specific string you are looking for.
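
As a rough sketch of the split() approach, assuming the div has class test as in the question, and with a hypothetical "Search" label added to the first link purely for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the question's div, with extra text in the first link.
html = '''
<div class="test">
    <a href="www.google.com"> Search </a>
    <a href="https://www.google.com/doodles"> Doodle </a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

div = soup.find('div', {'class': 'test'})

# .text concatenates every string inside the div, whitespace included.
raw = div.text

# split() with no arguments splits on runs of whitespace and drops empty pieces.
words = raw.split()

# Taking the last word only works while the target is a single word at the end.
doodle_text = words[-1]
print(doodle_text)
```

This is brittle: it breaks as soon as the wanted text has more than one word or is no longer last, so selecting the <a> itself is usually the safer route.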

CodePudding user response:

There are many ways to reach your goal; which one fits depends, as always, on the pattern or structure you have to work with.

Assuming the <a> is always the second one, you could use a CSS selector such as :nth-of-type(2):

soup.select_one('div.test a:nth-of-type(2)').get_text(strip=True)
#Doodle

Assuming it is always the last one, you could also index into the ResultSet:

soup.find('div', {'class' : 'test'}).find_all('a')[-1].get_text(strip=True)  
#Doodle

Or, as an alternative, use CSS selectors for the last one:

soup.select_one('div.test a:last-of-type').get_text(strip=True)
#Doodle

soup.select('div.test a')[-1].get_text(strip=True)
#Doodle

Example

from bs4 import BeautifulSoup

html = '''
<div class="test">
    <a href="www.google.com"></a>
    <a href="https://www.google.com/doodles"> Doodle </a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.select_one('div.test a:nth-of-type(2)').get_text(strip=True))

print(soup.find('div', {'class' : 'test'}).find_all('a')[-1].get_text(strip=True))

Output for both:

Doodle
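
Note that the one-liners above raise an exception when nothing matches: select_one() and find() return None, and indexing an empty result list raises IndexError. A minimal guarded sketch, assuming the div has class test as in the question (nth_link_text is a hypothetical helper name, not part of BeautifulSoup):

```python
from bs4 import BeautifulSoup

html = '''
<div class="test">
    <a href="www.google.com"></a>
    <a href="https://www.google.com/doodles"> Doodle </a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')


def nth_link_text(soup, n, default=None):
    """Hypothetical helper: stripped text of the n-th <a> (1-based) in div.test."""
    tag = soup.select_one(f'div.test a:nth-of-type({n})')
    # select_one() returns None when the selector matches nothing.
    return tag.get_text(strip=True) if tag is not None else default


print(nth_link_text(soup, 2))             # second link
print(nth_link_text(soup, 3, 'missing'))  # there is no third link here
```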

CodePudding user response:

Check this question: beautifulsoup - extracting link within a div

And try this:

for div in soup.find_all('div', {'class': 'test'}):
    a = div.find_all('a')[1]
    print(a.text.strip())
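
If some of the matching divs might contain fewer than two links, the [1] index raises IndexError. A sketch of the same loop with a length check, using hypothetical markup with a second, one-link div added for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second div.test has only one link.
html = '''
<div class="test">
    <a href="www.google.com"></a>
    <a href="https://www.google.com/doodles"> Doodle </a>
</div>
<div class="test">
    <a href="www.google.com"></a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

texts = []
for div in soup.find_all('div', {'class': 'test'}):
    links = div.find_all('a')
    # Skip divs that do not have a second <a>.
    if len(links) < 2:
        continue
    texts.append(links[1].get_text(strip=True))

print(texts)
```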