Hello, this is dummy code, but I want to get the Doodle text from the second <a>:
<div class="test">
<a href="www.google.com"></a>
<a href="https://www.google.com/doodles"> Doodle </a>
</div>
These are my failed attempts:
soup.find('div', {'class' : 'test'}) #1
soup.find('div', {'class' : 'test'}).next_sibling #2
CodePudding user response:
doodletext = soup.find('div', {'class' : 'test'})
print(doodletext.text)
This works only for this exact example, because .text returns all of the
text inside the div. If there is other text around Doodle, you may need to use
the split() function to drill down to the specific string you are looking for.
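As a minimal sketch of the split() idea, assuming the div holds extra text around the word you want (the surrounding words and the index used here are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same div, but with extra text around the link.
html = '''
<div class="test">
See the <a href="https://www.google.com/doodles"> Doodle </a> for today
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
div_text = soup.find('div', {'class': 'test'}).text

# split() with no arguments splits on any run of whitespace,
# so leading/trailing newlines and double spaces are dropped.
words = div_text.split()
print(words)

# The position of the word is an assumption about this sample text.
doodle = words[2]
print(doodle)
```

This is fragile, since it depends on the word position staying the same; selecting the <a> directly is usually more robust.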
CodePudding user response:
There are many ways to reach your goal; what matters, as always, is the pattern or structure you have to work with.
Assuming that the <a> is always the second one, you could use CSS selectors like :nth-of-type(2):
soup.select_one('div.test a:nth-of-type(2)').get_text(strip=True)
#Doodle
Assuming it is always the last one, you could also index into your ResultSet:
soup.find('div', {'class' : 'test'}).find_all('a')[-1].get_text(strip=True)
#Doodle
Or, as an alternative, use CSS selectors for the last one:
soup.select_one('div.test a:last-of-type').get_text(strip=True)
#Doodle
soup.select('div.test a')[-1].get_text(strip=True)
#Doodle
Example
from bs4 import BeautifulSoup
html = '''
<div class="test">
<a href="www.google.com"></a>
<a href="https://www.google.com/doodles"> Doodle </a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('div.test a:nth-of-type(2)').get_text(strip=True))
print(soup.find('div', {'class' : 'test'}).find_all('a')[-1].get_text(strip=True))
Output for both
Doodle
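Which assumption you pick matters as soon as the markup changes. A small sketch, assuming a hypothetical third link is appended to the div (the "About" link is not part of the original question), shows how the "second" and "last" approaches diverge:

```python
from bs4 import BeautifulSoup

# Hypothetical variation of the markup with a third link appended.
html = '''
<div class="test">
<a href="www.google.com"></a>
<a href="https://www.google.com/doodles"> Doodle </a>
<a href="https://www.google.com/about"> About </a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Second <a> among its siblings.
second = soup.select_one('div.test a:nth-of-type(2)').get_text(strip=True)
# Last <a> among its siblings, via CSS selector.
last = soup.select_one('div.test a:last-of-type').get_text(strip=True)
# Last <a> via indexing into the ResultSet.
last_by_index = soup.select('div.test a')[-1].get_text(strip=True)

print(second)
print(last)
print(last_by_index)
```

Here only the :nth-of-type(2) variant still returns Doodle; the "last one" variants now return the text of the appended link.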
CodePudding user response:
Check this question: beautifulsoup - extracting link within a div
And try this:
for div in soup.find_all('div', {'class': 'test'}):
    a = div.find_all('a')[1]
    print(a.text.strip())
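Note that indexing with [1] raises an IndexError if a matching div contains fewer than two links. A slightly more defensive sketch of the same loop, wrapped in a hypothetical helper named second_link_texts (not part of BeautifulSoup), could look like this:

```python
from bs4 import BeautifulSoup

def second_link_texts(soup, class_name='test'):
    """Return the stripped text of the second <a> in each matching div."""
    texts = []
    for div in soup.find_all('div', {'class': class_name}):
        links = div.find_all('a')
        # Skip divs that do not have a second link instead of failing.
        if len(links) > 1:
            texts.append(links[1].get_text(strip=True))
    return texts

# Sample markup: the second div is a made-up case with only one link.
html = '''
<div class="test">
<a href="www.google.com"></a>
<a href="https://www.google.com/doodles"> Doodle </a>
</div>
<div class="test">
<a href="www.google.com"></a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
result = second_link_texts(soup)
print(result)
```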