I want to extract the text from a tag. Unfortunately I can't get just the text; the result always includes the links inside it.
Is it possible to remove all of the <img> and <a href> tags from my text?
<div data-handler="xxx">its a good day
<a href="https://" title="text">https:// link</a></div>
I just want to recover this: its a good day
and ignore the content of the <a href> tag in my <div> tag.
Currently I perform the extraction via beautifulsoup.find('div').
CodePudding user response:
Let's import re and use re.sub:
import re
s1 = '<div data-handler="xxx">its a good day'
s2 = '<a href="https://" title="text">https:// link</a></div>'
# Greedy match: removes everything from the first '<' to the last '>'
s1 = re.sub(r'<[^()]*>', '', s1)
s2 = re.sub(r'<[^()]*>', '', s2)
Output
>>> s1
'its a good day'
>>> s2
''
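The snippet above works on two separate strings and relies on a greedy match. As a rough sketch for the case where the whole fragment arrives as one string, the link elements can be dropped first and the remaining tags afterwards. The variable names here are illustrative, and regular expressions remain fragile on real-world HTML, so treat this as a quick fix rather than a robust parser.

```python
import re

html = '<div data-handler="xxx">its a good day\n<a href="https://" title="text">https:// link</a></div>'

# Drop <a>...</a> elements together with their content (non-greedy, case-insensitive).
no_links = re.sub(r'<a\b[^>]*>.*?</a>', '', html, flags=re.IGNORECASE | re.DOTALL)

# Drop any remaining tags such as <div> or <img>, then trim whitespace.
text = re.sub(r'<[^>]+>', '', no_links).strip()
print(text)
```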
CodePudding user response:
Try this:
import requests
from bs4 import BeautifulSoup
#response = requests.get('your url')
html = BeautifulSoup('''<div data-handler="xxx">its a good day
<a href="https://" title="text">https:// link</a>
</div>''', 'html.parser')
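# The div is identified by its data-handler attribute, not by a class
soup = html.find_all('div', attrs={'data-handler': 'xxx'})
# Keep only the first line of text, i.e. the part before the link
print(soup[0].text.split('\n')[0])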
CodePudding user response:
EDIT
Based on your comment that all text before <a> should be captured, not only the first string in the element, select all previous_siblings and check for NavigableString:
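# previous_siblings iterates backwards from <a>, so reverse it to keep document order;
# the div is matched by its data-handler attribute, since it has no class
' '.join(
[s for s in reversed(list(soup.select_one('[data-handler="xxx"] a').previous_siblings)) if isinstance(s, NavigableString)]
)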
Example
from bs4 import Tag, NavigableString, BeautifulSoup
html='''
<div data-handler="xxx"><br>New wallpaper <br>Find over 100 of <a href="https://" title="text">https:// link</a></div>
'''
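soup = BeautifulSoup(html, 'html.parser')
# previous_siblings iterates backwards from <a>, so reverse it to keep document order
' '.join(
[s for s in reversed(list(soup.select_one('[data-handler="xxx"] a').previous_siblings)) if isinstance(s, NavigableString)]
)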
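The same idea can be wrapped in a small helper. This is only a sketch: the function name text_before_link and its handler parameter are hypothetical, and it assumes the target div is identified by its data-handler attribute and that only the first <a> inside it matters.

```python
from bs4 import BeautifulSoup, NavigableString


def text_before_link(html, handler="xxx"):
    """Return the text nodes that precede the first <a> in the matching <div>."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one(f'div[data-handler="{handler}"] a')
    if link is None:
        return ""
    # previous_siblings walks backwards from the link, so reverse it
    # to restore document order before joining.
    parts = [
        s.strip()
        for s in reversed(list(link.previous_siblings))
        if isinstance(s, NavigableString)
    ]
    return " ".join(p for p in parts if p)


sample = (
    '<div data-handler="xxx"><br>New wallpaper <br>Find over 100 of '
    '<a href="https://" title="text">https:// link</a></div>'
)
print(text_before_link(sample))
```

Stripping each piece before joining avoids the doubled spaces that a plain ' '.join would leave behind.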
To get just the text and not the child tags of an element, you could use:
.find(text=True)
If the pattern is always the same and the text is the first part of the element's content:
.contents[0]
Example
from bs4 import BeautifulSoup
html='''
<div data-handler="xxx">its a good day
<a href="https://" title="text">https:// link</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.div.find(text=True).strip())
Output
its a good day
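To answer the original question literally (removing the <a> and <img> tags themselves), one possible approach is to delete those tags from the tree with decompose() and then read the remaining text. This is a minimal sketch; the <img src="pic.png"> in the sample is a made-up addition for illustration and was not part of the original markup.

```python
from bs4 import BeautifulSoup

html = '''<div data-handler="xxx">its a good day
<a href="https://" title="text">https:// link</a><img src="pic.png"></div>'''

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", attrs={"data-handler": "xxx"})

# Remove every <a> and <img> inside the div, including their contents.
for tag in div.find_all(["a", "img"]):
    tag.decompose()

# Only the plain text of the div is left.
result = div.get_text(strip=True)
print(result)
```

Unlike the .find(text=True) approach, this keeps any text that appears after the removed tags as well.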