How to remove all tags from text in Python

Time:11-27

I want to extract data from a tag and retrieve only its text. Unfortunately, I can't get just the text: the result always includes the links inside the tag.

Is it possible to remove all of the <img> and <a href> tags from my text?

<div  data-handler="xxx">its a good day
<a  href="https://" title="text">https:// link</a></div>

I just want to recover this: "its a good day", and ignore the content of the <a href> tag inside my <div> tag.

Currently I perform the extraction via BeautifulSoup's .find('div').

CodePudding user response:

Let's import re and use re.sub:

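import re

s1 = '<div  data-handler="xxx">its a good day'
s2 = '<a  href="https://" title="text">https:// link</a></div>'

# The pattern is greedy: it removes everything from the first '<' to the last '>',
# so for s2 the link text between the tags is dropped as well.
s1 = re.sub(r'<[^()]*>', '', s1)
s2 = re.sub(r'<[^()]*>', '', s2)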

Output

>>> s1
'its a good day'
>>> s2
''
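
If the whole fragment is in a single string, a variation on the same idea is to first drop each <a>...</a> element together with its content and then strip whatever tags remain (including <img>). This is only a minimal sketch: the function name strip_links_and_tags is made up for illustration, and regular expressions are fragile on real-world HTML (nested or malformed tags, a '>' inside an attribute value), so a parser is usually the safer choice.

```python
import re

def strip_links_and_tags(fragment: str) -> str:
    """Hypothetical helper: drop <a>...</a> elements entirely, then remove all other tags."""
    # Remove <a ...>...</a> including the link text (non-greedy, may span newlines).
    no_links = re.sub(r'<a\b[^>]*>.*?</a\s*>', '', fragment,
                      flags=re.DOTALL | re.IGNORECASE)
    # Remove any remaining tags such as <div ...>, </div> or <img ...>.
    no_tags = re.sub(r'<[^>]+>', '', no_links)
    return no_tags.strip()

fragment = '''<div  data-handler="xxx">its a good day
<a  href="https://" title="text">https:// link</a></div>'''

print(strip_links_and_tags(fragment))  # its a good day
```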

CodePudding user response:

Try this:

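# import requests  # only needed to fetch a live page
from bs4 import BeautifulSoup

# response = requests.get('your url')

soup = BeautifulSoup('''<div  data-handler="xxx">its a good day
<a  href="https://" title="text">https:// link</a> 
</div>''', 'html.parser')

# The sample markup has no class attribute, so match on data-handler instead.
divs = soup.find_all('div', attrs={'data-handler': 'xxx'})

# Keep only the first line of the div's text, i.e. the part before the link.
print(divs[0].text.split('\n')[0])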
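
The split('\n') approach depends on a line break sitting between the text and the link. A sketch that does not rely on that is to delete the unwanted elements from the parsed tree before reading the text. It assumes the same sample markup as above and uses Beautiful Soup's decompose(), which removes a tag and everything inside it.

```python
from bs4 import BeautifulSoup

html = '''<div  data-handler="xxx">its a good day
<a  href="https://" title="text">https:// link</a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', attrs={'data-handler': 'xxx'})

# Delete every <a> and <img> element (and its content) inside the div.
for tag in div.find_all(['a', 'img']):
    tag.decompose()

# strip=True trims each remaining text node and drops the empty ones.
text = div.get_text(strip=True)
print(text)  # its a good day
```

Note that decompose() modifies the tree in place, so parse the HTML again if the links are needed later.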

CodePudding user response:

EDIT

Based on your comment that all text before the <a> should be captured, not only the first string in the element, select all previous_siblings of the <a> and keep the ones that are NavigableString instances:

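# previous_siblings walks backwards from <a>, so reverse it to keep document order.
' '.join(
    [s for s in reversed(list(soup.select_one('[data-handler="xxx"] a').previous_siblings))
     if isinstance(s, NavigableString)]
)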

Example

from bs4 import NavigableString, BeautifulSoup

html='''
<div  data-handler="xxx"><br>New wallpaper <br>Find over 100  of <a  href="https://" title="text">https:// link</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# previous_siblings walks backwards from <a>, so reverse it to keep document order.
link = soup.select_one('[data-handler="xxx"] a')
print(' '.join(
    [s for s in reversed(list(link.previous_siblings)) if isinstance(s, NavigableString)]
))

To get just the text of an element and not that of its child tags, you could use:

.find(string=True)

(string is the current name of this argument; older versions of Beautiful Soup call it text.)

If the pattern is always the same and the text is the first piece of content in the element:

.contents[0]

Example

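from bs4 import BeautifulSoup

html='''
<div  data-handler="xxx">its a good day
<a  href="https://" title="text">https:// link</a></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# string=True matches the first text node inside the div ("text=" in older bs4 versions).
print(soup.div.find(string=True).strip())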

Output

its a good day
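
When the div holds several pieces of text separated by other tags, one possible generalisation is to collect only the text nodes that are direct children of the div. The sketch below assumes the markup from the earlier <br> example; find_all(string=True, recursive=False) returns the direct text children, so the text inside <a> is left out.

```python
from bs4 import BeautifulSoup

html = '''<div  data-handler="xxx"><br>New wallpaper <br>Find over 100  of <a  href="https://" title="text">https:// link</a></div>'''

soup = BeautifulSoup(html, 'html.parser')

# recursive=False restricts the search to direct children of the div,
# so the text nested inside <a> is skipped.
direct_strings = soup.div.find_all(string=True, recursive=False)

# Join the pieces and collapse runs of whitespace.
text = ' '.join(' '.join(direct_strings).split())
print(text)  # New wallpaper Find over 100 of
```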