I have a DataFrame column that contains a lot of HTML tags, and I want to strip them out.
df['text_col'].tolist()
['n</li></ul>',
'<p> bla bla bla </p>',
'bla bla </b>, </li>, <li>, </ul>',
'bla bla <strong>bla </strong>: <h3> </h3>, <ul>,<b> </p>']
I see two ways of cleaning it.
- Create a list of all the tags I find in the text and replace each one with the empty string '' (the list can be laborious to maintain).
- Use some logic that removes anything between < and >.
I don't know of any way other than str.replace, but it doesn't quite do what I described above.
df["text_col"].str.replace("</p>"," ")
How do I remove all the tags and clean the text_col?
CodePudding user response:
After a little bit of looking around this is what I found:
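import re

x = ['n</li></ul>',
     '<p> bla bla bla </p>',
     'bla bla </b>, </li>, <li>, </ul>',
     'bla bla <strong>bla </strong>: <h3> </h3>, <ul>,<b> </p>']

for item in x:
    # Remove anything between < and >, plus stray commas and colons
    item = re.sub(r"<.*?>|,|:", "", item)
    # Collapse the leftover runs of whitespace into single spaces
    item = ' '.join(item.split())
    print(item)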
Outputs:
n
bla bla bla
bla bla
bla bla bla
I edited my answer to refine it a little more. Besides the tags, the pattern also strips commas and colons; drop the |,|: part if you want to keep that punctuation. Thank regex :)
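Since the question is about a DataFrame column rather than a plain list, here is a minimal sketch of the same idea applied directly to text_col with the pandas string methods. It assumes the column holds only strings, and it builds a small df from the sample data in the question purely for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"text_col": [
    "n</li></ul>",
    "<p> bla bla bla </p>",
    "bla bla </b>, </li>, <li>, </ul>",
    "bla bla <strong>bla </strong>: <h3> </h3>, <ul>,<b> </p>",
]})

# Same pattern as the loop above: any <...> tag, plus commas and colons.
# regex=True is passed explicitly because newer pandas versions treat the
# pattern as a literal string by default.
cleaned = df["text_col"].str.replace(r"<.*?>|,|:", "", regex=True)

# Collapse runs of whitespace into one space and trim both ends.
df["text_col"] = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()

print(df["text_col"].tolist())
```

This works on the whole column at once, so no explicit loop is needed. A regex like this is only a rough cleaner: it will also remove any legitimate text that happens to sit between a < and a >.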