I want to extract a specific piece of text from a larger string.
TEXT
test = """<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"><html><body><div dir="ltr"><p>test test test</p><p><a href="https://test.com/users/confirmationconfirmation_token=XXXXXX">https://test.com/users/confirmation?confirmation_token=XXXXXX</a></p>
<p>Link ile ilgili sorun yaşıyorsanız, kopyalayıp tarayıcınıza da yapıştırabilirsiniz.</p><p>Saygılarımızla,</p><p>test test test</p></div></body></html>"""
This is a string variable, not an HTML file.
I want to get the text "https://test.com/users/confirmation?confirmation_token=XXXXXX", but the token part (token=XXXXXX) changes every time.
Is there a way to extract only the text mentioned above? Even getting just the XXXXXX part would be enough for me.
CodePudding user response:
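SOLUTION
from bs4 import BeautifulSoup as bf

# x is the HTML string; in my case it comes from the API response
x = response['items']['body']
soup = bf(x, 'html.parser')
# Text of the first <a> tag inside <body>, i.e. the confirmation URL
print(soup.body.a.text)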
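If only the token is needed, one option is to parse the URL that `soup.body.a.text` returns rather than slicing the string by hand. Below is a minimal sketch using the standard library's `urllib.parse`; the helper name `extract_token` and the hard-coded `url` are assumptions made for illustration, not part of the original code.

```python
from urllib.parse import urlparse, parse_qs

def extract_token(url, param="confirmation_token"):
    """Return the value of `param` from the URL's query string, or None if absent."""
    query = urlparse(url).query          # e.g. "confirmation_token=XXXXXX"
    values = parse_qs(query).get(param)  # parse_qs maps each key to a list of values
    return values[0] if values else None

# Hypothetical input: the link text obtained from the parsed HTML
url = "https://test.com/users/confirmation?confirmation_token=XXXXXX"
print(extract_token(url))  # XXXXXX
```

This avoids depending on where the token sits in the URL, and it returns None when the parameter is missing.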
CodePudding user response:
You can use a regular expression to solve your problem:
import re

test = """<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"><html><body><div dir="ltr">
<p>test test test</p><p><a href="https://test.com/users/confirmationconfirmation_token=XXXXXX">https://test.com/users/confirmation?confirmation_token=XXXXXX</a></p>
<p>Link ile ilgili sorun yaşıyorsanız, kopyalayıp tarayıcınıza da yapıştırabilirsiniz.</p><p>Saygılarımızla,</p><p>test test test</p></div></body></html>"""

# Capture the token in the link text: everything after "confirmation_token="
# up to the closing "<". Quotes and whitespace are excluded so that the copy
# inside the href attribute is not swallowed into the match.
pattern = r'confirmation_token=([^"<\s]+)<'
find_list = re.findall(pattern, test)
print(find_list)
# ['XXXXXX']
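When the string may contain no link at all, `re.search` can be easier to handle than `re.findall`, since it returns `None` when nothing matches. Here is a rough sketch that captures both the full URL and the token. The function name `find_confirmation` and the shortened `sample` string are hypothetical, and the pattern assumes the token contains only letters, digits, "-" and "_".

```python
import re

# Group 1 is the full URL, group 2 is the token. The token character class is
# an assumption; widen it if your tokens contain other characters.
CONFIRMATION_RE = re.compile(
    r'(https?://[^\s"<>]+\?confirmation_token=([A-Za-z0-9_-]+))'
)

def find_confirmation(text):
    """Return (url, token) for the first confirmation link, or None if there is none."""
    match = CONFIRMATION_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)

# Hypothetical, shortened sample input
sample = '<p><a href="#">https://test.com/users/confirmation?confirmation_token=XXXXXX</a></p>'
print(find_confirmation(sample))
# ('https://test.com/users/confirmation?confirmation_token=XXXXXX', 'XXXXXX')
print(find_confirmation("<p>no link here</p>"))
# None
```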