I have a list of HTML tags from Beautiful Soup output. I want to extract the spec name (the text before the colon) from each tag and put it into a list (spec_names).
li_tags = [<li>Brand: STIHL</li>, <li>Product: Chainsaw</li>,<li>Bar Length: 18 inch</li>, <li>Chain Brake: Yes</li>, <li>Weight: 14 pound</li>, <li>PoweredBy: Gas</li>]
I thought this would do it:
pattern = r'(?<=\<li\>).*?(?=\:)'
spec_names=[]
for x in li_tags:
    spec_names.append(re.search(pattern,x))
Also thought this would do it:
pattern = r'(?<=\<li\>).*?(?=\:)'
spec_names=[]
spec_names= [re.search(pattern,x) for x in li_tags]
There is a lot of online help on checking whether each list item is a match, but I want to extract the match from inside each list item. The end result should have spec_names as:
['Brand', 'Product', 'Bar Length', 'Chain Brake', 'Weight', 'PoweredBy']
I am not looking for a function, but procedural steps. Thank you in advance.
CodePudding user response:
You can use re.findall:
import re
li_tags = ["<li>Brand: STIHL</li>", "<li>Product: Chainsaw</li>", "<li>Bar Length: 18 inch</li>", "<li>Chain Brake: Yes</li>", "<li>Weight: 14 pound</li>", "<li>PoweredBy: Gas</li>"]
pattern = r'(?<=\<li\>).*?(?=\:)'
spec_names = []
for li_tag in li_tags:
    # extend (rather than assign) so matches from every tag accumulate
    spec_names.extend(re.findall(pattern, li_tag))
print(spec_names)
Output:
['Brand', 'Product', 'Bar Length', 'Chain Brake', 'Weight', 'PoweredBy']
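As for why the re.search attempts in the question did not give strings: re.search returns a match object (or None when nothing matches), so the matched text has to be pulled out with .group(). A minimal sketch of that variant, assuming the list items are either strings or bs4 Tag objects (str() turns a Tag back into its markup and leaves a string unchanged); the pattern is the same one without the unnecessary backslashes:

```python
import re

# Assumption: plain strings here; str() below would also cover bs4 Tag objects.
li_tags = [
    "<li>Brand: STIHL</li>",
    "<li>Product: Chainsaw</li>",
    "<li>Bar Length: 18 inch</li>",
    "<li>Chain Brake: Yes</li>",
    "<li>Weight: 14 pound</li>",
    "<li>PoweredBy: Gas</li>",
]
pattern = r'(?<=<li>).*?(?=:)'

spec_names = []
for li_tag in li_tags:
    match = re.search(pattern, str(li_tag))
    if match is not None:  # skip items that have no "<li>...:" part
        spec_names.append(match.group())  # .group() is the matched text

print(spec_names)
```

The None check matters because calling .group() on a failed search raises AttributeError.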
CodePudding user response:
You don't usually use regex to parse text from BeautifulSoup tags. Use the .text property:
from bs4 import BeautifulSoup
html_doc = """\
<li>Brand: STIHL</li>
<li>Product: Chainsaw</li>
<li>Bar Length: 18 inch</li>
<li>Chain Brake: Yes</li>
<li>Weight: 14 pound</li>
<li>PoweredBy: Gas</li>"""
soup = BeautifulSoup(html_doc, "html.parser")
li_tags = soup.select("li")
spec_names = [tag.text.split(":")[0] for tag in li_tags]
print(spec_names)
Prints:
['Brand', 'Product', 'Bar Length', 'Chain Brake', 'Weight', 'PoweredBy']
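One thing to keep in mind with split(":")[0]: if an item has no colon, it returns the whole text. If that can happen, or if you also want the values, str.partition may be a safer fit. A rough sketch, assuming li_texts holds the strings that tag.text would return; the last two entries are made-up edge cases, not from the question:

```python
# Hypothetical sample data: strings as tag.text would return them for each <li>,
# plus two invented edge cases (a value containing a colon, an item with no colon).
li_texts = [
    "Brand: STIHL",
    "Bar Length: 18 inch",
    "Warranty: 2 years: limited",
    "Free shipping",
]

specs = {}
for text in li_texts:
    name, sep, value = text.partition(":")  # splits at the first colon only
    if sep:  # sep is "" when the text has no colon, so such items are skipped
        specs[name.strip()] = value.strip()

print(specs)
```

With the real tags, list(specs) would then give the spec names in document order.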