Python re.search to regex and extract from each item in a list

Time:08-30

I have a list of HTML tags from Beautiful Soup output. I want to extract the label text (the part before the colon) from each tag and collect the results in a list (spec_names).

li_tags = [<li>Brand: STIHL</li>, <li>Product: Chainsaw</li>,<li>Bar Length: 18 inch</li>, <li>Chain Brake: Yes</li>, <li>Weight: 14 pound</li>, <li>PoweredBy: Gas</li>]

I thought this would do it:

pattern = r'(?<=\<li\>).+?(?=\:)'
spec_names=[]
for x in li_tags:
    spec_names.append(re.search(pattern,x))

Also thought this would do it:

pattern = r'(?<=\<li\>).+?(?=\:)'
spec_names=[]
spec_names= [re.search(pattern,x) for x in li_tags]

There is a lot of online help on checking whether each list item matches, but I want to extract the matched text from inside each list item. The end result would be spec_names as:

['Brand', 'Product', 'Bar Length', 'Chain Brake', 'Weight', 'PoweredBy']

I am not looking for a function, but procedural steps. Thank you in advance.

CodePudding user response:

You can use re.findall:

import re

li_tags = ["<li>Brand: STIHL</li>", "<li>Product: Chainsaw</li>", "<li>Bar Length: 18 inch</li>", "<li>Chain Brake: Yes</li>", "<li>Weight: 14 pound</li>", "<li>PoweredBy: Gas</li>"]

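# Lazily match everything after "<li>" up to (not including) the first colon
pattern = r'(?<=\<li\>).+?(?=\:)'
spec_names = []
for li_tag in li_tags:
    # findall returns a list of matched strings; extend the result with it
    spec_names += re.findall(pattern, li_tag)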

print(spec_names)

Output:

['Brand', 'Product', 'Bar Length', 'Chain Brake', 'Weight', 'PoweredBy']
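
For reference, the loop from the question can also be kept with re.search. The following is only a minimal sketch: it assumes the items are plain strings (with real Beautiful Soup tags, str(tag) would be needed first), and the last list item is a hypothetical one added to show the no-match case. re.search returns a Match object or None rather than the text itself, so the text has to be pulled out with .group():

```python
import re

# Plain strings stand in for the tags; with real Beautiful Soup Tag objects,
# str(tag) would be needed before matching.
li_tags = [
    "<li>Brand: STIHL</li>",
    "<li>Product: Chainsaw</li>",
    "<li>Bar Length: 18 inch</li>",
    "<li>No colon here</li>",  # hypothetical item, added to show the no-match case
]

# Same idea as the pattern above: text after "<li>" up to the first colon
pattern = re.compile(r"(?<=<li>).+?(?=:)")

spec_names = []
for li_tag in li_tags:
    match = pattern.search(li_tag)  # a Match object, or None when nothing matches
    if match is not None:
        spec_names.append(match.group())  # group() returns the matched text

print(spec_names)  # ['Brand', 'Product', 'Bar Length']
```

Appending the result of re.search directly, as in the question, fills the list with Match objects (or None) instead of strings.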

CodePudding user response:

Regex is not the usual way to get text out of Beautiful Soup tags. Use the .text property instead:

from bs4 import BeautifulSoup

html_doc = """\
<li>Brand: STIHL</li>
<li>Product: Chainsaw</li>
<li>Bar Length: 18 inch</li>
<li>Chain Brake: Yes</li>
<li>Weight: 14 pound</li>
<li>PoweredBy: Gas</li>"""

soup = BeautifulSoup(html_doc, "html.parser")

li_tags = soup.select("li")

spec_names = [tag.text.split(":")[0] for tag in li_tags]
print(spec_names)

Prints:

['Brand', 'Product', 'Bar Length', 'Chain Brake', 'Weight', 'PoweredBy']
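
If the values are wanted as well as the names, the same split-at-the-colon idea extends to name/value pairs. This is a rough sketch rather than part of the answer above: li_texts is a hypothetical stand-in for the strings that tag.text would return, and the "Opening Hours" item is invented to show a value that itself contains a colon. str.partition splits at the first colon only, so such values stay intact:

```python
# Hypothetical stand-ins for the strings that tag.text would return
li_texts = ["Brand: STIHL", "Bar Length: 18 inch", "Opening Hours: 08:30"]

specs = {}
for text in li_texts:
    # partition splits at the first colon only and always returns three parts
    name, sep, value = text.partition(":")
    if sep:  # sep is empty when the text contains no colon
        specs[name.strip()] = value.strip()

print(specs)  # {'Brand': 'STIHL', 'Bar Length': '18 inch', 'Opening Hours': '08:30'}
```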