I need to extract each car's full name (make and model) from text like the one below using a regex, but something strange happens: the pattern skips Mercedes-Benz entirely. This is the text:
'''
<span >Ford<!-- --> <!-- -->F-150</span>, <span >Mercedes-Benz<!-- --> <!-- -->C-Class</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Nissan<!-- --> <!-- -->Altima</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Ford<!-- --> <!-- -->Fusion</span>, <span >Toyota<!-- --> <!-- -->Tacoma</span>, <span >Kia<!-- --> <!-- -->Optima</span>, <span >Ford<!-- --> <!-- -->Fusion</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Nissan<!-- --> <!-- -->Rogue</span>
'''
and this is the regex:
'''
\>([A-Z]\w+)\<\!.+?\>.+?.+?\>(\w+).+?
'''
CodePudding user response:
# use regex to find all the make-model strings
import re
s = """<span >Ford<!-- --> <!-- -->F-150</span>, <span >Mercedes-Benz<!-- --> <!-- -->C-Class</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Nissan<!-- --> <!-- -->Altima</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Ford<!-- --> <!-- -->Fusion</span>, <span >Toyota<!-- --> <!-- -->Tacoma</span>, <span >Kia<!-- --> <!-- -->Optima</span>, <span >Ford<!-- --> <!-- -->Fusion</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Nissan<!-- --> <!-- -->Rogue</span>
"""
reg = re.compile(r'<span >(.*?)</span>')
# replace the HTML comment separators between make and model with a single space
cars = [x.replace("<!-- --> <!-- -->"," ") for x in reg.findall(s)]
print(cars)
print(len(cars))
Result:
['Ford F-150', 'Mercedes-Benz C-Class', 'Ford F-150', 'Nissan Altima', 'Ford F-150', 'Ford Fusion', 'Toyota Tacoma', 'Kia Optima', 'Ford Fusion', 'Ford F-150', 'Ford F-150', 'Ford F-150', 'Nissan Rogue']
13
CodePudding user response:
It's because your regex does not match the - character (\w is essentially [a-zA-Z0-9_], which does not include the hyphen).
You can instead use this regex:
\>([A-Z][\w-]+)\<\!.+?\>.+?.+?\>(\w+).+?
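To check this, here is a minimal sketch that runs the patterns against a shortened sample of the text from the question. It assumes the quantifiers in both patterns are +. The third pattern, which also widens the model group to [\w-]+, is my own assumption and not part of the answer above; it seems to be needed because (\w+) stops at the hyphen in F-150 and C-Class.

```python
import re

# Shortened sample of the text from the question.
s = ('<span >Ford<!-- --> <!-- -->F-150</span>, '
     '<span >Mercedes-Benz<!-- --> <!-- -->C-Class</span>, '
     '<span >Nissan<!-- --> <!-- -->Altima</span>')

# Pattern from the question: \w+ cannot cross the hyphen in "Mercedes-Benz".
original = re.compile(r'\>([A-Z]\w+)\<\!.+?\>.+?.+?\>(\w+).+?')
# Pattern from the answer: the make group now accepts hyphens.
answer = re.compile(r'\>([A-Z][\w-]+)\<\!.+?\>.+?.+?\>(\w+).+?')
# Assumed extension: the model group accepts hyphens as well.
widened = re.compile(r'\>([A-Z][\w-]+)\<\!.+?\>.+?.+?\>([\w-]+).+?')

original_matches = original.findall(s)
answer_matches = answer.findall(s)
widened_matches = widened.findall(s)

# Join each (make, model) pair into a full name.
full_names = [f'{make} {model}' for make, model in widened_matches]

print(original_matches)
print(answer_matches)
print(full_names)
```

With the answer's pattern Mercedes-Benz is found, but the model comes back as F or C for the hyphenated models. The widened pattern gives ['Ford F-150', 'Mercedes-Benz C-Class', 'Nissan Altima'] for this sample.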