How should I find a specific car's name with a regex?

Time:11-19

I need to extract the cars' full names from text like the sample below using a regex, but something odd happens: the pattern skips Mercedes-Benz entirely, and I don't understand why. This is the text:

'''
<span >Ford<!-- --> <!-- -->F-150</span>, <span >Mercedes-Benz<!-- --> <!-- -->C-Class</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Nissan<!-- --> <!-- -->Altima</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Ford<!-- --> <!-- -->Fusion</span>, <span >Toyota<!-- --> <!-- -->Tacoma</span>, <span >Kia<!-- --> <!-- -->Optima</span>, <span >Ford<!-- --> <!-- -->Fusion</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Nissan<!-- --> <!-- -->Rogue</span>
'''

And this is the regex:

'''
\>([A-Z]\w+)\<\!.+?\>.+?.+?\>(\w+).+?
'''

CodePudding user response:

# use regex to find all the make-model strings
import re

s = """<span >Ford<!-- --> <!-- -->F-150</span>, <span >Mercedes-Benz<!-- --> <!-- -->C-Class</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Nissan<!-- --> <!-- -->Altima</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Ford<!-- --> <!-- -->Fusion</span>, <span >Toyota<!-- --> <!-- -->Tacoma</span>, <span >Kia<!-- --> <!-- -->Optima</span>, <span >Ford<!-- --> <!-- -->Fusion</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Ford<!-- --> <!-- -->F-150</span>, <span >Nissan<!-- --> <!-- -->Rogue</span>
"""

reg = re.compile(r'<span >(.*?)</span>')

# replace the HTML comment separators between make and model with a single space
cars = [x.replace("<!-- --> <!-- -->"," ") for x in reg.findall(s)]

print(cars)
print(len(cars))

Result:

['Ford F-150', 'Mercedes-Benz C-Class', 'Ford F-150', 'Nissan Altima', 'Ford F-150', 'Ford Fusion', 'Toyota Tacoma', 'Kia Optima', 'Ford Fusion', 'Ford F-150', 'Ford F-150', 'Ford F-150', 'Nissan Rogue']
13

CodePudding user response:

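It's because your regex does not match the - character. \w only covers letters, digits and the underscore (for ASCII text it is equivalent to [a-zA-Z0-9_]), so the first group stops at the hyphen in Mercedes-Benz and the rest of the pattern can no longer match.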

You can instead use this regex:

\>([A-Z][\w-]+)\<\!.+?\>.+?.+?\>(\w+).+?
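
As a quick check, here is a minimal sketch that runs the question's pattern against a shortened sample, assuming the quantifier after each character class is `+`. It also tries a variant that goes one step beyond the regex above: the second group there still uses \w, so a model such as F-150 or C-Class would be cut at the hyphen, and the variant allows hyphens in both groups. The names `html`, `original`, `fixed` and `cars` are only illustrative.

```python
import re

# Shortened sample in the same shape as the text from the question.
html = (
    '<span >Ford<!-- --> <!-- -->F-150</span>, '
    '<span >Mercedes-Benz<!-- --> <!-- -->C-Class</span>, '
    '<span >Nissan<!-- --> <!-- -->Altima</span>'
)

# Question's pattern (assumed quantifiers: +). \w+ cannot cross the hyphen
# in "Mercedes-Benz", so that span never matches at all.
original = re.compile(r'>([A-Z]\w+)<!.+?>.+?.+?>(\w+).+?')

# Variant: allow hyphens in both the make and the model.
fixed = re.compile(r'>([A-Z][\w-]+)<!.+?>.+?.+?>([\w-]+)')

print(original.findall(html))
# [('Ford', 'F'), ('Nissan', 'Altima')]

cars = [f"{make} {model}" for make, model in fixed.findall(html)]
print(cars)
# ['Ford F-150', 'Mercedes-Benz C-Class', 'Nissan Altima']
```

Inside a character class, a trailing `-` is treated as a literal hyphen, which is why `[\w-]` needs no escaping.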