I have a large list in Python that contains strings in the following format:
some_array = ['MATH_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
              'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
              'ART_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE']
I just need to extract the substrings that start with MATH, SCIENCE and ART. This is what I'm currently using:
import re

# re.findall expects a single string, so each element is searched in turn
for s in some_array:
    my_str = re.findall('MATH_.*? ', s)
    if len(my_str) > 0:
        print(my_str)
    my_str = re.findall('SCIENCE_.*? ', s)
    if len(my_str) > 0:
        print(my_str)
    my_str = re.findall('ART_.*? ', s)
    if len(my_str) > 0:
        print(my_str)
It seems to work, but I was wondering if the findall function can look for more than one substring in the same line or maybe there is a cleaner way of doing it with another function. Thanks.
CodePudding user response:
You can use | to match multiple alternatives in a regular expression.
re.findall('(?:MATH|SCIENCE|ART)_.*? ', ...)
You could also use str.startswith along with a list comprehension. Note that this keeps each matching string whole rather than extracting only its first part.
res = [x for x in some_array if any(x.startswith(prefix)
                                    for prefix in ('MATH', 'SCIENCE', 'ART'))]
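A minimal sketch comparing the two approaches on the question's sample data. The HISTORY entry is made up here to show a non-matching string, and the names substrings and whole_strings are only for illustration. str.startswith also accepts a tuple of prefixes, which is used below in place of any().

```python
import re

some_array = ['MATH_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
              'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
              'ART_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
              'HISTORY_SOME_TEXT MORE_TEXT SOME_VALUE']  # hypothetical non-matching entry

# re.findall works on one string at a time, so apply it to each element
# and flatten the per-string results into a single list.
substrings = [m for s in some_array
              for m in re.findall(r'(?:MATH|SCIENCE|ART)_.*? ', s)]

# startswith with a tuple of prefixes filters the list but keeps whole strings.
whole_strings = [s for s in some_array
                 if s.startswith(('MATH', 'SCIENCE', 'ART'))]

print(substrings)
print(whole_strings)
```

The regex version returns only the leading token of each string (including the trailing space), while the startswith version returns the three matching strings unchanged.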
CodePudding user response:
You could also match optional non-whitespace characters after one of the alternatives, starting with a word boundary to prevent a partial word match. Add a single trailing space to the pattern if the space should be part of the match:
\b(?:MATH|SCIENCE|ART)_\S*
Or, if only word characters (\w) should follow:
\b(?:MATH|SCIENCE|ART)_\w*
Example
import re
some_array = ['MATH_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE',
'ART_SOME_TEXT_AND_NUMBER MORE_TEXT SOME_VALUE']
pattern = re.compile(r"\b(?:MATH|SCIENCE|ART)_\S* ")
for s in some_array:
    print(pattern.findall(s))
Output
['MATH_SOME_TEXT_AND_NUMBER ']
['SCIENCE_SOME_TEXT_AND_NUMBER ']
['ART_SOME_TEXT_AND_NUMBER ']
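To see how the \S* and \w* variants and the word boundary differ, here is a small sketch. The sample string is made up for illustration and is not from the question's data.

```python
import re

# Hypothetical input: a hyphenated token, a word that merely contains "ART_",
# and a plain token.
sample = 'MATH_ALGEBRA-2 SMART_BOARD ART_101'

# \S* runs up to the next whitespace, so the hyphen is included.
non_ws = re.findall(r'\b(?:MATH|SCIENCE|ART)_\S*', sample)

# \w* stops at the first non-word character (here the hyphen).
word_only = re.findall(r'\b(?:MATH|SCIENCE|ART)_\w*', sample)

# Without \b, the "ART_" inside "SMART_BOARD" also matches.
no_boundary = re.findall(r'(?:MATH|SCIENCE|ART)_\w*', sample)

print(non_ws)       # ['MATH_ALGEBRA-2', 'ART_101']
print(word_only)    # ['MATH_ALGEBRA', 'ART_101']
print(no_boundary)  # ['MATH_ALGEBRA', 'ART_BOARD', 'ART_101']
```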