Filter words in Regular expression

Time:12-16

So, quite recently I was introduced to regular expressions in Python, and I came across some code online that filters a list of strings, keeping the words whose leading part is contained in another string.

import re

def Filter(string, substr):
    return [str for str in string
            if re.match(r'[^\d]+|^', str).group(0) in substr]

It seems pretty straightforward and it works well for the specific problem I'm facing, but I really can't wrap my head around what it means or how it works. It just seems very confusing. Can anyone explain it to me as if I were a baby or something? My coding skills are not that great, and I'm still a rookie.

Just to be clear, the code works, and I'm happy to move on, I just don't understand this bit.

CodePudding user response:

[^\d] matches any character that isn't a numeric digit; this can also be written as \D.

+ after a pattern means to match a sequence of one or more characters that match the pattern, so [^\d]+ matches a sequence of non-digits.

| separates alternative patterns to match.

The second alternative ^ matches the beginning of the string. Every string will match this. I think they use this just to avoid the match failing, so that you can always call .group(0) on the result. They could accomplish the same thing by changing + to * in the first alternative, since this means that the matched sequence can be 0 repetitions.

re.match() looks for a match of the regexp at the beginning of the argument string. And .group(0) returns what was matched by the entire regexp. So this whole thing returns the initial sequence of non-digits in str.
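To see this concretely, here is a small sketch that prints the leading non-digit prefix for a few made-up inputs. It assumes the pattern in the question is r'[^\d]+|^' (one or more non-digits, or the empty match at the start), and the helper names prefix_original and prefix_simplified are hypothetical, introduced only for this illustration.

```python
import re

def prefix_original(s):
    # Assumed pattern from the question: one or more non-digits,
    # falling back to the empty match at the start of the string.
    return re.match(r'[^\d]+|^', s).group(0)

def prefix_simplified(s):
    # Simplified pattern: zero or more non-digits.
    return re.match(r'\D*', s).group(0)

# Hypothetical sample inputs, chosen only to illustrate the behaviour.
for s in ["abc123", "foo", "42bar", ""]:
    print(repr(s), '->', repr(prefix_original(s)), repr(prefix_simplified(s)))
```

For these inputs both helpers return the same prefix, and the prefix is empty whenever the string starts with a digit or is itself empty.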

Finally, the list comprehension returns any of the items in string whose initial sequence of non-digits is in substr.

With the simplifications I mentioned above, this can be rewritten:

def Filter(string, substr):
    return [item for item in string
            if re.match(r'\D*', item).group(0) in substr]

Note that if any of the items begin with a digit, the result of the regexp will be an empty string, and an empty string is a substring of every string. So these items will be included in the filter result. I suspect this is not the intended result.
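Here is a rough sketch of that pitfall, assuming substr is a plain string (if it were a list, '' in substr would only be true when the list actually contains an empty string). The names filter_items and filter_items_strict and the sample data are hypothetical. The strict variant is just one possible way to skip items whose prefix is empty, not necessarily what the original author intended.

```python
import re

def filter_items(items, substr):
    # Same logic as the rewritten Filter above.
    return [item for item in items
            if re.match(r'\D*', item).group(0) in substr]

def filter_items_strict(items, substr):
    # Hypothetical variant: ignore items whose non-digit prefix is empty.
    result = []
    for item in items:
        prefix = re.match(r'\D*', item).group(0)
        if prefix and prefix in substr:
            result.append(item)
    return result

items = ["abc1", "9lives", "xyz2"]
print(filter_items(items, "abcdef"))         # ['abc1', '9lives']
print(filter_items_strict(items, "abcdef"))  # ['abc1']
```

"9lives" slips through the first version because its prefix is the empty string, and '' in "abcdef" is True.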

CodePudding user response:

I will try to explain this for you.

So basically we are defining a function named "Filter" and passing it two arguments, i.e. "string" (the list of items to filter) and "substr" (the text each item's prefix is looked up in). Then we use re.match inside the return statement, as the if condition of a list comprehension (the for part of the comprehension walks through the items of string one by one). As for r'[^\d]+|^': this is a regular expression pattern where \d is the regex pattern for a digit, [^\d] means any character that is not a digit, and + means one or more. The parentheses around the pattern belong to the re.match call; they are not a capturing group, and .group(0) returns the whole match.

re.match: re.match is a function that tries to match only at the beginning of the string and returns a match object (if found). However, if the pattern only matches somewhere in the middle, it will simply return None.
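A minimal sketch of that difference, using a made-up string and comparing re.match with re.search (which scans the whole string):

```python
import re

text = "abc123"  # hypothetical sample input

at_start = re.match(r'\d+', text)     # digits are not at the start
anywhere = re.search(r'\d+', text)    # search looks through the whole string
letters = re.match(r'[a-z]+', text)   # letters are at the start

print(at_start)           # None
print(anywhere.group(0))  # 123
print(letters.group(0))   # abc
```

This is also why the pattern in the question ends with |^: without that fallback, re.match would return None for items starting with a digit, and calling .group(0) on None would raise an AttributeError.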
