Regex matching every single blank space

Time:01-25

I have this regex snippet, used in Python:

(?<!\d)(\d??\d )?(hours|hour|hrs|hr|h)?( and |, )??(?<!\d)(\d??\d )?(minutes|minute|mins|min|m)?( and |, )?(?<!\d)(\d??\d )?(seconds|second|secs|sec|s)?

and I used regex101 to check it. It does not behave the way I want. For some reason, when I press Enter to start a new line, the regex produces a match.

For example, with 10 blank lines regex101 finds 10 matches, one at the beginning of each blank line. Also, if I type a run of spaces on a blank line, it matches at every single space. The match information on the right side of the page didn't help because it only showed "null".

I have tried both \s and a literal space (as shown in the regex above). Both give the same result: 10 blank lines, 10 matches. I can't work out why it matches every single blank space, or how to fix it.

Full piece of regex:

(([0][0-9]|[1][0-9]|[2][0-3]) ?:([0][0-9]|[1][0-9]|[2][0-9]|[3][0-9]|[4][0-9]|[5][0-9]) ?:([0][0-9]|[1][0-9]|[2][0-9]|[3][0-9]|[4][0-9]|[5][0-9]) ?)|((?<!\d|\w)(\d?\d\s) ?((hours)|(hour)|(hrs)|(hr)|(h)))?((\sand\s)|(,\s?))?((?<!\d|\w)(\d?\d\s) ?((minutes)|(minute)|(mins)|(min)|(m)))?((\sand\s)|(,\s?))?((?<!\d|\w)(\d?\d\s) ?((seconds)|(second)|(secs)|(sec)|(s)))?

If I type "a", regex101 matches "" before the "a"; if I type "5 hours", it matches "5 hours" and then "" after it. In VS Code, when the user types nothing, the code fails to raise the ValueError that should re-prompt the user for the time. The same thing happens if the user types a space or random text like "asdfasdf". I have tried to catch it with:

if test.group() == "":
    raise ValueError

It still fails to raise ValueError.

My desired results for the snippet include the following (only 12 shown; many more combinations are possible). Each should be matched as one whole group, instead of 2 groups for "5 hours" and 8 groups for "5 hours, 5 minutes and 5 seconds":

  1. 5 hours, 5 minutes and 5 seconds
  2. 5 hours and 5 minutes
  3. 5 hours and 5 seconds
  4. 5 hours, 5 minutes, 5 seconds
  5. 5 hrs, 5 mins, 5 secs
  6. 5 hours
  7. 5 minutes
  8. 5 seconds
  9. 5 hours and 5 minutes
  10. 5 minutes and 5 seconds
  11. 5 minutes, 5 seconds
  12. 5 hours, 5 seconds

CodePudding user response:

You can match:

  • hours with (\d+\s(?:hours|hour|hrs|hr|h))?
  • minutes with (\d+\s(?:minutes|minute|mins|min|m))?
  • seconds with (\d+\s(?:seconds|second|secs|sec|s))?
  • separators with (?:, | and )

If you combine these together, you get the regex you were looking for:

(\d+\s(?:hours|hour|hrs|hr|h))?(?:, | and )?(\d+\s(?:minutes|minute|mins|min|m))?(?:, | and )?(\d+\s(?:seconds|second|secs|sec|s))?

Then you need to extract your groups. You can check the following Python code:

import re

strings = [
    '5 hours, 5 minutes and 5 seconds',
    '5 hours and 5 minutes',
    '5 hours and 5 seconds',
    '5 hours, 5 minutes, 5 seconds',
    '5 hrs, 5 mins, 5 secs',
    '5 hours',
    '5 minutes',
    '5 seconds',
    '5 hours and 5 minutes',
    '5 minutes and 5 seconds',
    '5 minutes, 5 seconds',
    '5 hours, 5 seconds'
]

pattern = r'(\d+\s(?:hours|hour|hrs|hr|h))?(?:, | and )?(\d+\s(?:minutes|minute|mins|min|m))?(?:, | and )?(\d+\s(?:seconds|second|secs|sec|s))?'

print([re.search(pattern, string).groups() for string in strings])

#for string in strings:
#    match = re.search(pattern, string)
#    if match:
#        print(match.group())

Check the regex demo and the python demo.
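
Note that the combined pattern can still match the empty string, because all three groups are optional. Below is a minimal sketch of one way to reject such input. It assumes the quantifier is \d+ and that the whole input is meant to be a duration; extract_parts is a hypothetical helper name used only for illustration.

```python
import re

# Same structure as the combined pattern above, split over three lines.
pattern = re.compile(
    r'(\d+\s(?:hours|hour|hrs|hr|h))?(?:, | and )?'
    r'(\d+\s(?:minutes|minute|mins|min|m))?(?:, | and )?'
    r'(\d+\s(?:seconds|second|secs|sec|s))?'
)

def extract_parts(text):
    """Return (hours, minutes, seconds) groups or raise ValueError."""
    # fullmatch rejects input with leftover characters such as 'asdfasdf'.
    match = pattern.fullmatch(text.strip())
    # An empty input still matches, but then every group is None.
    if match is None or not any(match.groups()):
        raise ValueError(f'not a duration: {text!r}')
    return match.groups()

print(extract_parts('5 hours and 5 seconds'))
```

Checking that at least one group captured something is what separates a real duration from an empty match here.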

CodePudding user response:

That's because your regex is allowed to match nothing. Let's rewrite your regex:

(?<!\d)(\d??\d )?(hours|hour|hrs|hr|h)?( and |, )??(?<!\d)(\d??\d )?(minutes|minute|mins|min|m)?( and |, )?(?<!\d)(\d??\d )?(seconds|second|secs|sec|s)?

into something easier to read:

(...)?(...)?(...)??(...)? ...

Every group is optional, so the pattern as a whole can match the empty string.

It does not only match empty lines: given a line of spaces, it produces an empty match at every space.
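
The behaviour can be reproduced outside regex101. The following is a small sketch using Python's re.finditer with the regex from the question; the variable names are chosen here for illustration only.

```python
import re

# The regex from the question: every group is optional.
question_regex = re.compile(
    r'(?<!\d)(\d??\d )?(hours|hour|hrs|hr|h)?( and |, )??'
    r'(?<!\d)(\d??\d )?(minutes|minute|mins|min|m)?( and |, )?'
    r'(?<!\d)(\d??\d )?(seconds|second|secs|sec|s)?'
)

# Three spaces: an empty match at each position, including the end.
on_spaces = [m.group() for m in question_regex.finditer('   ')]
print(on_spaces)

# A real duration is matched, followed by one more empty match at the end.
on_duration = [m.group() for m in question_regex.finditer('5 hours')]
print(on_duration)
```

Because re.search also succeeds with an empty match, testing only whether a match object exists cannot distinguish valid input from blank input.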

The solution is to replace the chain of optional groups (...)?(...)? with an alternation (...)|(...).

How to do that depends on your requirements. You have not stated them, so here is an example based on my guess:

(?<!\d)(\d?\d )|(hours|hour|hrs|hr|h)|( and |, )??(?<!\d)(\d??\d )|(minutes|minute|mins|min|m)|( and |, )|(?<!\d)(\d??\d )|(seconds|second|secs|sec|s)
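
As a different illustration of the alternation idea, here is a hedged sketch in which every branch must consume at least one full component, so the empty string can never match. It assumes the components always appear in the order hours, minutes, seconds and that the whole input should be a duration; validate_duration and the constant names are hypothetical.

```python
import re

# One or two digits, a space, then a unit name.
H = r'\d{1,2} (?:hours|hour|hrs|hr|h)'
M = r'\d{1,2} (?:minutes|minute|mins|min|m)'
S = r'\d{1,2} (?:seconds|second|secs|sec|s)'
SEP = r'(?: and |, )'

# Each branch starts with a mandatory component, so no branch can be empty.
DURATION = re.compile(
    rf'{H}(?:{SEP}{M})?(?:{SEP}{S})?'  # hours, then optional minutes/seconds
    rf'|{M}(?:{SEP}{S})?'              # minutes, then optional seconds
    rf'|{S}'                           # seconds only
)

def validate_duration(text):
    """Return the matched duration text or raise ValueError."""
    match = DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f'not a duration: {text!r}')
    return match.group()

print(validate_duration('5 hours, 5 minutes and 5 seconds'))
```

With this shape, blank input, a single space, and random text all raise ValueError, which a caller could catch to re-prompt the user.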