Regex bypass some punctuations from text

Time:03-14

I have a text like this:

text = 'hello, how are you?'

I want to extract hello, how from the text, but searching for the phrase without the comma finds nothing:

re.search('hello how', text)
>>> None

In case you are wondering why I don't just put the comma in the pattern: the pattern comes from another source as the input regex, and that input has no punctuation, while the text I search in does. So I want the regex to skip over punctuation, for example the , after hello.

_________________________                 \     ______________________________________
| Input Regex           |      ------------\    | Text from which I have to extract  |
| (Does not have puncs) |      ------------/    | (have punctuations)                |
| For ex. (hello how)   |                 /     | For ex. (hello, how are you?)      |
_________________________                       ______________________________________

The output of the search should look like
>>> 'hello, how' (the output should have punctuations)

I cannot simply remove all punctuation from a text like 'hello, how are you?', because it may contain essential punctuation that I must keep. I only want the regex to skip the , after hello.

The input regex and the text can be anything. One more example:

input_regex = 'Google LLC'
text = 'Google, LLC. is an American multinational technology company.'
# so the output should be
>>> 'Google, LLC.' # with punctuations

So is there any way to skip this punctuation without deleting all punctuation from the entire text? Thanks!

CodePudding user response:

If it is fine to keep any punctuation and spacing (that is, every character that is not a letter, digit, or underscore), you can use [^\w]* (equivalent to \W*) between and after the words you search for.

match = re.search(r"Google[^\w]*LLC[^\w]*", text)
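The pattern above is written by hand for one input. As a minimal sketch, it could be built from any input phrase. The helper name search_ignoring_punctuation is hypothetical, and two choices are assumptions rather than part of this answer: the input words are escaped and matched literally, and the trailing part uses [^\w\s]* instead of [^\w]* so that the match does not swallow the space after the last word.

```python
import re

def search_ignoring_punctuation(needle, haystack):
    """Hypothetical helper: find the words of `needle` in `haystack`,
    tolerating punctuation and spacing between and after them."""
    # Assumption: the input consists of plain words, so each one is escaped.
    words = [re.escape(word) for word in needle.split()]
    # \W+ bridges the words; [^\w\s]* picks up trailing punctuation only.
    pattern = r"\W+".join(words) + r"[^\w\s]*"
    return re.search(pattern, haystack)

text = 'Google, LLC. is an American multinational technology company.'
match = search_ignoring_punctuation('Google LLC', text)
print(match.group(0) if match else None)
```

With the example text from the question this prints Google, LLC. including both punctuation marks. If the input really is a regex and not plain words, the re.escape call would have to be dropped.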

CodePudding user response:

I split both strings into words and check that each word of the text starts with the corresponding word of the input.

function deepSearch(input, text) {
  const [chunkInput, chunkText] = [input.split(' '), text.split(' ')];
  
  for (let i = 0; i < chunkInput.length; ++i) {
    // The text may have fewer words than the input, hence the optional chaining.
    if (!chunkText[i]?.startsWith(chunkInput[i])) {
      return false;
    }
  }
  return true;
}
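
The function above only compares from the first word of the text and returns a boolean. Since the question is about Python and wants the matched text back, here is a rough Python sketch of the same word-by-word idea that slides over the text and returns the matching words. The name find_phrase_by_words is hypothetical. Two caveats apply: startswith also accepts longer words (how would match however), and the result is rejoined with single spaces, so unusual spacing in the text is not preserved.

```python
def find_phrase_by_words(needle, haystack):
    """Hypothetical helper: return the run of words in `haystack` whose
    words start with the words of `needle`, or None if there is none."""
    needle_words = needle.split()
    hay_words = haystack.split()
    count = len(needle_words)
    # Try every position where the whole phrase could still fit.
    for start in range(len(hay_words) - count + 1):
        window = hay_words[start:start + count]
        if all(h.startswith(n) for n, h in zip(needle_words, window)):
            return " ".join(window)
    return None

print(find_phrase_by_words('hello how', 'hello, how are you?'))
```

For the first example in the question this prints hello, how.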

CodePudding user response:

You could automatically modify the search pattern by allowing a comma before each space, i.e. when searching for Google LLC you seem to actually want to search for Google,? LLC.

The question mark in a regex means "zero or one occurrence" of the preceding element, here the comma.
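
To make the quantifier concrete, here is a small check of the pattern Google,? LLC against three candidate strings. The candidates are made up for illustration.

```python
import re

# ",?" makes the comma optional: zero or one comma is accepted, not two.
pattern = r"Google,? LLC"

for candidate in ["Google LLC", "Google, LLC", "Google,, LLC"]:
    matched = re.fullmatch(pattern, candidate) is not None
    print(f"{candidate!r}: {matched}")
```

The first two candidates match and the third does not, because ? allows at most one comma.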

A simple solution could be:

import re

def searchWithOptionalCommas(needle, haystack):
    # Allow an optional comma before every space in the search pattern.
    needle = needle.replace(' ', ',? ')
    return re.search(needle, haystack)