I want to match the words and/or that are in between evenly matched sets of double quotes

Time:05-03

I currently have a regex that works when parentheses are used to wrap groups.

The Regex

/((?<!\()\b(and|or)\b(?![\w\s]*[\)]))/gi

The String

contract type is (exhibit and a) and party name is (pearl and jam) or late payment fee is 15 and party name is (sony or sons)

The and / or outside the parentheses (bolded in the original post) are what I want to match. I do not want to match the ones inside the parentheses (italicized in the original post). The above is currently working.


I am trying to get the above result using double quotes instead of () but have not been able to make any progress.

The Regex

/((?<!\")\b(and|or)\b(?![\w\s]*[\"]))/gi

The String

contract type is "exhibit and a" and party name is "pearl and jam" or late payment fee is 15 and party name is "sony or sons"

I am getting no matches, and that makes sense to me because all of my and / or are surrounded by quotes. My idea was to somehow refactor my regex to match and / or if the left-side occurrence of a quote is odd or the right-side occurrence is even, but I have not found anything that suggests this is possible.

Any help will be much appreciated. I'll continue to post updates as I make progress on the regex myself.

CodePudding user response:

Your regular expression matches "and" or "or" unless the word is immediately preceded by "(" or is followed by a run of word characters and whitespace that ends in ")". Because such a run cannot contain "(", a ")" found this way must (assuming balanced parentheses) close a group that was opened before the word, so the word is inside parentheses and no match should be made. That approach cannot be extended to double (or single) quotes because the beginning and ending quotes are the same character.
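To make the difference concrete, here is a small sketch that runs the two patterns from the question against the two strings from the question. The variable names are mine and purely illustrative. In the quoted string, every and / or outside quotes happens to be followed, after only word characters and spaces, by an opening quote, which the lookahead cannot tell apart from a closing one.

```javascript
// The asker's two patterns, copied from the question.
const parenPattern = /((?<!\()\b(and|or)\b(?![\w\s]*[\)]))/gi;
const quotePattern = /((?<!\")\b(and|or)\b(?![\w\s]*[\"]))/gi;

const parenInput = 'contract type is (exhibit and a) and party name is (pearl and jam) or late payment fee is 15 and party name is (sony or sons)';
const quoteInput = 'contract type is "exhibit and a" and party name is "pearl and jam" or late payment fee is 15 and party name is "sony or sons"';

// With a global regex, String.prototype.match returns every full match,
// or null when there is none, hence the fallback to an empty array.
const parenMatches = parenInput.match(parenPattern) || [];
const quoteMatches = quoteInput.match(quotePattern) || [];

console.log(parenMatches); // [ 'and', 'or', 'and' ]
console.log(quoteMatches); // []
```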

What you can do is match the regular expression

"[^"]*"|\b(and|or)\b

Let's look at what is matched and captured in an example string.

'type is "exhibit and a" and name is "pearl and jam" or 15 and "sony or sons"'
         mmmmmmmmmmmmmmm mmm         mmmmmmmmmmmmmmm mm    mmm mmmmmmmmmmmmmm
                         ccc                         cc    ccc

The strings that are matched are marked with 'm's. The strings that are captured are marked with 'c's. We are only interested in the captured strings, so we can simply disregard any match that captures nothing. You will of course have to do that in code, but it should be quite simple regardless of the language that you are using.
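As one possible way to do that filtering, here is a minimal JavaScript sketch, assuming a runtime that supports `String.prototype.matchAll`. The helper name `findOperatorsOutsideQuotes` is hypothetical. It keeps a match only when capture group 1 is defined, which is exactly the "captured" case marked with 'c's above.

```javascript
// Quoted runs are matched but not captured; a bare and/or lands in group 1.
const pattern = /"[^"]*"|\b(and|or)\b/gi;

const input = 'type is "exhibit and a" and name is "pearl and jam" or 15 and "sony or sons"';

// Hypothetical helper: returns the captured words with their positions.
function findOperatorsOutsideQuotes(text) {
  const hits = [];
  for (const m of text.matchAll(pattern)) {
    // m[1] is undefined when the quoted-run alternative matched.
    if (m[1] !== undefined) {
      hits.push({ word: m[1], index: m.index });
    }
  }
  return hits;
}

const hits = findOperatorsOutsideQuotes(input);
console.log(hits.map(h => h.word)); // [ 'and', 'or', 'and' ]
```

Keeping the index alongside each word is a design choice for cases where the operators need to be replaced or split on later; drop it if only the words matter.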

The first match begins with the first double-quote and extends to the next double-quote. That match is not captured. The regex engine's string pointer is now between the second double quote and the space that follows. It then attempts to match that space and fails. It then successfully matches and captures "and", and so on.

Note that "[^"]*" can be replaced with ".*?", provided the quoted text contains no line breaks (by default, . does not match a line terminator). The latter reads, "match a double-quote followed by zero or more characters, lazily (?), followed by a double-quote". A lazy (non-greedy) match matches as few characters as possible.
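The following sketch compares the two forms in JavaScript. The helper `captured` and the two-line test string are my own illustrations, not part of the original answer. On a single-line input both forms capture the same words; they differ only when a quoted phrase spans a line break, because . does not match a newline unless the `s` (dotAll) flag is set.

```javascript
const negated = /"[^"]*"|\b(and|or)\b/gi;   // quoted run via negated character class
const lazy = /".*?"|\b(and|or)\b/gi;        // quoted run via lazy dot
const lazyDotAll = /".*?"|\b(and|or)\b/gis; // lazy dot with the s (dotAll) flag

// Hypothetical helper: keep only the matches where group 1 captured a word.
function captured(pattern, text) {
  return [...text.matchAll(pattern)]
    .filter(m => m[1] !== undefined)
    .map(m => m[1]);
}

const oneLine = 'type is "exhibit and a" and name is "pearl and jam" or 15 and "sony or sons"';
const twoLines = 'name is "pearl\nand jam" or 15'; // quoted phrase spans a newline

console.log(captured(negated, oneLine));     // [ 'and', 'or', 'and' ]
console.log(captured(lazy, oneLine));        // [ 'and', 'or', 'and' ]
console.log(captured(negated, twoLines));    // [ 'or' ]
console.log(captured(lazy, twoLines));       // [ 'and', 'or' ]
console.log(captured(lazyDotAll, twoLines)); // [ 'or' ]
```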

