Regex match all words except those between quotes

Time:11-07

In this example I want to select all words, except those between quotes (i.e. "results", "items", "packages", "settings" and "build_type", but not "compiler.version").

results[0].items[0].packages[0].settings["compiler.version"] 
results[0].items[0].packages[0].settings.build_type

Here's what I know: I can target all words with

[a-z_]+

and then target what's in between quotes with this:

(?<=\")[\w.]+(?=\")

Is there any way to match the difference between the results of the first and second regex? (i.e. words except if they are surrounded by double quotes)
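For reference, here is a minimal Python sketch of what the two patterns return on the sample input. It assumes Python's `re` module as the engine and assumes each character class carries a `+` quantifier; the variable names are illustrative only.

```python
import re

text = (
    'results[0].items[0].packages[0].settings["compiler.version"]\n'
    'results[0].items[0].packages[0].settings.build_type\n'
)

# First pattern: every run of lowercase letters and underscores,
# whether or not it sits inside quotes.
all_words = re.findall(r'[a-z_]+', text)

# Second pattern: a run of word characters or dots that is directly
# preceded and followed by a double quote.
quoted = re.findall(r'(?<=")[\w.]+(?=")', text)

print(all_words)
# ['results', 'items', 'packages', 'settings', 'compiler', 'version',
#  'results', 'items', 'packages', 'settings', 'build_type']
print(quoted)
# ['compiler.version']
```

The goal is the first list without `compiler` and `version`.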

Here's a regex playground with the example for convenience

CodePudding user response:

You can match strings between double quotes without capturing them, and then match and capture words optionally followed by dot-separated words:

list(filter(None, re.findall(r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)', text, re.ASCII | re.I)))

See the regex demo. Details:

  • "[^"]*" - a " char, zero or more chars other than " and then a " char
  • | - or
  • ([a-z_]\w*(?:\.[a-z_]\w*)*) - Group 1: a letter or underscore followed by zero or more word chars, and then zero or more sequences of a . followed by a letter or underscore and zero or more word chars.

See the Python demo:

import re
text = 'results[0].items[0].packages[0].settings["compiler.version"] '
print(list(filter(None, re.findall(r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)', text, re.ASCII | re.I))))
# => ['results', 'items', 'packages', 'settings']

The re.ASCII option is used to make \w match [a-zA-Z0-9_] without accounting for Unicode chars.
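As a rough sketch of how this pattern behaves on both sample lines from the question: the optional dot-separated part returns `settings.build_type` as a single match. If separate words are wanted, one possible variant drops that part. The names `dotted`, `single` and `unquoted_words` are hypothetical, and the `single` pattern is an assumption, not part of the answer above.

```python
import re

lines = [
    'results[0].items[0].packages[0].settings["compiler.version"]',
    'results[0].items[0].packages[0].settings.build_type',
]

# Pattern from the answer: quoted strings are consumed without capturing,
# unquoted (optionally dot-separated) words are captured in group 1.
dotted = re.compile(r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)', re.ASCII | re.I)

# Hypothetical variant (not from the answer): no dot-separated tail,
# so every word is returned on its own.
single = re.compile(r'"[^"]*"|([a-z_]\w*)', re.ASCII | re.I)

def unquoted_words(pattern, text):
    # findall yields '' whenever the quoted alternative matched,
    # because group 1 did not participate; drop those entries.
    return [word for word in pattern.findall(text) if word]

for line in lines:
    print(unquoted_words(dotted, line))
    print(unquoted_words(single, line))
# ['results', 'items', 'packages', 'settings']
# ['results', 'items', 'packages', 'settings']
# ['results', 'items', 'packages', 'settings.build_type']
# ['results', 'items', 'packages', 'settings', 'build_type']
```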

CodePudding user response:

I believe this is the cleaner/simpler version of the solution you were searching for:

(?<!\")\b[a-z_]+\b(?!\")

Here's a demo

Please let me know if this was helpful/if this was what you wanted!
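A quick way to try this in Python, assuming the character class is meant to be `[a-z_]+`. The sample strings and variable names below are illustrative. Note that the lookarounds only inspect the characters immediately next to a word, so a word in the middle of a quoted string containing spaces still matches.

```python
import re

# Word that is neither directly preceded nor directly followed by a quote.
pattern = re.compile(r'(?<!")\b[a-z_]+\b(?!")')

sample = (
    'results[0].items[0].packages[0].settings["compiler.version"]\n'
    'results[0].items[0].packages[0].settings.build_type'
)
print(pattern.findall(sample))
# ['results', 'items', 'packages', 'settings',
#  'results', 'items', 'packages', 'settings', 'build_type']

# Limitation: 'two' does not touch a quote, so it is matched
# even though it sits inside the quoted string.
tricky = 'opts["one two three"]'
print(pattern.findall(tricky))
# ['opts', 'two']
```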

CodePudding user response:

A word is not within a double-quoted substring if and only if it is followed on its line by an even number of double quotes (assuming the line is properly formatted and therefore contains an even number of double quotes). You can use the following regular expression to match words that are not contained within double-quoted substrings.

[a-z_]+(?=(?:(?:[^\"\n]*\"){2})*[^\"\n]*\n)

Demo

The regular expression can be broken down as follows (alternatively, hover the cursor over each part of the expression at the link to obtain an explanation of its function).

[a-z_]+         # match one or more of the indicated characters
(?=             # begin a positive lookahead
  (?:           # begin an outer non-capture group
    (?:         # begin an inner non-capture group
      [^\"\n]*  # match zero or more characters other than " and \n 
      \"        # match "
    ){2}        # end inner non-capture group and execute twice
  )*            # end outer non-capture group and execute zero or more times
  [^\"\n]*      # match zero or more characters other than " and \n 
  \n            # match a newline
)               # end positive lookahead

\n should be replaced by (?:\n|$) if the last line may not have a line terminator.
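A minimal Python sketch of this approach, using the `(?:\n|$)` form so that a last line without a terminator is handled. It assumes the character class carries a `+` quantifier; the names `text` and `outside_quotes` are illustrative.

```python
import re

text = (
    'results[0].items[0].packages[0].settings["compiler.version"]\n'
    'results[0].items[0].packages[0].settings.build_type'  # no trailing newline
)

# A word matches only if an even number of double quotes (possibly zero)
# lies between it and the end of its line.
outside_quotes = re.compile(
    r'[a-z_]+(?=(?:(?:[^"\n]*"){2})*[^"\n]*(?:\n|$))'
)

print(outside_quotes.findall(text))
# ['results', 'items', 'packages', 'settings',
#  'results', 'items', 'packages', 'settings', 'build_type']
```

Unlike a check of the adjacent characters only, this counts every quote up to the end of the line, so words in the middle of a quoted string with spaces are excluded as well.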
