In this example I want to select all words, except those between quotes (i.e. "results", "items", "packages", "settings" and "build_type", but not "compiler.version").
results[0].items[0].packages[0].settings["compiler.version"]
results[0].items[0].packages[0].settings.build_type
Here's what I know: I can target all words with
[a-z_]+
and then target what's in between quotes with this:
(?<=\")[\w.]+(?=\")
Is there any way to match the difference between the results of the first and second regex? (i.e. words except if they are surrounded by double quotes)
Here's a regex playground with the example for convenience
CodePudding user response:
You can match strings between double quotes, and then match and capture words optionally followed by dot-separated words:
list(filter(None, re.findall(r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)', text, re.ASCII | re.I)))
See the regex demo. Details:
"[^"]*" - a " char, zero or more chars other than ", and then a " char
| - or
([a-z_]\w*(?:\.[a-z_]\w*)*) - Group 1: a letter or underscore followed by zero or more word chars, and then zero or more sequences of a . followed by a letter or underscore and zero or more word chars.
See the Python demo:
import re
text = 'results[0].items[0].packages[0].settings["compiler.version"] '
print(list(filter(None, re.findall(r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)', text, re.ASCII | re.I))))
# => ['results', 'items', 'packages', 'settings']
The re.ASCII option is used to make \w match [a-zA-Z0-9_] without accounting for Unicode chars.
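To make the effect of re.ASCII concrete, here is a small sketch. The sample string is an assumption chosen for illustration and does not come from the question.

```python
import re

# Hypothetical sample with non-ASCII letters (written with escapes to avoid
# any ambiguity about how the accented characters are encoded).
sample = "caf\u00e9 na\u00efve"

# By default, \w in a str pattern also matches Unicode word characters.
unicode_words = re.findall(r"\w+", sample)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", sample, re.ASCII)

print(unicode_words)  # ['café', 'naïve']
print(ascii_words)    # ['caf', 'na', 've']
```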
CodePudding user response:
I believe this is the cleaner/simpler version of the solution you were searching for:
(?<!\")\b[a-z_]+\b(?!\")
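One way to check this pattern is the sketch below, which assumes the quantifier after [a-z_] is + and uses Python's re module. The second input, with three dot-separated parts inside the quotes, is a hypothetical case not taken from the question; it suggests the lookarounds only exclude words that directly touch a quote.

```python
import re

# Each word must not be directly preceded or followed by a double quote.
pattern = re.compile(r'(?<!")\b[a-z_]+\b(?!")')

question_line = 'results[0].items[0].packages[0].settings["compiler.version"]'
print(pattern.findall(question_line))
# ['results', 'items', 'packages', 'settings']

# Hypothetical input: the middle word touches no quote, so it is still matched.
three_parts = 'settings["a.b.c"]'
print(pattern.findall(three_parts))
# ['settings', 'b']
```

So the pattern is enough when a quoted key has at most two dot-separated parts, as in the question's example, but not for longer quoted keys.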
Please let me know if this was helpful/if this was what you wanted!
CodePudding user response:
A word is not within a double-quoted substring if and only if it is followed in the string by an even number of double quotes (assuming the string is properly formatted and therefore contains an even number of double quotes). You can use the following regular expression to match words that are not contained within double-quoted substrings.
[a-z_]+(?=(?:(?:[^\"\n]*\"){2})*[^\"\n]*\n)
The regular expression can be broken down as follows (alternatively, hover the cursor over each part of the expression at the link to obtain an explanation of its function).
[a-z_]+ # match one or more of the indicated characters
(?= # begin a positive lookahead
(?: # begin an outer non-capture group
(?: # begin an inner non-capture group
[^\"\n]* # match zero or more characters other than " and \n
\" # match "
){2} # end inner non-capture group and execute twice
)* # end outer non-capture group and execute zero or more times
[^\"\n]* # match zero or more characters other than " and \n
\n # match a newline
) # end positive lookahead
\n should be replaced by (?:\n|$) if the last line may not have a line terminator.
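As a rough sketch, the even-number-of-quotes lookahead can be tried in Python on the two lines from the question. It assumes the (?:\n|$) variant so that the last line needs no trailing newline, and it writes " without a backslash because no escaping is needed inside a single-quoted raw string.

```python
import re

# [a-z_]+ followed, up to the end of its line, by an even number of double quotes.
pattern = re.compile(r'[a-z_]+(?=(?:(?:[^"\n]*"){2})*[^"\n]*(?:\n|$))')

# The two example lines from the question, joined by a newline.
text = (
    'results[0].items[0].packages[0].settings["compiler.version"]\n'
    'results[0].items[0].packages[0].settings.build_type'
)

words = pattern.findall(text)
print(words)
# ['results', 'items', 'packages', 'settings',
#  'results', 'items', 'packages', 'settings', 'build_type']
```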