I want to extract the first string from a sequence of comma-separated strings. This is part of a log parser, so I have no control over the input. To avoid problems in the future, this should handle nested quotation marks and special characters (e.g., backslashes).
Similar questions have come up a lot on SO, and I have tried many regex variants. Unfortunately, the previous solutions I found seem to cover only a subset of possible inputs. Here are some of the more complicated inputs, in Ruby code:
re = /.../ # a regex
s1 = '"some string", "other string", "third string"'
s2 = '"hello \"world\"", "other string", "third string"'
s3 = '"It says \"It says \\\"No\\\"\"", "other string", "third string"'
s4 = '"C:\\dir\\", "other string", "third string"'
s5 = '"may include \",\" etc", "other string", "third string"'
s1.scan(re)
s2.scan(re)
s3.scan(re)
s4.scan(re)
s5.scan(re)
Expected output
- s1:
'some string'
- s2:
'hello \"world\"'
- s3:
'It says \"It says \\\"No\\\"\"'
- s4:
'C:\\dir\\'
- s5:
'may include \",\" etc'
My two solutions so far are:
re1 = /"((?:[^"\\]|\\.)*)"/
-- Does not match s3 and s4
re2 = /"(.*?)(?<!\\)"/
-- Does not match s4
Can someone do better than me?
EDIT: I added s5 to clarify why something like /"([^,]*)",/ would not work.
CodePudding user response:
re = /"(.*?)(?<!\\)(\\\\)*"/
Find an opening quote, then match everything that follows up to a quote that is preceded by an even number of backslashes (possibly zero), where that run of backslashes is not itself preceded by another backslash. A quote preceded by an odd number of backslashes is therefore treated as escaped and does not end the match.
The question mark after .* makes the match lazy, so it stops at the first valid closing quote, not the last one.
After this, just take the first match.
Here are the test cases on Rubular: https://rubular.com/r/zrObVOHlnezQyS
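As a rough sketch of "just take the first match", the snippet below applies the regex from this answer to a few raw lines. The helper name first_field, the constant RE, and the sample lines are illustrative assumptions, not part of the answer. The lines are written in a quoted heredoc so that the backslashes reach the regex exactly as typed, the way they would when read from a log file.

```ruby
# The regex from the answer above, unchanged.
RE = /"(.*?)(?<!\\)(\\\\)*"/

# Hypothetical helper (not from the answer): returns the first quoted field
# without its outer quotes, or nil when the line has no quoted field.
def first_field(line)
  whole = line[RE]       # String#[] with a regex returns the whole first match
  whole && whole[1..-2]  # drop the opening and closing quote
end

# A quoted heredoc keeps backslashes exactly as typed, like raw log text.
raw_lines = <<~'LOG'.lines.map(&:chomp)
  "some string", "other string"
  "hello \"world\"", "other string"
  "C:\\dir\\", "other string"
  "may include \",\" etc", "other string"
LOG

raw_lines.each { |line| puts first_field(line) }
```

The helper takes the whole match rather than the first capture group because a trailing pair of backslashes, as in the third line, is consumed by the second group and would otherwise be lost.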
CodePudding user response:
It is instructive to see how Ruby interprets the sample strings:
s1 = '"some string", "other string", "third string"'
#=> "\"some string\", \"other string\", \"third string\""
s2 = '"hello \"world\"", "other string", "third string"'
#=> "\"hello \\\"world\\\"\", \"other string\", \"third string\""
s3 = '"It says \"It says \\\"No\\\"\"", "other string", "3rd string"'
#=> "\"It says \\\"It says \\\\\"No\\\\\"\\\"\", \"other string\", \"3rd string\""
s4 = '"C:\\dir\\", "other string", "third string"'
#=> "\"C:\\dir\\\", \"other string\", \"third string\""
s5 = '"may include \",\" etc", "other string", "third string"'
#=> "\"may include \\\",\\\" etc\", \"other string\", \"third string\""
Consider also
s = '"\""'
#=> "\"\\\"\""
s.chars
#=> ["\"", "\\", "\"", "\""]
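The results above follow from how Ruby reads single-quoted literals: only \\ and \' are escape sequences there, so a backslash before a double quote is kept as a character of its own. A minimal check of that rule (the variable names are illustrative only):

```ruby
# In a single-quoted literal only \\ and \' are escape sequences;
# every other backslash is kept as a literal character.
escaped_quote = '\"'    # backslash followed by a double quote
one_backslash = '\\'    # \\ collapses to a single backslash
mixed         = '\\\"'  # \\ becomes \, then \" stays as \"

p escaped_quote.chars  #=> ["\\", "\""]
p one_backslash.chars  #=> ["\\"]
p mixed.chars          #=> ["\\", "\\", "\""]
```

This is why the strings in the question contain fewer backslashes than their source text suggests, which matters when comparing them with raw log input.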
I will give two ways of solving the problem: using a regular expression and a stack.
Use a regular expression
r = /\A(?:[^",]*"[^"]*")*[^",]*/
'"some string", "other string", "third string"'[r]
#=> "\"some string\""
'"hello \"world\"", "other string", "third string"'[r]
#=> "\"hello \\\"world\\\"\""
'"It says \"It says \\\"No\\\"\"", "other string", "third string"'[r]
#=> "\"It says \\\"It says \\\\\"No\\\\\"\\\"\""
'"C:\\dir\\", "other string", "third string"'[r]
#=> "\"C:\\dir\\\""
'"may include \",\" etc", "other string", "third string"'[r]
#=> "\"may include \\\""
'"abc,def"ghi,jkl'[r]
#=> "\"abc,def\"ghi"
Note that the "expected output" for s5 shown in the question is incorrect.
The regular expression can be written in free-spacing mode to make it self-documenting.
r = /
  \A          # match beginning of string
  (?:         # begin a non-capture group
    [^",]*    # match zero or more (*) chars other than " and ,
    "         # match "
    [^"]*     # match zero or more (*) chars other than "
    "         # match "
  )*          # end non-capture group and execute it >= 0 times
  [^",]*      # match zero or more (*) chars other than " and ,
/x            # free-spacing regex definition mode
Use a stack
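def doit(str)
  stack = []  # holds an opening quote while we are inside a quoted section
  str.each_char.with_object('') do |c, a|
    case c
    when '"'
      if stack.last == c
        stack.pop   # closing quote: leave the quoted section
      else
        stack << c  # opening quote: enter a quoted section
      end
      a << c
    when ','
      break a if stack.empty?  # a comma outside quotes ends the first field
      a << c
    else
      a << c
    end
  end
end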
doit '"some string", "other string", "third string"'
#=> "\"some string\""
doit '"hello \"world\"", "other string", "third string"'
#=> "\"hello \\\"world\\\"\""
doit '"It says \"It says \\\"No\\\"\"", "other string", "third string"'
#=> "\"It says \\\"It says \\\\\"No\\\\\"\\\"\""
doit '"C:\\dir\\", "other string", "third string"'
#=> "\"C:\\dir\\\""
doit '"may include \",\" etc", "other string", "third string"'
#=> "\"may include \\\""
doit '"abc,def"ghi,jkl'
#=> "\"abc,def\"ghi"