Regex: match string between unescaped quotes [closed]

Time:10-05

I want to extract the first string from a sequence of comma-separated strings. This is part of a log parser, so I have no control over the input. To avoid problems in the future, this should handle nested quotation or special symbols (e.g., backslashes).

Similar questions have come up a lot on SO, and I have tried many regex variants. Unfortunately, the previous solutions I found seem to cover only a subset of possible inputs. So here are some of the more complicated inputs, in Ruby code:

re = /.../ # a regex
s1 = '"some string", "other string", "third string"'
s2 = '"hello \"world\"", "other string", "third string"'
s3 = '"It says \"It says \\\"No\\\"\"", "other string", "third string"'
s4 = '"C:\\dir\\", "other string", "third string"'
s5 = '"may include \",\" etc", "other string", "third string"'
s1.scan(re) 
s2.scan(re)
s3.scan(re)
s4.scan(re)
s5.scan(re)

Expected output

  • s1: 'some string'
  • s2: 'hello \"world\"'
  • s3: 'It says \"It says \\\"No\\\"\"'
  • s4: 'C:\\dir\\'
  • s5: 'may include \",\" etc'

My two solutions so far are:

  • re1 = /"((?:[^"\\]|\\.)*)"/ -- Does not match s3 and s4
  • re2 = /"(.*?)(?<!\\)"/ -- Does not match s4

Can someone do better than me?

EDIT: I added s5 to clarify why something like /"([^,]*)",/ would not work.
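
To make the point about s5 concrete, here is a minimal sketch of how the comma-based pattern behaves on it. The variable names `naive` and `first` are only illustrative.

```ruby
# The comma-based pattern stops at the first `",` it can find,
# even when that quote is escaped and the comma is inside the string.
naive = /"([^,]*)",/
s5 = '"may include \",\" etc", "other string", "third string"'

first = s5[naive, 1]
puts first  # prints: may include \
```

The capture ends at the escaped quote, so everything after it (`,\" etc`) is lost.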

CodePudding user response:

re = /"(.*?)(?<!\\)(\\\\)*"/

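Find an opening quote, then lazily match everything up to a quote that is preceded by an even number of backslashes (possibly zero), where that run of backslashes is not itself preceded by another backslash. In other words, a quote preceded by an odd number of backslashes is treated as escaped and is not accepted as the closing quote.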

The question mark after .* makes the match lazy, so it stops at the first acceptable closing quote rather than the last one.

After this, just take the first match.

Here are the test cases on Rubular: https://rubular.com/r/zrObVOHlnezQyS
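
The same check can be sketched in plain Ruby. This assumes the raw log text contains the backslashes exactly as they are typed in the question (as they would be when pasted into Rubular), so the sample lines are placed in a single-quoted heredoc, which performs no escape processing. The names `lines` and `firsts` are illustrative. Note that trailing backslash pairs land in the second capture group, so this sketch takes the whole match and strips the outer quotes instead of reading group 1 alone.

```ruby
# Raw log lines: a single-quoted heredoc keeps every backslash literally.
lines = <<~'LOG'.lines(chomp: true)
  "some string", "other string", "third string"
  "hello \"world\"", "other string", "third string"
  "It says \"It says \\\"No\\\"\"", "other string", "third string"
  "C:\\dir\\", "other string", "third string"
  "may include \",\" etc", "other string", "third string"
LOG

re = /"(.*?)(?<!\\)(\\\\)*"/

# Take the whole first match and drop the surrounding quotes, so that
# trailing backslash pairs (captured by group 2) are not lost.
firsts = lines.map { |line| line[re][1..-2] }
puts firsts
```

On these five lines the sketch prints the five strings listed under "Expected output" in the question.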

CodePudding user response:

It is instructive to see how Ruby interprets the sample strings:

s1 = '"some string", "other string", "third string"'
  #=> "\"some string\", \"other string\", \"third string\""
s2 = '"hello \"world\"", "other string", "third string"'
  #=> "\"hello \\\"world\\\"\", \"other string\", \"third string\""
s3 = '"It says \"It says \\\"No\\\"\"", "other string", "3rd string"'
  #=> "\"It says \\\"It says \\\\\"No\\\\\"\\\"\", \"other string\", \"3rd string\""
s4 = '"C:\\dir\\", "other string", "third string"'
  #=> "\"C:\\dir\\\", \"other string\", \"third string\""
s5 = '"may include \",\" etc", "other string", "third string"'
  #=> "\"may include \\\",\\\" etc\", \"other string\", \"third string\""

Consider also

s = '"\""'
  #=> "\"\\\"\""
s.chars
  #=> ["\"", "\\", "\"", "\""]
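
The listings above hint at an easy-to-miss detail: inside a single-quoted Ruby literal, each `\\` collapses to one backslash, so the string Ruby holds is not character-for-character what was typed. A small sketch of the difference, using a single-quoted heredoc (which does no escape processing) as a stand-in for a raw log line; `literal` and `raw` are illustrative names:

```ruby
# Same text typed two ways: as a single-quoted literal and as raw heredoc text.
literal = '"C:\\dir\\", "other string", "third string"'
raw = <<~'LOG'.chomp
  "C:\\dir\\", "other string", "third string"
LOG

literal_backslashes = literal.chars.count('\\')  # 2: each \\ became one backslash
raw_backslashes     = raw.chars.count('\\')      # 4: nothing was collapsed

puts literal  # "C:\dir\", "other string", "third string"
puts raw      # "C:\\dir\\", "other string", "third string"
```

So whether a pattern "works" on s3 and s4 depends on whether it is tested against the Ruby literals or against the raw text.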

I will give two ways of solving the problem: using a regular expression and a stack.

Use a regular expression

r = /\A(?:[^",]*"[^"]*")*[^",]*/
'"some string", "other string", "third string"'[r]
  #=> "\"some string\""
'"hello \"world\"", "other string", "third string"'[r]
  #=> "\"hello \\\"world\\\"\""
'"It says \"It says \\\"No\\\"\"", "other string", "third string"'[r]
  #=> "\"It says \\\"It says \\\\\"No\\\\\"\\\"\""
'"C:\\dir\\", "other string", "third string"'[r]
  #=> "\"C:\\dir\\\""
'"may include \",\" etc", "other string", "third string"'[r]
  #=> "\"may include \\\""
'"abc,def"ghi,jkl'[r]
  #=> "\"abc,def\"ghi"


Note that the "expected output" for s5 shown in the question is incorrect.

The regular expression can be written in free-spacing mode to make it self-documenting.

r = /
    \A        # match beginning of string
    (?:       # begin a non-capture group
      [^",]*  # match zero or more (*) chars other than " and ,
      "       # match " 
      [^"]*   # match zero or more (*) chars other than "
      "       # match " 
    )*        # end non-capture group and execute it >= 0 times
    [^",]*    # match zero or more (*) chars other than " and ,
    /x        # free-spacing regex definition mode
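
As a quick sanity check, the compact and free-spacing forms can be compared on the sample strings. This is only a sketch; `compact`, `spaced`, `samples` and `same` are illustrative names, and the free-spacing pattern here is a slightly condensed restatement of the one above.

```ruby
compact = /\A(?:[^",]*"[^"]*")*[^",]*/

spaced = /
  \A          # start of string
  (?:         # group: text followed by one quoted run
    [^",]*    # text outside quotes, no comma
    "[^"]*"   # a quoted run, commas allowed inside
  )*
  [^",]*      # trailing text up to the first top-level comma
/x

samples = [
  '"some string", "other string", "third string"',
  '"hello \"world\"", "other string", "third string"',
  '"C:\\dir\\", "other string", "third string"',
  '"may include \",\" etc", "other string", "third string"',
  '"abc,def"ghi,jkl'
]

# Both forms should extract the same leading field from every sample.
same = samples.all? { |s| s[compact] == s[spaced] }
puts same
```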

Use a stack

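def doit(str)
  stack = []                   # holds a '"' while we are inside a quoted run
  str.each_char.with_object(+'') do |c,a|
    case c
    when '"'
      if stack.last == c
        stack.pop              # closing quote
      else
        stack << c             # opening quote
      end
      a << c
    when ','
      break a if stack.empty?  # first comma outside quotes ends the field
      a << c
    else
      a << c
    end
  end
end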
doit '"some string", "other string", "third string"'
  #=> "\"some string\""
doit '"hello \"world\"", "other string", "third string"'
  #=> "\"hello \\\"world\\\"\""
doit '"It says \"It says \\\"No\\\"\"", "other string", "third string"'
  #=> "\"It says \\\"It says \\\\\"No\\\\\"\\\"\""
doit '"C:\\dir\\", "other string", "third string"'
  #=> "\"C:\\dir\\\""
doit '"may include \",\" etc", "other string", "third string"'
  #=> "\"may include \\\""
doit '"abc,def"ghi,jkl'
  #=> "\"abc,def\"ghi"