How to match multi-line text using the right regex capture group?

Time:01-31

I'm trying to read in a CSV and split each row using regex capture groups. The last column of the CSV has newline characters in it and my regex's second capture group seems to be breaking at the first occurrence of that newline character and not capturing the rest of the string.

Below is what I've managed to do so far. Every record starts with ABC-, so I put that in my first capturing group, and everything after it, up to the next occurrence of ABC- or the end of the file (for the last record), should be captured by the second capturing group. The first row works as expected because it contains no newline characters, but the rest do not.

My regex: ([A-Z1-9]+)-\d*,(.*)

My test string:

ABC-1,01/01/1974,X1,Y1,Z1,"RANDOM SINLGE LINE TEXT 1",
ABC-2,01/01/1974,X2,Y2,Z2,"THIS IS
A RANDOM

MULTI LINE
TEXT 2",
ABC-3,01/01/1974,X3,Y3,Z3,"THIS IS

ANOTHER RANDOM
MULTI LINE TEXT",

Expected result is:

3 matches

Match 1:

Group 1: ABC-1,

Group 2: 01/01/1974,X1,Y1,Z1,"RANDOM SINLGE LINE TEXT 1",

Match 2:

Group 1: ABC-2,

Group 2: 01/01/1974,X2,Y2,Z2,"THIS IS

A RANDOM

MULTI LINE

TEXT 2",

Match 3:

Group 1: ABC-3,

Group 2: 01/01/1974,X3,Y3,Z3,"THIS IS

ANOTHER RANDOM

MULTI LINE TEXT",


CodePudding user response:

You can try limiting the second group with a lookahead assertion:

(ABC-\d+,)(.*?(?=^ABC|\z))

Demo here.
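
A minimal sketch of this lookahead approach, assuming Python (the question names no language, so that choice is an assumption). Python's re module spells the end-of-string anchor \Z, so it stands in for \z here. The pattern also needs the multiline flag, so that ^ matches at each line start, and the dot-all flag, so that . can cross newlines. The text variable is a shortened copy of the question's test string.

```python
import re

# Shortened sample in the shape of the question's test string.
text = "\n".join([
    'ABC-1,01/01/1974,X1,Y1,Z1,"RANDOM SINLGE LINE TEXT 1",',
    'ABC-2,01/01/1974,X2,Y2,Z2,"THIS IS',
    'A RANDOM',
    '',
    'MULTI LINE',
    'TEXT 2",',
])

# re.M lets ^ match at every line start; re.S lets . match newlines.
# \Z is Python's end-of-string anchor (written \z in the answer's pattern).
pattern = re.compile(r"(ABC-\d+,)(.*?(?=^ABC|\Z))", re.M | re.S)

lookahead_matches = [(m.group(1), m.group(2)) for m in pattern.finditer(text)]

for key, rest in lookahead_matches:
    print(repr(key), repr(rest))
```

With this pattern, group 2 keeps the newline that precedes the next record, and any data line that happens to begin with ABC would also end the match early.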

CodePudding user response:

You can use

^([A-Z]+-\d+),(.*(?:\n(?![A-Z]+-\d+,).*)*)

See the regex demo. Use it with the multiline flag, unless the engine is Ruby, where ^ already matches at line start positions.

Details:

  • ^ - start of a line
  • ([A-Z]+-\d+) - Group 1: one or more uppercase ASCII letters, then - and one or more digits
  • , - a comma
  • (.*(?:\n(?![A-Z]+-\d+,).*)*) - Group 2:
    • .* - the rest of the line
    • (?:\n(?![A-Z]+-\d+,).*)* - zero or more lines that do not start with one or more uppercase ASCII letters, then -, one or more digits, and a comma
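
The breakdown above can be checked with a short sketch, again assuming Python (an assumption, since the question does not say which engine is in use). It runs the pattern with re.MULTILINE over the question's test string; the names record_re and records are illustrative only.

```python
import re

# The question's test string, joined without a trailing newline.
text = "\n".join([
    'ABC-1,01/01/1974,X1,Y1,Z1,"RANDOM SINLGE LINE TEXT 1",',
    'ABC-2,01/01/1974,X2,Y2,Z2,"THIS IS',
    'A RANDOM',
    '',
    'MULTI LINE',
    'TEXT 2",',
    'ABC-3,01/01/1974,X3,Y3,Z3,"THIS IS',
    '',
    'ANOTHER RANDOM',
    'MULTI LINE TEXT",',
])

# re.MULTILINE makes ^ match at the start of every line.
# No dot-all flag is needed: newlines are consumed explicitly by \n.
record_re = re.compile(
    r"^([A-Z]+-\d+),(.*(?:\n(?![A-Z]+-\d+,).*)*)",
    re.MULTILINE,
)

records = [(m.group(1), m.group(2)) for m in record_re.finditer(text)]

for record_id, rest in records:
    print(record_id, "->", repr(rest))
```

Note that group 1 here holds the record ID without its trailing comma, since the comma sits outside the group. If the input ends with a newline, the last group 2 may pick up that trailing newline, so stripping it afterwards might be needed.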