Matching everything except for a character followed by a newline

Time:01-02

This seems like a simple match, but I can't figure out how to match all text that starts with a known block of text and ends with a semicolon followed by a newline. What I have right now mostly works:

pattern = r'''[ ]+(value \w+\n)([^;]+)'''

It lets me parse a section of text such as:

   value Y1N5NALC
      1 = 'Yes'  
      5 = 'No'  
      7 = 'Not ascertained' ;
   value AGESCRN
      15 = '15 years'  
      16 = '16 years';  

However, if any of the key/value pairs contains a semicolon inside the quoted string, the match ends early, because the negated character class stops at the first semicolon it meets. An example:

   value Y1N5NALC
      1 = 'Yes'  
      5 = 'No;Maybe'  
      7 = 'Not ascertained' ;
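
The early stop can be reproduced with a minimal sketch, assuming the question's pattern reads `[ ]+(value \w+\n)([^;]+)` and the sample text is held in a string (the variable names are illustrative only):

```python
import re

# Pattern from the question: [^;]+ consumes everything up to the first ";".
pattern = r'''[ ]+(value \w+\n)([^;]+)'''

text = (
    "   value Y1N5NALC\n"
    "      1 = 'Yes'\n"
    "      5 = 'No;Maybe'\n"
    "      7 = 'Not ascertained' ;\n"
)

m = re.search(pattern, text)
header, body = m.group(1), m.group(2)

# The body is cut off inside 'No;Maybe', so the last pair is never captured.
print(repr(body))
```

The second group ends right before the semicolon inside the quoted value, so the `7 = 'Not ascertained'` pair is lost.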

What I'd like to do is end the match at a semicolon followed by optional spaces or tabs and then a newline. Using ([^;\n]+) fails, since the newline is then excluded by the negated character class and the match cannot span multiple lines.

CodePudding user response:

You can use

(?sm)^ +(value \w+\n)(.*?);$

See the regex demo.

Details:

  • (?sm) - re.S and re.M are on
  • ^ - start of a line
  •  + (a space followed by +) - one or more spaces
  • (value \w+\n) - Group 1: value, a space, one or more word chars, and an LF line break
  • (.*?) - Group 2: any zero or more chars, including line breaks, as few as possible
  • ; - a ;
  • $ - at the end of a line.
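
As a rough illustration of how this pattern might be applied in Python, here is a minimal sketch that runs it over the sample text from the question. The variable names (`block_re`, `blocks`, `names`) are made up for the example:

```python
import re

# (?s) lets "." cross line breaks; (?m) makes ^ and $ match at each line.
block_re = re.compile(r"(?sm)^ +(value \w+\n)(.*?);$")

text = (
    "   value Y1N5NALC\n"
    "      1 = 'Yes'\n"
    "      5 = 'No;Maybe'\n"
    "      7 = 'Not ascertained' ;\n"
    "   value AGESCRN\n"
    "      15 = '15 years'\n"
    "      16 = '16 years';\n"
)

# Collect (header, body) for every block; the lazy .*? stops only at a ";"
# that sits at the end of a line, so the ";" inside 'No;Maybe' is skipped.
blocks = [(m.group(1).strip(), m.group(2)) for m in block_re.finditer(text)]
names = [name for name, _ in blocks]

for name, body in blocks:
    print(name)
    print(body)
```

Because `;$` requires the semicolon to be the last character on its line, this sketch assumes there is no trailing whitespace after the closing semicolon.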

In case there can be CRLF endings, you need

(?sm)^ +(value \w+\r?\n)(.*?);\r?$
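
The question also asks for optional spaces or tabs after the closing semicolon. One possible adaptation, which is an assumption on top of the answer rather than part of it, is to add `[ \t]*` before the optional `\r`. A minimal sketch with CRLF input and trailing blanks follows. The key/value sub-pattern is likewise only illustrative:

```python
import re

# Assumed adaptation (not from the answer): [ \t]* tolerates blanks after ";".
block_re = re.compile(r"(?sm)^ +(value \w+\r?\n)(.*?);[ \t]*\r?$")

text = (
    "   value AGESCRN\r\n"
    "      15 = '15 years'  \r\n"
    "      16 = '16 years';  \r\n"
)

m = block_re.search(text)
name = m.group(1).strip()

# Illustrative follow-up: split the block body into (key, label) pairs.
pairs = re.findall(r"(\d+) = '([^']*)'", m.group(2))
print(name, pairs)
```

In Python, `$` under `re.M` matches just before `\n`, which is why the `\r` of a CRLF ending has to be consumed explicitly with `\r?`.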