Pattern Matching to find trailing spaces outside of text fields in a line

Time:01-05

I have to validate the lines of a text file. A line looks like this:

"Field1" "Field2" "Field3 Field_3.1 Field3.2" 23 3445 "Field5".

The delimiter is a single space (\s). If more than one space appears between fields, outside the quoted text fields, the line should be rejected. For example:

Note: \s below stands for a literal space character in the line. I write it as \s only for readability.

Invalid: "Field1"\s\s"Field2" "Field3 Field_3.1 Field3.2" 23\s\s3445 "Field5". // two or more spaces between "Field1" and "Field2", or between the numeric fields 23 and 3445

Valid: "Field1\s\s" "\s\sField2" "Field3\s\sField_3.1\s\sField3.2" 23 3445 "Field5". // two or more spaces inside the third field "Field3 Field_3.1 Field3.2", or at the end/beginning of a field as in the first two fields

I created the pattern below to validate the spaces between fields. It does not work as expected when a field wrapped in double quotes contains two or more strings followed by a number, such as "Field3 Field_3.1 123".

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpaceValidation
{
    public static void main(String[] args)
    {
        // Pattern under test: find() returning true means the line is treated as invalid.
        String spacePattern_1 = "[\"^\\n]\\s{2,}?(\".*\")|\\s\\s\\d|\\d\\s\\s";
        String line1 = "Field3  Field_3.1  ";    // valid, and the pattern does not flag it: works as expected
        String line2 = "Field3  Field_3.1  123"; // valid, but the pattern flags it as invalid: not working as expected
        Pattern pattern = Pattern.compile(spacePattern_1);
        Matcher matLine1 = pattern.matcher(line1);
        Matcher matLine2 = pattern.matcher(line2);
        if (matLine1.find())
        {
            System.out.println("Invalid Line1");
        }

        if (matLine2.find())
        {
            System.out.println("Invalid Line2");
        }
    }
}

I also tried the pattern below, but I have to avoid it because of reported backtracking issues. It also fails when a line contains more than two subfields separated by two or more spaces.

(\".*\")\\s{2,}?(\".*\")|\\s\\s\\d|\\d\\s\\s

Note: to prevent backtracking, * or . should not appear more than once in the same alternative, which is why I used the negation of \\n in the code above.

Kindly let me know how I can resolve this with a pattern, so that a valid field such as "field3 field3.1 123" is accepted. Thanks in advance.

EDIT: After a little tinkering, I narrowed the issue down to digits. The line is flagged as invalid only if the third subfield is numeric ("Field 3 Field3.1 123"). With letters it works fine.

In the pattern, \\s\\s\\d seems to be the culprit. It is that alternative that flags the numeric third subfield (123) as invalid. But I need it to validate numeric fields that sit outside the double quotes.

Answer:

You can use

^(?:\"[^\"]*\"|\d+)(?:\s(?:\"[^\"]*\"|\d+))*$

If you are using it to extract lines from a multiline document:

(?m)^(?:\"[^\"\n\r]*\"|\d+)(?:\h(?:\"[^\"\n\r]*\"|\d+))*\r?$


Details:

  • ^ - start of the string (or of a line, if you use (?m) or Pattern.MULTILINE)
  • (?:\"[^\"]*\"|\d+) - either a ", zero or more characters other than ", and a closing ", or one or more digits
  • (?:\s(?:\"[^\"]*\"|\d+))* - zero or more sequences of
    • \s - a single whitespace character
    • (?:\"[^\"]*\"|\d+) - either a ", zero or more characters other than ", and a closing ", or one or more digits
  • $ - end of the string
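
The breakdown above can be checked against the two sample lines from the question. This is a minimal sketch, not a definitive implementation: the class name LineValidator and the method isValid are hypothetical, the pattern assumes "one or more digits" (\d+) for unquoted fields as described in the details, and the sample lines omit the trailing period shown in the question.

```java
import java.util.regex.Pattern;

class LineValidator {
    // A field is either a quoted string without embedded quotes or a run of digits.
    // Fields must be separated by exactly one whitespace character.
    private static final Pattern LINE = Pattern.compile(
            "^(?:\"[^\"]*\"|\\d+)(?:\\s(?:\"[^\"]*\"|\\d+))*$");

    // Hypothetical helper: true when the whole line fits the field layout.
    static boolean isValid(String line) {
        return LINE.matcher(line).matches();
    }

    public static void main(String[] args) {
        // Extra spaces and digits inside quotes are allowed.
        String valid = "\"Field1  \" \"  Field2\" \"Field3  Field_3.1  123\" 23 3445 \"Field5\"";
        // Two spaces between fields are rejected.
        String invalid = "\"Field1\"  \"Field2\" \"Field3 Field_3.1 Field3.2\" 23  3445 \"Field5\"";

        System.out.println(isValid(valid));   // prints true
        System.out.println(isValid(invalid)); // prints false
    }
}
```

Because the whole line is matched from start to end, digits inside a quoted field are consumed as part of that field and are never tested as a standalone numeric field, which is what the unanchored \\s\\s\\d alternative in the question could not guarantee.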

The second pattern uses \h instead of \s so that only horizontal whitespace is matched, and [^\"\n\r] matches any character other than ", line feed and carriage return.

In Java:

String pattern = "^(?:\"[^\"]*\"|\\d+)(?:\\s(?:\"[^\"]*\"|\\d+))*$";
String multilinePattern = "(?m)^(?:\"[^\"\n\r]*\"|\\d+)(?:\\h(?:\"[^\"\n\r]*\"|\\d+))*\r?$";
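
For the multiline variant, a rough sketch of pulling the valid lines out of a document with Matcher.find() could look like the following. The class name ValidLineExtractor, the method validLines and the sample document are assumptions made for illustration, and the sample uses \n line endings only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ValidLineExtractor {
    // (?m) makes ^ and $ anchor at line boundaries; \h matches horizontal whitespace only,
    // so a match can never run across a line break.
    private static final Pattern VALID_LINE = Pattern.compile(
            "(?m)^(?:\"[^\"\n\r]*\"|\\d+)(?:\\h(?:\"[^\"\n\r]*\"|\\d+))*\r?$");

    // Hypothetical helper: collects every line of the document that fits the layout.
    static List<String> validLines(String document) {
        List<String> result = new ArrayList<>();
        Matcher m = VALID_LINE.matcher(document);
        while (m.find()) {
            // Defensive: drop a trailing carriage return if the match captured one.
            String line = m.group();
            if (line.endsWith("\r")) {
                line = line.substring(0, line.length() - 1);
            }
            result.add(line);
        }
        return result;
    }

    public static void main(String[] args) {
        String document = "\"A\" 12 \"B C\"\n"   // valid
                + "\"A\"  12\n"                  // invalid: two spaces between fields
                + "34 \"x  y\"\n";               // valid: the double space is inside quotes
        System.out.println(validLines(document).size()); // prints 2
    }
}
```

Lines that do not fit the layout are simply skipped, so this form suits filtering. To report which lines were rejected, validating line by line is the simpler option.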