I have to validate the lines of a text file. A line looks something like this:
"Field1" "Field2" "Field3 Field_3.1 Field3.2" 23 3445 "Field5".
The delimiter is a single space (\s). If more than one space appears outside the text fields, the line should be rejected.
Note: in the examples below, \s stands for a literal space in the line, not the two characters \s. I write spaces as \s only for easier reading.
Invalid:
"Field1"\s\s"Field2" "Field3 Field_3.1 Field3.2" 23\s\s3445 "Field5".
// two or more spaces between "Field1" and "Field2", or between the numeric fields 23 and 3445
Valid:
"Field1\s\s" "\s\sField2" "Field3\s\sField_3.1\s\sField3.2" 23 3445 "Field5".
// two or more spaces inside the third field "Field3 Field_3.1 Field3.2", or at the end/beginning of a field, as in the first two fields
I created the pattern below to validate the spaces between fields. It does not work as expected when a field wrapped in double quotes contains more than two strings and a number, such as "Field3 Field_3.1 123".
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpaceValidation
{
    public static void main(String[] args)
    {
        String spacePattern_1 = "[\"^\\n]\\s{2,}?(\".*\")|\\s\\s\\d|\\d\\s\\s";
        String line1 = "Field3 Field_3.1 "; // valid, and the pattern does not flag it: works as expected
        String line2 = "Field3 Field_3.1 123"; // valid, but the pattern flags it as invalid: not working as expected
        Pattern pattern = Pattern.compile(spacePattern_1);
        Matcher matLine1 = pattern.matcher(line1);
        Matcher matLine2 = pattern.matcher(line2);
        // find() succeeds as soon as any alternative matches anywhere in the line
        if (matLine1.find())
        {
            System.out.println("Invalid Line1");
        }
        if (matLine2.find())
        {
            System.out.println("Invalid Line2");
        }
    }
}
I also tried the pattern below, but I have to avoid it because of reported backtracking issues. It does not work either when a line has more than two subfields separated by two or more spaces.
(\".*\")\\s{2,}?(\".*\")|\\s\\s\\d|\\d\\s\\s
// * or . should not appear more than once in the same alternative, to prevent backtracking; hence I used the negation of \\n in the code above
Kindly let me know how I could resolve this with a pattern for fields such as "field3 field3.1 123", which is a valid field. Thanks in advance.
EDIT:
After a bit of tinkering, I narrowed the issue down to digits. The line becomes invalid only if the third subfield is numeric ("Field 3 Field3.1 123"). For alphabetic subfields it works fine.
The \\s\\s\\d alternative in the pattern seems to be the culprit. It is the condition that flags the numeric third subfield (123) as invalid. But I need it to validate the numeric fields that appear outside the double quotes.
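To make the observation above concrete, here is a minimal sketch that isolates the \\s\\s\\d alternative. The class name, the helper flags, and both input strings are made up for illustration. Because find() searches anywhere in the text, the alternative cannot tell whether the two spaces sit inside or outside a quoted field.

```java
import java.util.regex.Pattern;

class DigitAlternativeDemo {
    // The \s\s\d alternative of the pattern, taken on its own.
    static final Pattern TWO_SPACES_THEN_DIGIT = Pattern.compile("\\s\\s\\d");

    // True when two whitespace characters are directly followed by a digit
    // anywhere in the text, regardless of surrounding quotes.
    static boolean flags(String text) {
        return TWO_SPACES_THEN_DIGIT.matcher(text).find();
    }

    public static void main(String[] args) {
        // Hypothetical inputs; each contains one run of exactly two spaces.
        String insideQuotes = "\"Field3 Field_3.1  123\""; // two spaces inside the quotes: should be valid
        String outsideQuotes = "\"Field1\" 23  3445";      // two spaces between fields: should be invalid

        System.out.println(flags(insideQuotes));  // prints true, so the valid field is flagged
        System.out.println(flags(outsideQuotes)); // prints true, the intended detection
    }
}
```

Both calls report a match, which suggests the check needs to know about the quotes rather than look for double spaces in isolation.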
CodePudding user response:
You can use
^(?:\"[^\"]*\"|\d+)(?:\s(?:\"[^\"]*\"|\d+))*$
If you are using it to extract lines from a multiline document:
(?m)^(?:\"[^\"\n\r]*\"|\d+)(?:\h(?:\"[^\"\n\r]*\"|\d+))*\r?$
Details:
- ^ - start of string (or of a line, if you use (?m) or Pattern.MULTILINE)
- (?:\"[^\"]*\"|\d+) - either a ", zero or more chars other than ", and a ", or one or more digits
- (?:\s(?:\"[^\"]*\"|\d+))* - zero or more sequences of
  - \s - a single whitespace
  - (?:\"[^\"]*\"|\d+) - either a ", zero or more chars other than ", and a ", or one or more digits
- $ - end of string
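The breakdown above can be sketched in Java roughly as follows. The class name LineValidator, the FIELD constant, and the sample lines are illustrative assumptions; the sample lines are adapted from the question, without the trailing period.

```java
import java.util.regex.Pattern;

class LineValidator {
    // One field: a double-quoted run with no inner quotes, or one or more digits.
    private static final String FIELD = "(?:\"[^\"]*\"|\\d+)";

    // A whole line: one field, then any number of (single whitespace + field) groups.
    private static final Pattern LINE =
            Pattern.compile("^" + FIELD + "(?:\\s" + FIELD + ")*$");

    // matches() requires the pattern to cover the entire line.
    static boolean isValid(String line) {
        return LINE.matcher(line).matches();
    }

    public static void main(String[] args) {
        // Double spaces only inside quoted fields.
        String valid = "\"Field1  \" \"  Field2\" \"Field3  Field_3.1  123\" 23 3445 \"Field5\"";
        // Double spaces between fields.
        String invalid = "\"Field1\"  \"Field2\" 23  3445 \"Field5\"";

        System.out.println(isValid(valid));   // prints true
        System.out.println(isValid(invalid)); // prints false
    }
}
```

Since [^\"]* can never cross a quote, there is only one way to split a line into fields, which should keep backtracking limited. Note that \s also accepts a tab as the separator; use a literal space in the pattern if only spaces are allowed.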
The second pattern uses \h instead of \s to match only horizontal whitespace, and [^\"\n\r] matches any char other than ", line feed, and carriage return.
In Java:
String pattern = "^(?:\"[^\"]*\"|\\d+)(?:\\s(?:\"[^\"]*\"|\\d+))*$";
// For a multiline document:
String multilinePattern = "(?m)^(?:\"[^\"\n\r]*\"|\\d+)(?:\\h(?:\"[^\"\n\r]*\"|\\d+))*\r?$";
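As a rough usage sketch for the multiline variant, the following collects every valid line from a document held in one string. The class name MultilineFieldScanner, the method validLines, and the sample document are assumptions made for illustration. \h requires Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineFieldScanner {
    // One field: a quoted run without quotes or line breaks, or one or more digits.
    private static final String FIELD = "(?:\"[^\"\\n\\r]*\"|\\d+)";

    // (?m) lets ^ and $ match at line boundaries; \h matches one horizontal whitespace.
    private static final Pattern VALID_LINE =
            Pattern.compile("(?m)^" + FIELD + "(?:\\h" + FIELD + ")*\\r?$");

    // Returns the lines that consist only of fields separated by a single \h.
    static List<String> validLines(String document) {
        List<String> result = new ArrayList<>();
        Matcher m = VALID_LINE.matcher(document);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical document: lines 1 and 3 are valid, lines 2 and 4 are not.
        String document = "\"A\" 12 \"B  C\"\n"   // double space only inside quotes
                + "\"A\"  12\n"                   // double space between fields
                + "7 \"x\"\r\n"                   // valid, Windows line ending
                + "\"bad\"\t\t5";                 // two tabs between fields
        for (String line : validLines(document)) {
            System.out.println(line);
        }
    }
}
```

Lines that fail the check are simply skipped, so a caller that needs to report rejected lines would have to compare the result against the full list of lines.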