How to check that an input string has a fixed format using a regular expression

I am writing a program to store family members' data.

The input format is as follows:

Country Husband wife child pet

Example input:

Japan ken Annie may money

The user enters the country and the names of the husband, wife, child and pet, separated by spaces. I want to check whether the user input is valid. I tried:

( /^(.+)(\s(.+)){4}$/ ) ? print "good" : print "fail";

But this only checks whether at least five words are entered, rather than exactly five. For example, the input

Japan ken Annie may money hank queen

still passes the check.

Please tell me what I am doing wrong and how to fix it.

CodePudding user response：

On the string Japan ken Annie may money hank queen, your first (.+) matches Japan ken Annie, so the rest of the regular expression can match the four remaining names and the whole pattern succeeds.

The problem is that the dot . also matches spaces.
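To see this for yourself, you can print what the first group captures. This is only an illustrative sketch: the sub name first_group is made up for the demonstration, and it assumes the pattern from the question is ^(.+)(\s(.+)){4}$.

```perl
use strict;
use warnings;

# Hypothetical helper: returns what the first group of the question's
# pattern captures, or undef if the pattern does not match at all.
sub first_group {
    my ($line) = @_;
    return $line =~ /^(.+)(\s(.+)){4}$/ ? $1 : undef;
}

my $line     = 'Japan ken Annie may money hank queen';
my $captured = first_group($line);

# The greedy .+ backtracks only as far as needed, so it keeps three words.
print defined $captured ? "first group: '$captured'\n" : "no match\n";
# prints: first group: 'Japan ken Annie'
```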

A common solution for words separated by spaces (or any other delimiter) is to use this expression:

^ something (?: separator something )quantifier $ # Note: the spaces are only for readability and are not part of the pattern

(where 'something' cannot contain the separator)

So in your case you could write:

^\S+(?:\s+\S+){4}$

Where \S+ means: any non-whitespace character, one or more times.
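As a rough sketch of that pattern in use (the sub name has_five_words and the sample lines are invented for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: true when the line holds exactly five
# whitespace-separated words.
sub has_five_words {
    my ($line) = @_;
    return $line =~ /^\S+(?:\s+\S+){4}$/ ? 1 : 0;
}

for my $line ('Japan ken Annie may money',
              'Japan ken Annie may money hank queen',
              'Japan ken Annie may') {
    # Five words print "good"; seven or four words print "fail".
    printf "%-40s %s\n", $line, has_five_words($line) ? 'good' : 'fail';
}
```

Because \S+ cannot swallow a space, the engine has no way to hide extra words inside one of the five slots.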

Please note that \s matches any whitespace character, including newlines. So if you are reading the file as a whole (instead of line by line), it is better to use \h, which matches only horizontal whitespace characters:

^\S+(?:\h+\S+){4}$

If you use \s and you don't process the content line by line, your regular expression may match data across several lines, which is wrong in your case.

Also, if you are reading the file as a whole, you may need the m modifier, written in either of these forms:

/^\S+(?:\h+\S+){4}$/m

(?m)^\S+(?:\h+\S+){4}$

so that ^ and $ match the beginning and end of each line (instead of the beginning and end of the string).
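The difference between \s and \h shows up as soon as the whole input sits in one string. The following is a minimal sketch under that assumption; the sample text and the sub names valid_lines and loose_matches are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input: a whole "file" held in one string.
# Only the first line has five words; the next two have three and two.
my $text = "Japan ken Annie may money\nChina Sonny Ae-Cha\nBora coin\n";

# With \h and /m, each match must stay on a single line.
sub valid_lines {
    my ($content) = @_;
    return $content =~ /^\S+(?:\h+\S+){4}$/mg;
}

# With \s, a match may run across a newline.
sub loose_matches {
    my ($content) = @_;
    return $content =~ /^\S+(?:\s+\S+){4}$/mg;
}

my @strict = valid_lines($text);
my @loose  = loose_matches($text);
print "with \\h: ", scalar @strict, " match(es)\n";   # 1
print "with \\s: ", scalar @loose,  " match(es)\n";   # 2
```

The second \s match joins "China Sonny Ae-Cha" and "Bora coin" into one five-word record, which is exactly the cross-line behaviour the answer warns about.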

Consider using non-capturing groups (?:...) if you don't plan to capture data.

If you plan to capture all data of the line, you may use this regex instead:

^(\S+)\h+(\S+)\h+(\S+)\h+(\S+)\h+(\S+)$
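One possible way to use the capturing version is to load the five groups into a hash. This is a sketch, not the only way to do it; the sub name parse_family and the lowercase key names are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref holding the five fields,
# or undef (in scalar context) when the line is not exactly five words.
sub parse_family {
    my ($line) = @_;
    # A failed match returns an empty list, so the assignment counts 0.
    my @fields = $line =~ /^(\S+)\h+(\S+)\h+(\S+)\h+(\S+)\h+(\S+)$/
        or return;
    my %family;
    @family{qw(country husband wife child pet)} = @fields;
    return \%family;
}

my $family = parse_family('Japan ken Annie may money');
print "$family->{country}: $family->{husband} and $family->{wife}\n"
    if $family;
# prints: Japan: ken and Annie
```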

CodePudding user response：

The problem is with the ., which also matches the spaces that separate the words. Replacing it with a character class describing what a name consists of (for example letters, numbers, and dashes) would be better.

There might be a better solution, but this is what I've come up with.

/^[\w-]+(?:\s[\w-]+){4}$/

CodePudding user response：

Input data validation can rarely be achieved in one step with a simple regular expression.

Please inspect the following demo code, which allows a country or name to include spaces and dashes, something the regular expression you suggested will not handle properly.

To avoid potential pitfalls, do not use a space as the field separator, since names and countries can include spaces and dashes. Using a comma feels more natural.

use strict;
use warnings;
use feature 'say';

use Data::Dumper;

my $data;
my @header = split(/,/, <DATA>);

chomp @header;

while(my $line = <DATA>) {
    chomp $line;
    my @read = split(/,/,$line);
    say "Warning: $line number of arguments is " . scalar @read
        unless @read == 5;
    $data->@{@header} = @read;
    $data->{$_} =~ /[^a-z -]+/i && say "Warning: '$_ => $data->{$_}' does not look right"
        for @header;
    say Dumper($data);
}

__DATA__
Country,Husband,wife,child,pet
Japan,ken,Annie,may,money
China,Sonny,Ae-Cha,Bora,coin,hummer
South Korea,Sonny2,Ae-Cha,Bora,coin

Output sample

$VAR1 = {
          'Husband' => 'ken',
          'pet' => 'money',
          'child' => 'may',
          'wife' => 'Annie',
          'Country' => 'Japan'
        };

Warning: China,Sonny,Ae-Cha,Bora,coin,hummer number of arguments is 6
$VAR1 = {
          'Husband' => 'Sonny',
          'pet' => 'coin',
          'child' => 'Bora',
          'wife' => 'Ae-Cha',
          'Country' => 'China'
        };

Warning: 'Husband => Sonny2' does not look right
$VAR1 = {
          'Husband' => 'Sonny2',
          'pet' => 'coin',
          'child' => 'Bora',
          'wife' => 'Ae-Cha',
          'Country' => 'South Korea'
        };

CodePudding user response：

Note that if you have a field that may contain a space, splitting the string on spaces is not a good method, as has been exemplified with "What if the country is South Korea?" Polar Bear suggests in his answer using a comma as the separator instead, which would allow South Korea. Other workarounds include quoting words that contain spaces and using a module that can handle quoting, such as Text::ParseWords, which is a core module in Perl.

Using Text::ParseWords:

use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;

my $str = qq("South Korea" Ken Barbie Mario Fido);
my @data = quotewords(" ", 0, $str);
print Dumper \@data;

$VAR1 = [
          'South Korea',
          'Ken',
          'Barbie',
          'Mario',
          'Fido'
        ];

But I think the main issue, counting words, is most suitably handled by splitting the string on spaces and counting the resulting fields. You can do this with quotewords as above, and then insert a test such as:

if (@data == 5) {
    print "Correct number of args";
} elsif (@data < 5) {
    print "Too few args";
} # etc.....

You can also manually split the string:

my @data = split ' ', $str;

A simple way to count with a regex is to match what you want to count, then assign the result to a scalar through an empty list, with a little Perl magic:

my $count = () = $str =~ /\S+/g;  # how many non-whitespace matches do we get?

The empty list () in the assignment puts the regex match into list context, and the list assignment, evaluated in scalar context, returns the number of matches to the scalar on the left.
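Wrapped in a small sub, the counting idiom might look like the sketch below (count_words is a made-up name, and the sample strings are only for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: counts runs of non-whitespace characters.
sub count_words {
    my ($str) = @_;
    # The empty list forces list context on the match; assigning that
    # list assignment to a scalar yields the number of matches.
    my $count = () = $str =~ /\S+/g;
    return $count;
}

for my $str ('Japan ken Annie may money',
             'Japan ken Annie may money hank queen') {
    my $n = count_words($str);
    print "$n words: ", ($n == 5 ? 'good' : 'fail'), "\n";
}
# prints: 5 words: good
#         7 words: fail
```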

But I feel that using a single string data input is not the best way. If you have an exact number of inputs to get, why not get them individually?

use strict;
use warnings;
use Data::Dumper;
use feature 'say';

my