This question is restricted to strings without whitespace that were intentionally written for a human to read.
I don't care about NUL or other characters that one would not find in a piece of text written for human consumption.
Also, I don't care about "pathological" cases such as:
#!/usr/bin/env perl
use strict; use warnings;
use feature 'say';
say 'dog', "\r", 'rat';
say 'a', "\b", 'z';
An answer would be useful, for instance, for generating nicely centered lines of text when the text is not all ASCII.
In the Perl script below, we look first at strings that take up 1 column, then 2 columns, 3 columns, etc.
As we see from running this code, neither the number of bytes nor the length of an array created by splitting a string at \B reliably tells us how many columns a string or a character will take up when printed.
Is there a way to get this number?
#!/usr/bin/env perl
use strict; use warnings;
use feature 'say';
while(<DATA>)
{
say '------------------------------------------';
print;
$_=~s/\s//g;
my@array=split /\B/,$_;
say length $_,' bytes, ',scalar@array,' components';
}
__DATA__
a
é
ø
ü
α
ά
∩
⊃
≈
≠
好
üb
üü
dog
Voß
café
Schwiizertüütsch
CodePudding user response:
To a first approximation, the number of columns a terminal uses to print text is the number of "characters" printed, where "characters" means logical Unicode characters, that is, extended grapheme clusters. (The main exception is wide East Asian characters such as 好, which most terminals draw two columns wide.) So for most text, all that is needed is to break the input into characters in a way that respects Unicode.
We also need to enable Unicode support in the program (this is where the sample program from the question fails). An example:
use warnings;
use strict;
use feature 'say';
use utf8;
use open qw(:std :encoding(UTF-8));
while(<DATA>)
{
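s/\s//g;                # strip the newline and any other whitespace
my @chars = split '';   # split $_ between every code point
my @egc = /(\X)/g;      # one element per extended grapheme cluster
say "$_\t", 0+@chars, " chars (split), ", 0+@egc, " chars (regex, \\X)";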
}
__DATA__
a
é
ø
ü
α
ά
∩
⊃
≈
≠
好
üb
üü
dog
Voß
café
Schwiizertüütsch
The utf8 pragma is there because the source file itself contains Unicode characters, while the open pragma takes care of the standard streams. The \X escape is one way to match a logical character; see also \b{gcb}, which is documented alongside it in perlrebackslash.
The capture (\X) isn't needed here since we want all that's matched, so my @egc = /\X/g; is fine. But it doesn't hurt, and if there is more in the pattern one may need it, so I put the () in.
Please excuse my manners with 0+@ary for array size; I'm trying to fit a line of code within the display width for easier reading. By all means one should use scalar @ary for this.
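The sample data happens to use precomposed characters only, so split and \X agree on every line. A minimal sketch of where they differ, assuming an input that writes é as "e" followed by a combining accent (U+0301); the variable names are mine, and \b{gcb} needs Perl 5.22 or later:

```perl
use strict;
use warnings;
use feature 'say';
use open qw(:std :encoding(UTF-8));

# "e" followed by U+0301 COMBINING ACUTE ACCENT:
# two code points, but one extended grapheme cluster.
my $str = "cafe\x{301}";

my @code_points = split //, $str;          # splits between every code point
my @clusters    = $str =~ /\X/g;           # one element per grapheme cluster
my @by_boundary = split /\b{gcb}/, $str;   # same clusters, found via boundaries

say scalar(@code_points), " code points";        # 5
say scalar(@clusters),    " clusters (\\X)";     # 4
say scalar(@by_boundary), " clusters (\\b{gcb})"; # 4
```

Only the cluster counts match what a reader would call the number of characters in this string.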
With the pragmas above added, the code in the question works well. A statement from the documentation for length is instructive here:
Returns the length in characters of the value of EXPR.
...
Like all Perl character operations, length normally deals in logical characters, not physical bytes.
(original emphasis)
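To make the characters-versus-bytes distinction concrete, here is a small sketch. It assumes the string is already decoded, and it spells é with an escape so the result does not depend on how the source file is saved; the variable names are illustrative only:

```perl
use strict;
use warnings;
use feature 'say';
use Encode qw(encode);

my $word  = "caf\x{e9}";               # "café" as a decoded character string
my $bytes = encode('UTF-8', $word);    # the same text as raw UTF-8 bytes

say length($word),  ' characters';     # 4
say length($bytes), ' bytes';          # 5, since é takes two bytes in UTF-8
```

The question's program sees the second number because, without the pragmas, it reads and measures undecoded bytes.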
The program above prints for me:
a	1 chars (split), 1 chars (regex, \X)
é	1 chars (split), 1 chars (regex, \X)
ø	1 chars (split), 1 chars (regex, \X)
ü	1 chars (split), 1 chars (regex, \X)
α	1 chars (split), 1 chars (regex, \X)
ά	1 chars (split), 1 chars (regex, \X)
∩	1 chars (split), 1 chars (regex, \X)
⊃	1 chars (split), 1 chars (regex, \X)
≈	1 chars (split), 1 chars (regex, \X)
≠	1 chars (split), 1 chars (regex, \X)
好	1 chars (split), 1 chars (regex, \X)
üb	2 chars (split), 2 chars (regex, \X)
üü	2 chars (split), 2 chars (regex, \X)
dog	3 chars (split), 3 chars (regex, \X)
Voß	3 chars (split), 3 chars (regex, \X)
café	4 chars (split), 4 chars (regex, \X)
Schwiizertüütsch	16 chars (split), 16 chars (regex, \X)
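One caveat on this output: 好 counts as one character, yet most terminals draw it two columns wide. Below is a rough sketch of a column estimate that accounts for this, using only core Perl. The helpers display_width and center are hypothetical names of my own, and the rule (two columns for a cluster that starts with an East Asian Wide or Fullwidth character, one column otherwise) is an assumption that real terminals and fonts may not follow exactly:

```perl
use strict;
use warnings;
use feature 'say';
use open qw(:std :encoding(UTF-8));

# Hypothetical helper: estimate the terminal columns of a decoded string.
# Assumption: one column per grapheme cluster, two if the cluster starts
# with an East Asian Wide or Fullwidth character.
sub display_width {
    my ($str) = @_;
    my $width = 0;
    for my $cluster ($str =~ /\X/g) {
        $width += $cluster =~ /^[\p{East_Asian_Width=Wide}\p{East_Asian_Width=Fullwidth}]/ ? 2 : 1;
    }
    return $width;
}

# Hypothetical helper: left-pad a string so it looks centered in $columns.
sub center {
    my ($str, $columns) = @_;
    my $pad = int(($columns - display_width($str)) / 2);
    $pad = 0 if $pad < 0;
    return (' ' x $pad) . $str;
}

say display_width("\x{597d}");      # 好   -> 2
say display_width("Vo\x{df}");      # Voß -> 3
say center("\x{597d}", 10), '|';    # four spaces, then 好, then |
```

For anything beyond a quick estimate, a dedicated module that implements the Unicode width rules is likely a better choice than this sketch.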