How to extract 2 character words from string in perl

Time:03-28

I assume some sort of regex would be used to accomplish this?

I need to extract every word that is 2 or more characters long, starts with a letter, and whose remaining characters are letters, digits, or underscores.

This is the code I currently have, although it isn't very close to my desired output:

while (my $line=<>) {
  # remove leading and trailing whitespace
  $line =~ s/^\s+|\s+$//g;
  $line = lc $line;
  @array = split / /, $line;
  foreach my $a (@array){
    $a =~ s/[\$#@~!&*()\[\];.,:?^ `\\\/]+//g;
    push(@list, "$a");
  }
}

A sample input would be:

#!/usr/bin/perl
use strict;
# This line will print a hello world line.
print "Hello world!\n";
exit 0;

And the desired output would be (alphabetical order):

bin
exit 
hello
hello
line
perl
print
print
strict
this
use
usr
will
world

CodePudding user response:

my @matches = $string =~ /(\b[a-z][a-z0-9_]+)/ig;

If case-insensitive matching needs to apply only to a subpattern, the modifier can be embedded in it:

/... ((?i)\b[a-z][a-z0-9_]+) .../

(or, it can be turned off after the subpattern with (?-i))
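To illustrate the scoping, here is a minimal sketch. The id= prefix and the sample strings are made up for the example; only the part inside the capture group is matched case-insensitively.

```perl
use strict;
use warnings;

# (?i) switches on case-insensitivity from that point to the end of the
# enclosing group, so the literal "id=" in front must still be lowercase.
my $re = qr/id=((?i)[a-z][a-z0-9_]+)/;

for my $s ('id=Foo_1', 'ID=Foo_1') {
    if ($s =~ $re) {
        print "$s: captured $1\n";
    }
    else {
        print "$s: no match\n";
    }
}
```

The first string matches and captures Foo_1; the second does not match, because the uppercase ID= lies outside the group where (?i) is in effect.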

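The class [a-zA-Z0-9_] can be written as \w, a "word character", if that's indeed exactly what is needed (note that \w also matches non-ASCII letters and digits unless the /a modifier is used).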


Note that the above regex picks out words as required without first splitting the line on spaces, as the shown program does. It can be applied to the whole line (or to the whole text, for that matter), perhaps after stripping the various special characters as shown.
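As a rough sketch of applying the regex to a whole text at once: the helper name extract_words and the sample text are only for illustration, and lowercasing first replaces the /i modifier since the desired output is lowercase anyway.

```perl
use strict;
use warnings;

# Hypothetical helper: returns, sorted, every word of 2+ characters that
# starts with a letter and continues with letters, digits, or underscores.
sub extract_words {
    my ($text) = @_;
    my @words  = lc($text) =~ /\b([a-z][a-z0-9_]+)/g;
    my @sorted = sort @words;
    return @sorted;
}

# Single-quoted heredoc, so \n stays a literal backslash followed by n.
my $text = <<'END';
use strict;
print "Hello world!\n";
exit 0;
END

print "$_\n" for extract_words($text);
```

The lone n after the backslash and the 0 are skipped, since one is too short and the other does not start with a letter.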

There is the question of other cases: what about hyphens, apostrophes, tildes? Those aren't found in identifiers, and this appears to be intended for processing program text, but comments are included, so what other legitimate characters might there be?


Note that the shown split / /, $line splits on exactly one space. Better is split /\s+/, $line or, better yet, split's special pattern split ' ', $line, which splits on any amount of any whitespace and ignores leading and trailing whitespace.
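The difference between the three forms shows up on a line with leading whitespace and mixed separators. A small sketch, with a made-up sample line:

```perl
use strict;
use warnings;

# Two leading spaces, three spaces in the middle, a tab, two trailing spaces.
my $line = "  print   \"Hello\tworld!\";  ";

my @single = split / /,   $line;  # every single space separates: empty fields appear
my @spaces = split /\s+/, $line;  # runs of whitespace, but a leading empty field remains
my @awk    = split ' ',   $line;  # special case: leading whitespace is skipped as well

for my $case ([ '/ /' => \@single ], [ '/\s+/' => \@spaces ], [ "' '" => \@awk ]) {
    my ($name, $fields) = @$case;
    printf "%-6s %d fields: %s\n", $name, scalar @$fields,
        join ' ', map { "[$_]" } @$fields;
}
```

In all three cases split drops trailing empty fields by default; only the ' ' form also gets rid of the empty field at the front.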

CodePudding user response:

Curiously, this can be done with one of those long, typical Perl one-liners:

$ perl -lwe'print for sort grep /^\pL/ && length > 1, map { split /\W+/ } map lc, <>' a.txt
bin
exit
hello
hello
line
line
perl
print
print
strict
this
use
usr
will
world
world

Let's go through that and see what we can learn. This line reads from right to left.

  • a.txt is the argument file to read
  • <> is the diamond operator, reading the lines from the file. Since this is list context, it will exhaust the file handle and return all the lines.
  • map lc, short for map { lc($_) }, will apply the lc function to all the lines and return the result.
  • map { split /\W+/ } is a multi-purpose operation. It will remove the unwanted characters (the non-word characters), and also split the line there, and return a list of all those words.
  • grep /^\pL/ && length > 1 keeps only the strings that begin with a letter (\pL) and are longer than 1 character, and returns them.
  • sort sorts the list coming in from the right alphabetically and passes it on to the left.
  • for is a for-loop, applied to the incoming list, in the post-fix style.
  • print is short for print $_, and it will print once for each list item in the for loop.
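  • The -l switch in the perl command will "fix" line endings for us: it adds a newline to each print, and when combined with -n or -p it also removes them from the input. Here the input newlines are removed by the split on \W+ instead. This will make the print pretty at the end.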
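The same steps can be spelled out as a small script, one stage per statement. This is only a sketch of the pipeline above: the sub name sorted_words and the hard-coded input lines are made up for the example, and in a real program the lines would come from <>.

```perl
use strict;
use warnings;

# Hypothetical helper: the one-liner's stages, read top to bottom
# instead of right to left.
sub sorted_words {
    my @lines  = @_;
    my @lower  = map { lc $_ } @lines;                        # lowercase every line
    my @tokens = map { split /\W+/ } @lower;                  # split on runs of non-word characters
    my @words  = grep { /^\pL/ && length($_) > 1 } @tokens;   # starts with a letter, 2+ characters
    my @sorted = sort @words;
    return @sorted;
}

my @input = (
    "use strict;\n",
    "# This line will print a hello world line.\n",
    "print \"Hello world!\\n\";\n",
    "exit 0;\n",
);

print "$_\n" for sorted_words(@input);
```

Naming the intermediate lists makes it easier to print one of them while debugging, at the cost of the one-liner's brevity.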

I won't say this will produce a perfect result, but you should be able to pick up some techniques to finish your own program.
