I have a file with words and I need to print only the lines that are less than or equal to 4 characters but I'm having trouble with my code. There is other text on the end of the lines but I shortened it for here.
file:
John Doe
Jane Doe
Mark Smith
Abigail Smith
Bill Adams
What I want to do is print the names that have 4 characters or fewer.
What I've tried:
awk '$1 <= 4 {print $1}' inputfile
What I'm hoping to get:
John
Jane
Mark
Bill
So far, I've got nothing. Either it prints out everything, with no length restrictions or it doesn't even print anything at all. Could someone take a look at this and see what they think? Thanks
CodePudding user response:
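First, let's understand why
awk '$1 <= 4 {print $1}' inputfile
does not do what you want. $1 <= 4 compares the value of the first field with 4, not its length. A field such as John does not look like a number, so GNU AWK treats it as a string, converts 4 to the string "4" and compares the two as strings; John sorts after 4, so the condition is false and nothing is printed. If the field is forced into a numeric context instead, for example with $1+0 <= 4, the string has to be converted to a number first. As the GNU AWK manual section Strings And Numbers puts it
A string is converted to a number by interpreting any numeric prefix of the string as numerals(...)Strings that can’t be interpreted as valid numbers convert to zero.
Therefore the numeric value of John from GNU AWK's point of view is zero, the condition 0 <= 4 holds for every line, and the whole inputfile is printed.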
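The conversion rule quoted above can be checked directly. This is a minimal sketch; to_number is a hypothetical helper name introduced only for this illustration, and adding 0 is used to force awk to treat the string as a number.

```shell
# Hypothetical helper: print the numeric value awk assigns to a string.
# "s + 0" forces the string into a numeric context.
to_number() {
    awk -v s="$1" 'BEGIN { print s + 0 }'
}

to_number John     # no numeric prefix, prints 0
to_number 12abc    # numeric prefix "12", prints 12
```

A name never has a numeric prefix, so its numeric value is always 0 regardless of how long it is, which is why the length has to be asked for explicitly.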
To get the desired output you can use the length function, which returns the number of characters in its argument, as follows
awk 'length($1)<=4{print $1}' inputfile
or, alternatively, a regular expression that matches from 0 to 4 characters, that is
awk '$1~/^.{0,4}$/{print $1}' inputfile
where $1~ means check whether the 1st field matches, . denotes any character, {0,4} means from 0 to 4 repetitions, ^ is the beginning of the string and $ is the end of the string (these two are required, as otherwise longer strings would also match, since they contain a substring matching .{0,4}).
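To see why the anchors matter, here is a small sketch comparing an unanchored and an anchored pattern. As an assumption for portability, it writes .{0,4} as the equivalent .?.?.?.? so that it also runs on awk implementations without interval-expression support; match_unanchored and match_anchored are hypothetical helper names.

```shell
# Hypothetical helpers: feed one word to awk and print it if the pattern matches.
# .?.?.?.? is used here as an interval-free stand-in for .{0,4}.
match_unanchored() { printf '%s\n' "$1" | awk '$1 ~ /.?.?.?.?/ {print $1}'; }
match_anchored()   { printf '%s\n' "$1" | awk '$1 ~ /^.?.?.?.?$/ {print $1}'; }

match_unanchored Abigail   # prints Abigail: a short substring is enough to match
match_anchored Abigail     # prints nothing: the whole field must fit the pattern
match_anchored John        # prints John
```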
Both commands, for an inputfile containing
John Doe
Jane Doe
Mark Smith
Abigail Smith
Bill Adams
give the output
John
Jane
Mark
Bill
(tested in gawk 4.2.1)
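If you want to reproduce this without an existing file, the following sketch recreates the sample input and runs the length-based filter. The temporary file from mktemp and the variable names inputfile and result are assumptions made for this example only.

```shell
# Recreate the sample input in a temporary file (path chosen by mktemp).
inputfile=$(mktemp)
printf '%s\n' 'John Doe' 'Jane Doe' 'Mark Smith' 'Abigail Smith' 'Bill Adams' > "$inputfile"

# Keep only first fields of at most 4 characters.
result=$(awk 'length($1) <= 4 { print $1 }' "$inputfile")
printf '%s\n' "$result"

# Clean up the temporary file.
rm -f "$inputfile"
```

Because only $1 is tested and printed, any extra text at the end of each line does not affect the result.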