Consider the following AWK statement:
REGEX='^\s*1'
cat INPUTFILE | awk -v "REG"="$REGEX" '$1~REG{print $0}' > OUTPUTFILE
This reads the file INPUTFILE and prints the lines matching the pattern REGEX (lines that start with any number of blank spaces followed by the number 1).
The code runs fine and I am able to get the correct result. However, I am receiving the following error:
awk: warning: escape sequence `\s' treated as plain `s'
I am wondering why I am getting this error and how I can fix it? If the escape sequence \s was treated as s, then the pattern should be ^s*1 which should denote something else, right?
CodePudding user response:
Warning: This answer concerns solely GNU AWK.
First, be aware that there is a subtle difference between using a regexp constant (/.../) and a string ("..."): you need to double each \ when using the latter, but not the former. In this case:
REGEX='^\\s*1'
awk -v "REG"="$REGEX" '$1~REG{print $0}' INPUTFILE > OUTPUTFILE
Note that GNU AWK can read the file by itself, so using cat here is unnecessary. If you want to know more, read Computed Regexps (The GNU Awk User's Guide).
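The difference between a regexp constant and a string can be seen with an escape that every awk understands. The following is a minimal sketch; the sample input and the variable names (sample, from_const, from_string, from_var) are made up for illustration. It uses \. rather than \s so that it does not depend on GNU extensions.

```shell
# Sample input: only the first line contains a literal "a.b".
sample=$(printf 'a.b\naxb\n')

# Regexp constant: a single backslash is enough to escape the dot.
from_const=$(printf '%s\n' "$sample" | awk '$0 ~ /^a\.b$/')

# String used as a regexp: the string is escape-processed first,
# so the backslash has to be doubled to survive as "\.".
from_string=$(printf '%s\n' "$sample" | awk '$0 ~ "^a\\.b$"')

# A value passed with -v goes through the same string escape processing.
from_var=$(printf '%s\n' "$sample" | awk -v re='^a\\.b$' '$0 ~ re')

# Each of the three commands selects only the line "a.b".
printf '%s\n' "$from_const" "$from_string" "$from_var"
```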
CodePudding user response:
I'm putting aside the use of \s in an awk regex and just explaining the warning.
awk [-F sepstring] [-v assignment]... program [argument...]
Here's what POSIX says about the assignment of a variable with -v:
[...]. The characters following the = shall be interpreted as if they appeared in the awk program preceded and followed by a double-quote (") character, [...].
It means that awk -v var='\s' 'BEGIN{}' is equivalent to awk 'BEGIN{var = "\s"}', so, with \s not being a valid C-style escape sequence, the \s is interpreted as s.
If you want the content of var to be \s literally, then you'll have to write awk -v var='\\s' ...
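This also explains why the original command seemed to work: once the backslash is dropped the pattern is ^s*1, and s* matches zero characters, so fields starting with 1 still match. A rough sketch of both effects follows; the sample lines and the variable names val and hits are invented for the example, and the degraded pattern is written out by hand so that the result does not depend on which awk is installed.

```shell
# Doubling the backslash leaves one literal backslash in the awk variable.
val=$(awk -v var='\\s' 'BEGIN { print var }')
printf '%s\n' "$val"    # \s

# The pattern as it ends up after the backslash is dropped: "^s*1".
# It still matches fields starting with 1, but it also matches "s1".
hits=$(printf '  1 a\ns1 b\n2 c\n' | awk '$1 ~ "^s*1"')
printf '%s\n' "$hits"   # prints the lines "  1 a" and "s1 b"
```

The second command shows the hidden cost of ignoring the warning: a line whose first field is s1 is selected as well.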
CodePudding user response:
This would be correctly written as:
regex='^[[:space:]]*1'
<INPUTFILE awk -v "reg=$regex" '$1~reg{print $0}' >OUTPUTFILE
Note:
- We're using [[:space:]] instead of \s. This is the only POSIX-standard way to write a character class; \s is a PCRE extension adopted by only some tools, e.g. GNU versions of sed, grep, and awk, for use in their EREs.
- We're using a lower-case name for the shell variable, regex, storing the regex. All-caps names are used for variables that reflect or modify operating system or shell behavior; see the relevant POSIX specification, keeping in mind that shell and environment variables share a single namespace (setting a shell variable overwrites any like-named environment variable, even if export is not used).
- We're using a lower-case name for the awk variable, reg, populated from the shell variable regex, so that it cannot clash with any of the builtin variables that awk provides. You should never use all upper-case user-defined variables in awk.
- We're using <inputfile instead of cat inputfile |; this gives awk a direct handle on the file, instead of making awk read the output of a separate program that's reading the file. For awk it's a fairly small difference; but for a tool like tail, wc -c, or even sort that can be implemented more efficiently with a seekable handle, the improvement from using a redirection instead of cat can be huge.
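The point about upper-case awk variable names can be checked with a small experiment. This is only a sketch: the input line and the names clobbered and safe are hypothetical, and FS is picked as one example of a builtin that is easy to overwrite by accident.

```shell
# Assigning to FS with -v replaces awk's builtin field separator,
# so the whole line "a b" becomes the first field.
clobbered=$(printf 'a b\n' | awk -v FS=x '{ print $1 }')

# A lower-case name is just a user variable; default splitting still applies.
safe=$(printf 'a b\n' | awk -v fs=x '{ print $1 }')

printf '[%s] [%s]\n' "$clobbered" "$safe"    # [a b] [a]
```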
CodePudding user response:
AWK already splits fields on blank characters. So you could probably just use:
regex='^1' awk '$1 ~ ENVIRON["regex"]'
Or, to match any whitespace character (using this only makes sense if for some reason $1 includes \f, \r or \v chars):
# GNU awk supports '\s' if it's not within a bracket expression
regex='^\s*1' awk '$1 ~ ENVIRON["regex"]'
# Or POSIXly using the [:space:] class in a bracket expression:
regex='^[[:space:]]*1' awk '$1 ~ ENVIRON["regex"]'
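Two facts this answer relies on can be verified directly: default field splitting already strips leading blanks from $1, and values read from ENVIRON are not escape-processed the way -v assignments are. The snippet below is a sketch with made-up sample data and variable names (first, env_len, opt_len); it uses \t because that escape is understood by every awk.

```shell
# With the default FS, leading blanks never end up in $1.
first=$(printf '   1 foo\n' | awk '{ print "[" $1 "]" }')

# ENVIRON hands the value over untouched: a, backslash, t, b = 4 characters.
env_len=$(var='a\tb' awk 'BEGIN { print length(ENVIRON["var"]) }')

# -v processes escapes: "\t" becomes a single tab, so a, TAB, b = 3 characters.
opt_len=$(awk -v var='a\tb' 'BEGIN { print length(var) }')

printf '%s %s %s\n' "$first" "$env_len" "$opt_len"    # [1] 4 3
```

Because ENVIRON leaves backslashes alone, a regex such as '^\s*1' reaches the matcher unchanged and no backslash doubling is needed.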