Why am I getting the following in AWK: awk: warning: escape sequence `\s' treated as plain `s'

Consider the following AWK statement:

REGEX='^\s*1'
cat INPUTFILE | awk -v "REG"="$REGEX" '$1~REG{print $0}' > OUTPUTFILE

This reads the file INPUTFILE and prints the lines matching the pattern REGEX (lines that start with any number of blank spaces followed by the number 1).

The code runs fine and I get the correct result. However, I am receiving the following warning:

awk: warning: escape sequence `\s' treated as plain `s'

Why am I getting this warning, and how can I fix it? If the escape sequence \s were treated as s, the pattern would be ^s*1, which should denote something else, right?

CodePudding user response：

Warning: This answer concerns solely GNU AWK.

First, be aware that there is a subtle difference between a regexp constant (/.../) and a string ("..."): you need to double each \ when using the latter, but not when using the former. In this case:

REGEX='^\\s*1'
awk -v "REG"="$REGEX" '$1~REG{print $0}' INPUTFILE > OUTPUTFILE

Note that GNU AWK can read the file itself, so using cat here is unnecessary. If you want to know more, read Computed Regexps (The GNU Awk User's Guide).
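As a minimal sketch of that difference, the two commands below express the same match, once as a regexp constant and once as a string. The sample input and the pattern a\.b are illustrative assumptions, not taken from the question; a portable escape is used instead of \s so the snippet does not depend on GNU AWK.

```shell
# Illustrative input (an assumption, not from the question).
sample=$(printf '%s\n' 'a.b' 'axb')

# Regexp constant: a single backslash is enough to escape the dot.
from_constant=$(printf '%s\n' "$sample" | awk '$0 ~ /a\.b/')

# String used as a dynamic regexp: awk processes string escapes first,
# so the backslash must be doubled to reach the regexp engine.
from_string=$(printf '%s\n' "$sample" | awk '$0 ~ "a\\.b"')

printf 'constant: %s\nstring:   %s\n' "$from_constant" "$from_string"
```

Both forms should select only the line containing a literal dot.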

CodePudding user response：

I'm putting aside the use of \s in an awk regex and just explaining the warning.

awk [-F sepstring] [-v assignment]... program [argument...]

Here's what POSIX says about the assignment of a variable with -v:

[...]. The characters following the = shall be interpreted as if they appeared in the awk program preceded and followed by a double-quote (") character, [...].

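This means that awk -v var='\s' 'BEGIN{}' is equivalent to awk 'BEGIN{var = "\s"}'. Since \s is not a valid C-style escape sequence, the backslash is dropped, with a warning, and \s is interpreted as s. The pattern in the question therefore really is ^s*1; it still selects the expected lines because s* can match zero characters and, with default field splitting, $1 never begins with a blank.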

If you want the content of var to be \s literally then you'll have to write awk -v var='\\s' ...
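The rule quoted above can be checked with a small sketch. The variable names via_v, via_env and doubled are hypothetical, and \t is used for the first two cases because, unlike \s, it is a valid escape sequence in every awk.

```shell
# -v: the value is treated like the inside of a "..." string literal,
# so \t is converted to a real tab character.
via_v=$(awk -v var='a\tb' 'BEGIN { print var }')

# ENVIRON: the value is passed through verbatim, with no escape processing.
via_env=$(var='a\tb' awk 'BEGIN { print ENVIRON["var"] }')

# Doubling the backslash with -v yields a single literal backslash.
doubled=$(awk -v var='\\s' 'BEGIN { print var }')

printf '%s\n' "$via_v" "$via_env" "$doubled"
```

The first line printed contains a tab, the second contains a backslash followed by t, and the third is a backslash followed by s.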

CodePudding user response：

This would be correctly written as:

regex='^[[:space:]]*1'
<INPUTFILE awk -v "reg=$regex" '$1~reg{print $0}' >OUTPUTFILE

Note:

We're using [[:space:]] instead of \s. This is the only POSIX-standard way to write a character class; \s is a PCRE extension adopted by only some tools, e.g. GNU versions of sed, grep, and awk, for use in their EREs.
We're using a lower-case name for the shell variable, regex, storing the regex. All-caps names are used for variables that reflect or modify operating system or shell behavior; see the relevant POSIX specification, keeping in mind that shell and environment variables share a single namespace (setting a shell variable overwrites any like-named environment variable, even if export is not used).
We're using a lower-case name for the awk variable, reg, populated from the shell variable regex, to avoid clashing with any of the built-in variables that awk provides. You should never use all upper-case user-defined variables in awk.
We're using <INPUTFILE instead of cat INPUTFILE |; this gives awk a direct handle on the file, instead of making awk read the output of a separate program that's reading the file. For awk it's a fairly small difference, but for a tool like tail, wc -c, or even sort, which can be implemented more efficiently with a seekable handle, the improvement from using a redirection instead of cat can be huge.
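To try the command above end to end, here is a small sketch that runs it against a temporary file. The sample lines and the names input and result are assumptions for illustration; it also assumes an awk that supports POSIX character classes such as [[:space:]] (very old mawk releases do not).

```shell
# Hypothetical sample data written to a temporary file.
input=$(mktemp)
printf '%s\n' '1 first' '  12 second' '2 third' 'x1 fourth' > "$input"

regex='^[[:space:]]*1'

# Same shape as the command above: redirection instead of cat,
# lower-case variable names, and a bracket expression instead of \s.
result=$(awk -v "reg=$regex" '$1 ~ reg' < "$input")

rm -f "$input"
printf '%s\n' "$result"
```

Only the lines whose first field starts with 1 should be printed, with their original leading blanks intact.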

CodePudding user response：

AWK already splits fields on blank characters. So you could probably just use:

regex='^1' awk '$1 ~ ENVIRON["regex"]'

Or, to match any whitespace character (using this only makes sense if for some reason $1 includes \f \r or \v chars):

# GNU awk supports '\s' if it's not within a bracket expression
regex='^\s*1' awk '$1 ~ ENVIRON["regex"]'
# Or POSIXly using the [:space:] class in a bracket expression:
regex='^[[:space:]]*1' awk '$1 ~ ENVIRON["regex"]'
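The claim at the start of this answer, that default field splitting already discards leading blanks, can be checked with a sketch like the following. The sample input is made up for illustration, and the regex is passed through the environment exactly as in the first command of this answer.

```shell
# Made-up input: two lines have leading blanks before the first field.
matched=$(printf '%s\n' '  1 apple' '2 banana' ' 10 cherry' '21 date' |
  regex='^1' awk '$1 ~ ENVIRON["regex"]')

printf '%s\n' "$matched"
```

Because $1 never contains the leading blanks, the plain pattern ^1 is enough to select the indented lines as well.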