awk match by variable with dot in it

Time:06-29

I have a script that iterates over a file containing domains (google.com, youtube.com, etc.). Its purpose is to count how many times each domain appears in the 12th column of a tab-separated values file.

while read domain; do
    awk -F '\t' '$12 == '$domain'' data.txt | wc -l
done < domains.txt

However, awk seems to be interpreting the dots in the domains as special characters. The following error message is shown:

awk: syntax error at source line 1
context is
        $12 ~ >>>  google. <<< com
awk: bailing out at source line 1

I am a beginner in bash so any help would be greatly appreciated!

CodePudding user response:

When you write:

domain='google.com'
awk -F '\t' '$12 == '$domain'' data.txt

the $domain is outside of any quotes:

awk -F '\t' '$12 == '$domain'      ' data.txt
            <       >       <      >
            start   end     start  end

and so exposed to the shell for interpretation first and THEN it becomes part of the body of the awk script before awk sees it. So what awk sees is:

awk -F '\t' '$12 == google.com' data.txt

and google.com is not a valid symbol (e.g. variable or function) name nor string nor number. What you MEANT to do was:

awk -F '\t' '$12 == "'"$domain"'"' data.txt

so the shell would see "$domain" instead of just $domain (see https://mywiki.wooledge.org/Quotes for why that's important) and awk would finally see:

awk -F '\t' '$12 == "google.com"' data.txt

which is fine, as "google.com" is now a string rather than a symbol. However, you should never let shell variables expand into the body of an awk script, as there are other caveats. What you should really have done is:

awk -F '\t' -v dom="$domain" '$12 == dom' data.txt

See How do I use shell variables in an awk script? for more information.
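As a minimal sketch of the -v approach on concrete input (the temporary file and its contents are made up for illustration, and the count is done inside awk rather than with wc -l):

```shell
# Hypothetical sample data: 12 tab-separated columns, domain in column 12.
tmp=$(mktemp)
printf 'a\tb\tc\td\te\tf\tg\th\ti\tj\tk\t%s\n' \
    google.com youtube.com google.com > "$tmp"

domain='google.com'
# dom is an awk variable holding a string, so the dot is just a character.
count=$(awk -F '\t' -v dom="$domain" '$12 == dom { n++ } END { print n+0 }' "$tmp")
echo "$domain $count"
rm -f "$tmp"
```

Because the value arrives as data rather than as program text, nothing in it can be mistaken for awk syntax.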

By the way, even after fixing the above problem do not do this:

while read domain; do
    awk -F '\t' -v dom="$domain" '$12 == dom' data.txt | wc -l
done < domains.txt

as it'll be immensely slow and contains insidious bugs (see "Why is using a shell loop to process text considered bad practice?"). Do something like this instead (untested):

awk -F'\t' '
    NR==FNR {
        cnt[$1] = 0
        next
    }
    $12 in cnt {
        cnt[$12]++
    }
    END {
        for ( dom in cnt ) {
            print dom, cnt[dom]
        }
    }
' domains.txt data.txt

That will be far more efficient, robust, and portable than calling awk inside a shell read loop.
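To see how the two-file approach behaves, here is a small sketch run against made-up inputs (the temporary directory and file contents are assumptions for illustration only). The output is piped through sort because awk does not guarantee any order for a for-in loop:

```shell
# Hypothetical sample inputs, created only for this illustration.
dir=$(mktemp -d)
printf '%s\n' google.com youtube.com example.org > "$dir/domains.txt"
printf '1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t%s\n' \
    google.com youtube.com google.com other.net > "$dir/data.txt"

result=$(awk -F '\t' '
    NR==FNR { cnt[$1] = 0; next }      # first file: register each domain
    $12 in cnt { cnt[$12]++ }          # second file: count exact matches
    END { for (dom in cnt) print dom, cnt[dom] }
' "$dir/domains.txt" "$dir/data.txt" | sort)
printf '%s\n' "$result"
rm -rf "$dir"
```

Domains that never occur in data.txt are still reported with a count of 0, and values such as other.net that are not in domains.txt are ignored.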

See What are NR and FNR and what does "NR==FNR" imply? for how that awk script works. Get the book Effective AWK Programming, 5th Edition, by Arnold Robbins to learn awk.

CodePudding user response:

awk -F '\t' '$12 == '$domain'' data.txt | wc -l

The single quotes are building an awk program. They are not something visible to awk. So awk sees this:

$12 == google.com

Since there aren't any quotes around google.com, that is a syntax error. You just need to add quotation marks.

awk -F '\t' '$12 == "'"$domain"'"'  data.txt

The quotes jammed together like that are a little confusing, but it's just this:

 '....'    stuff to send to awk. Single quotes are for the shell.
 '..."...' a double quote inside the awk program for awk to see
 '...'"..."  stuff in double quotes _outside_ the awk program for the shell

We can combine those like this:

 '..."'"$var"'"...'  

That's a bunch of literal awk code ending in a double quote, followed by the expansion of the shell parameter var (double-quoted as usual in the shell for safety), followed by more literal awk code starting with a double quote. The end result is a string passed to awk that includes the value of var inside double quotes.
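One way to check what awk will actually receive is to build the same string in the shell and print it. This is only a sketch; the variable name program is introduced here for illustration:

```shell
domain='google.com'
# Three pieces glued together by the shell:
#   '$12 == "'   literal awk code ending in a double quote
#   "$domain"    the shell expansion, double-quoted
#   '"'          a closing double quote for awk to see
program='$12 == "'"$domain"'"'
printf '%s\n' "$program"
```

The printed text is the awk program exactly as awk would see it, with the domain wrapped in double quotes.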

But you don't have to be so fancy or confusing since awk provides the -v option to set variables from the shell:

awk -F '\t' -v domain="$domain" '$12 == domain' data.txt

Since domain is not quoted inside the awk code, it is interpreted as the name of a variable. (Periods are not legal in variable names, which is why you got a syntax error with your domains. Had a domain been a valid name, awk would instead have treated it as an unset, empty variable and looked for lines whose twelfth field was likewise blank.)
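The difference between a bare name and a quoted string can be seen with a value that happens to be a valid awk identifier. The sketch below uses a made-up two-line input whose second line has an empty twelfth field:

```shell
# Hypothetical input: line 1 has "localhost" in column 12, line 2 has it empty.
tmp=$(mktemp)
printf '1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t%s\n' localhost '' > "$tmp"

# Unquoted: awk reads localhost as an unset variable, i.e. an empty string.
unquoted=$(awk -F '\t' '$12 == localhost { print NR }' "$tmp")
# Quoted: a string literal, compared character by character.
quoted=$(awk -F '\t' '$12 == "localhost" { print NR }' "$tmp")
echo "unquoted matched line $unquoted, quoted matched line $quoted"
rm -f "$tmp"
```

The unquoted form silently matches the line with the blank field instead of raising an error, which is why this mistake can go unnoticed.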

CodePudding user response:

Use a combination of cut to print the 12th column of the TAB-delimited file, sort and uniq to count the items:

cut -f12 data.txt | sort | uniq -c
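Note that this pipeline counts every distinct value in the column, not only the domains listed in domains.txt. If only the listed domains are wanted, one possible variation is to filter with grep first. The sketch below assumes made-up sample files and treats each line of domains.txt as a fixed string to match in full:

```shell
# Hypothetical sample inputs, created only for this illustration.
dir=$(mktemp -d)
printf '%s\n' google.com youtube.com > "$dir/domains.txt"
printf '1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t%s\n' \
    google.com youtube.com google.com other.net > "$dir/data.txt"

# -F: fixed strings (dots are literal), -x: whole-line match, -f: patterns from a file.
counts=$(cut -f12 "$dir/data.txt" | grep -Fxf "$dir/domains.txt" | sort | uniq -c)
printf '%s\n' "$counts"
rm -rf "$dir"
```

Unlike the awk version, domains that never appear in data.txt produce no output line at all rather than a count of 0.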