AWK remove query params from URL

Time:05-13

I have an access.log file with more than 1 million lines. An example line:

113.10.154.38 - - [27/May/2016:03:36:26 +0200] "POST /index.php?option=com_jce&task=plugin&plugin=imgmanager&file=imgmanager&method=form&cid=20&6bc427c8a7981f4fe1f5ac65c1246b5f=cf6dd3cf1923c950586d0dd595c8e20b HTTP/1.1" 200 22 "-" "BOT/0.1 (BOT for JCE)" "-"

I need to parse the log lines to count the 10 most common URLs, but I need to remove the query params from each URL first. Without handling query params, I wrote this code:

awk '{print $7}' test.log | sort | uniq -c | sort -rn | \
head | awk '{print NR,"\b. URL:", $2,"\n   Requests:", $1}'

But I don't know how to remove the query params and count the top 10 most common URLs without them, so that I get a clean ranking of requests.

CodePudding user response:

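Use the sub() function to remove a pattern from a string. The regex \?.* matches everything from the first "?" to the end of the field, and replacing it with an empty string leaves only the path.

Do this at the point where you extract the field, before sorting and counting unique values, so that URLs differing only in their query params are counted together.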

awk '{sub(/\?.*/, "", $7); print $7}' test.log | sort | uniq -c | sort -rn | ...
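For completeness, here is a minimal sketch of how the whole pipeline might look with the stripping step in place. The file name sample.log and the six log lines in it are made up for illustration (they only mimic the format from the question), and counting inside awk with an array is a swapped-in alternative to sort | uniq -c, not something the question requires.

```shell
# Made-up sample data in the same format as the line from the question.
cat > sample.log <<'EOF'
113.10.154.38 - - [27/May/2016:03:36:26 +0200] "POST /index.php?option=com_jce&task=plugin HTTP/1.1" 200 22 "-" "BOT/0.1 (BOT for JCE)" "-"
113.10.154.38 - - [27/May/2016:03:36:27 +0200] "GET /index.php?cid=20 HTTP/1.1" 200 22 "-" "BOT/0.1 (BOT for JCE)" "-"
113.10.154.38 - - [27/May/2016:03:36:28 +0200] "GET /index.php HTTP/1.1" 200 22 "-" "BOT/0.1 (BOT for JCE)" "-"
113.10.154.38 - - [27/May/2016:03:36:29 +0200] "GET /about.html?ref=home HTTP/1.1" 200 22 "-" "BOT/0.1 (BOT for JCE)" "-"
113.10.154.38 - - [27/May/2016:03:36:30 +0200] "GET /about.html HTTP/1.1" 200 22 "-" "BOT/0.1 (BOT for JCE)" "-"
113.10.154.38 - - [27/May/2016:03:36:31 +0200] "GET /robots.txt HTTP/1.1" 200 22 "-" "BOT/0.1 (BOT for JCE)" "-"
EOF

# Strip the query string from field 7, count each path in an awk array,
# then sort by count, keep the top 10, and format the report.
top_urls=$(awk '{
    url = $7
    sub(/\?.*/, "", url)      # drop everything from the first "?" onward
    count[url]++
}
END {
    for (u in count) print count[u], u
}' sample.log | sort -rn | head -n 10 |
awk '{printf "%d. URL: %s\n   Requests: %d\n", NR, $2, $1}')

printf '%s\n' "$top_urls"
```

Working on a copy of $7 leaves the original record untouched, and counting in awk means sort only has to order one line per distinct URL instead of every log line, which may help on a file with more than a million lines. The printf also replaces the "\b." trick for putting the dot right after the rank number.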