Cut delimiters in bash

Time:12-28

I'm trying to collect information from a log file. Currently I want to find the most popular IP sources, specifically the top 10 packet senders. My log file looks like this:

Feb 24 21:32:17 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=67.24.151.60 DST=11.11.11.69 LEN=48 TOS=0x00 PREC=0x00 TTL=115 ID=28482 DF PROTO=TCP SPT=3252 DPT=445 WINDOW=8760 RES=0x00 SYN URGP=0  
Feb 24 21:32:17 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=67.24.151.60 DST=11.11.11.69 LEN=48 TOS=0x00 PREC=0x00 TTL=115 ID=28534 DF PROTO=TCP SPT=3252 DPT=445 WINDOW=8760 RES=0x00 SYN URGP=0  
Feb 24 21:32:18 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=67.24.151.60 DST=11.11.11.69 LEN=48 TOS=0x00 PREC=0x00 TTL=115 ID=28575 DF PROTO=TCP SPT=3252 DPT=445 WINDOW=8760 RES=0x00 SYN URGP=0  
Feb 24 21:33:19 bridge kernel: INBOUND UDP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=202.108.249.51 DST=11.11.11.100 LEN=404 TOS=0x00 PREC=0x00 TTL=114 ID=49546 PROTO=UDP SPT=1282 DPT=1434 LEN=384  
Feb 24 21:33:54 bridge kernel: Legal Broadcast: IN=br0 PHYSIN=eth1 OUT=br0 PHYSOUT=eth0 SRC=11.11.11.67 DST=11.11.11.255 LEN=241 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=UDP SPT=138 DPT=138 LEN=221  
Feb 24 21:33:55 bridge kernel: Legal Broadcast: IN=br0 PHYSIN=eth1 OUT=br0 PHYSOUT=eth0 SRC=11.11.11.67 DST=11.11.11.255 LEN=232 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=UDP SPT=138 DPT=138 LEN=212  
Feb 24 21:35:11 bridge kernel: INBOUND UDP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=163.21.152.7 DST=11.11.11.67 LEN=69 TOS=0x00 PREC=0x00 TTL=45 ID=0 DF PROTO=UDP SPT=1812 DPT=1812 LEN=49  
Feb 24 21:36:12 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=4.34.188.37 DST=11.11.11.125 LEN=48 TOS=0x00 PREC=0x00 TTL=115 ID=16941 DF PROTO=TCP SPT=1649 DPT=445 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb 24 21:36:13 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=4.34.188.37 DST=11.11.11.125 LEN=48 TOS=0x00 PREC=0x00 TTL=115 ID=17045 DF PROTO=TCP SPT=1649 DPT=445 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb 24 21:36:14 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=4.34.188.37 DST=11.11.11.125 LEN=48 TOS=0x00 PREC=0x00 TTL=115 ID=17164 DF PROTO=TCP SPT=1649 DPT=445 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb 24 21:36:16 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=61.41.216.94 DST=11.11.11.80 LEN=48 TOS=0x00 PREC=0x00 TTL=109 ID=1309 DF PROTO=TCP SPT=3179 DPT=445 WINDOW=16384 RES=0x00 SYN URGP=0

I built this script:

    x=`grep "SRC=" $1 | cut -d " " -f13 | sed 's/\.[0-9]\+$//g' | sort -n | uniq -c | sort -n -r | head -n 10`
    echo "${x}"

However, my log file has these Legal Broadcast lines that mess everything up, since I use " " as the delimiter for cut. Is there a way to use a string as the cut delimiter to extract the source IPs, instead of my current " "?

Currently my output looks like this:

    100398 PHYSOUT=eth1
    12311 PHYSOUT=eth0
    9454 SRC=11.11.11
    8121 SRC=63.13.135
    6394 SRC=127.0.0

The format is what I want: the number of packets each source sent, followed by its IP. However, the Legal Broadcast lines seem to be skewing the top results by showing values from an earlier column.

CodePudding user response:

With the information you provided, it's hard to guess what exactly would constitute an acceptable solution. A simple fix for what you describe would be to skip any line which contains "Legal Broadcast".

However, long pipelines like this are almost always better refactored into a single Awk script anyway.

    awk '/SRC=/ && !/Legal Broadcast/ { ip=$13; sub(/\.[0-9]*$/, "", ip); ++sum[ip] }
        END { for(net in sum) print sum[net], net }' "$1" |
    sort -rn | head -n 10

If you can't predict the position of the SRC field, you can strip everything around it instead:

    awk '/SRC=/ && !/Legal Broadcast/ { ip=$0; sub(/.* SRC=/, "", ip); sub(/\.[0-9]+( .*)?$/, "", ip); ++sum[ip] }
        END { for(net in sum) print sum[net], net }' "$1" |
    sort -rn | head -n 10

(Obviously take out the !/Legal Broadcast/ condition if you want to include those lines, too.)
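
As for the literal question: cut only accepts a single-character delimiter, but Awk's -F also takes a longer string (which it treats as a regular expression). A minimal sketch of that approach follows. The function name top_sources and the sample lines are made up for illustration, and unlike the scripts above this version keeps the Legal Broadcast lines in the count.

```shell
# top_sources is a hypothetical helper name, not part of the original script.
# It reads the log from the named file(s), or from stdin if none are given.
top_sources() {
    awk -F ' SRC=' '
        NF > 1 {                        # the line contains " SRC=" at least once
            split($2, rest, " ")        # rest[1] is the full source address
            net = rest[1]
            sub(/\.[0-9]+$/, "", net)   # drop the last octet
            ++count[net]
        }
        END { for (net in count) print count[net], net }
    ' "$@" | sort -rn | head -n 10
}

# Usage with three shortened lines in the same format as the log:
top_sources <<'EOF'
Feb 24 21:32:17 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=67.24.151.60 DST=11.11.11.69 LEN=48
Feb 24 21:32:18 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=67.24.151.60 DST=11.11.11.69 LEN=48
Feb 24 21:33:54 bridge kernel: Legal Broadcast: IN=br0 PHYSIN=eth1 OUT=br0 PHYSOUT=eth0 SRC=11.11.11.67 DST=11.11.11.255 LEN=241
EOF
```

Because the line is split on the string " SRC=" rather than on spaces, it no longer matters how many space-separated columns come before the address.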

Capturing output in a variable just so you can echo it is a useless use of echo and a (tiny) waste of memory. Note also that the /g flag for sed is only meaningful if a line can contain more than one match, which is not possible with the $-anchored regex you used.
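
To illustrate both remarks together with the simple fix mentioned at the top, here is a rough sketch of the pipeline printing straight to standard output, with no variable and no /g. It is not the original script: the function name top_senders is hypothetical, and cut is swapped for three sed substitutions so that the result does not depend on which column SRC lands in.

```shell
# top_senders is a hypothetical name; "$1" is the log file, as in the question.
top_senders() {
    grep 'SRC=' "$1" |
    grep -v 'Legal Broadcast' |          # the simple fix: skip those lines
    sed -e 's/.* SRC=//' \
        -e 's/ .*//' \
        -e 's/\.[0-9]*$//' |             # anchored with $, so /g would add nothing
    sort | uniq -c | sort -rn | head -n 10
}

# Usage: top_senders path/to/logfile
```

The first substitution removes everything up to the address, the second removes everything after it, and the third drops the last octet.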

CodePudding user response:

You can do everything by simply using GNU awk:

    gawk -F= -v RS=' ' 'BEGIN { PROCINFO["sorted_in"] = "@val_num_desc" }
            $1 ~ /^(SRC|PHYSOUT)$/ { ++a[$1 FS $2] }
            END { for (i in a) print a[i] "\t" i }' "$1"
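
The idea here is that RS=' ' turns every space-separated token into its own record, and -F= then splits a token such as SRC=67.24.151.60 into a key and a value. A stripped-down sketch of the same technique, under a few assumptions: it uses plain awk with sort and head instead of the gawk-only PROCINFO["sorted_in"], it counts only SRC and keeps the full address, the name count_sources is made up, and it relies on the SRC= token being neither the first nor the last token on a line (true for the log format shown).

```shell
# count_sources is a hypothetical helper name.
# It reads the log from the named file(s), or from stdin if none are given.
count_sources() {
    awk -F= -v RS=' ' '
        $1 == "SRC" { ++count[$2] }      # record looks like SRC=67.24.151.60
        END { for (ip in count) print count[ip], ip }
    ' "$@" | sort -rn | head -n 10
}

# Usage: count_sources path/to/logfile
```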

CodePudding user response:

This might be what you're looking for:

    sed -n 's/.* SRC=\([^ ]*\)\..*/\1/p' "$1" | sort | uniq -c | sort -rn | head -n10