Can AWK transpose this data faster than a bash for loop?-CodePudding

Here is my sample data:

DATA='target1.domain,LAST_VULN_AGENT_SCAN,2022/12/07 03:14:49
target2.domain,LAST_VULN_AGENT_SCAN,2022/12/07 03:14:30
target3.domain,LAST_VULN_AGENT_SCAN,2022/12/07 03:14:49
target1.domain,LAST_VULN_NONCRED_SCAN,2022/12/07 00:08:43
target2.domain,LAST_VULN_NONCRED_SCAN,2022/12/07 00:08:43
target3.domain,LAST_VULN_NONCRED_SCAN,2022/12/07 00:08:43
target1.domain,LAST_VULN_CRED_SCAN,2022/12/07 04:59:06
target2.domain,LAST_VULN_CRED_SCAN,2022/12/07 04:59:06
target3.domain,LAST_VULN_CRED_SCAN,2022/12/07 03:03:52'

Here is my desired output:

Name,LAST_VULN_AGENT_SCAN,LAST_VULN_NONCRED_SCAN,LAST_VULN_CRED_SCAN
target1.domain,2022/12/07 03:14:49,2022/12/07 00:08:43,2022/12/07 04:59:06
target2.domain,2022/12/07 03:14:30,2022/12/07 00:08:43,2022/12/07 04:59:06
target3.domain,2022/12/07 03:14:49,2022/12/07 00:08:43,2022/12/07 03:03:52

Here is my current for loop:

UNIQUETARGETS=$(echo "${DATA}" | cut -d , -f 1 | sort | uniq)

echo 'Name,LAST_VULN_AGENT_SCAN,LAST_VULN_NONCRED_SCAN,LAST_VULN_CRED_SCAN'
for TARGET in $UNIQUETARGETS; do
    LAST_VULN_AGENT_SCAN=$(echo "${DATA}" | grep "${TARGET}," | grep 'LAST_VULN_AGENT_SCAN' | cut -d , -f 3)
    LAST_VULN_NONCRED_SCAN=$(echo "${DATA}" | grep "${TARGET}," | grep 'LAST_VULN_NONCRED_SCAN' | cut -d , -f 3)
    LAST_VULN_CRED_SCAN=$(echo "${DATA}" | grep "${TARGET}," | grep 'LAST_VULN_CRED_SCAN' | cut -d , -f 3)
    echo "${TARGET},${LAST_VULN_AGENT_SCAN},${LAST_VULN_NONCRED_SCAN},${LAST_VULN_CRED_SCAN}"
done

While this method works, I am confident that AWK can do this much faster. I've searched for several "transpose" awk snippets, but nothing I've found can do quite what I want. Any help is greatly appreciated!

CodePudding user response：

Using any awk:

$ cat tst.awk
BEGIN { FS=OFS="," }
{
    names[$1]
    times[$1,$2] = $3
}
END {
    hdr = "Name,LAST_VULN_AGENT_SCAN,LAST_VULN_NONCRED_SCAN,LAST_VULN_CRED_SCAN"
    print hdr

    numCols = split(hdr,scans)

    for ( name in names ) {
        printf "%s%s", name, OFS
        for ( colNr=2; colNr<=numCols; colNr   ) {
            scan = scans[colNr]
            time = times[name,scan]
            printf "%s%s", time, (colNr<numCols ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
Name,LAST_VULN_AGENT_SCAN,LAST_VULN_NONCRED_SCAN,LAST_VULN_CRED_SCAN
target3.domain,2022/12/07 03:14:49,2022/12/07 00:08:43,2022/12/07 03:03:52
target2.domain,2022/12/07 03:14:30,2022/12/07 00:08:43,2022/12/07 04:59:06
target1.domain,2022/12/07 03:14:49,2022/12/07 00:08:43,2022/12/07 04:59:06

If you care about the order of output lines they can be sorted in whatever order you like.

CodePudding user response：

I would use ruby with its csv parser for this personally.

ruby -r csv -e '
options={ :headers=>false }
inp=CSV.parse($<.read, **options)
names=inp.map{|sa| sa[0]}.uniq.sort
header=["Name"] inp.map{|sa| sa[1]}.uniq.sort
data=inp.each_with_object(Hash.new {|h,k| h[k]={} }) { 
        |(name, func, date), h|
        h[name][func]=date
}
table=CSV::Table.new([], headers: header)
names.each{ |n| line=[n]; header[1..].each { |h| line << data[n][h] }
    table << line }
puts table
' file

Prints:

Name,LAST_VULN_AGENT_SCAN,LAST_VULN_CRED_SCAN,LAST_VULN_NONCRED_SCAN
target1.domain,2022/12/07 03:14:49,2022/12/07 04:59:06,2022/12/07 00:08:43
target2.domain,2022/12/07 03:14:30,2022/12/07 04:59:06,2022/12/07 00:08:43
target3.domain,2022/12/07 03:14:49,2022/12/07 03:03:52,2022/12/07 00:08:43

CodePudding user response：

If you just wanted to do it in bash the logic still works with very few tweaks required.

declare -A names=() v=();  
while IFS=, read n c t; do names[$n]=1; v["$n,$c"]="$t"; done < test.in
echo "Name,LAST_VULN_AGENT_SCAN,LAST_VULN_NONCRED_SCAN,LAST_VULN_CRED_SCAN"
for n in "${!names[@]}"
do echo "$n,${v[$n,LAST_VULN_AGENT_SCAN]},${v[$n,LAST_VULN_NONCRED_SCAN]},${v[$n,LAST_VULN_CRED_SCAN]}"
done

hdr="Name,LAST_VULN_AGENT_SCAN,LAST_VULN_NONCRED_SCAN,LAST_VULN_CRED_SCAN"
echo "$hdr"; IFS=,  read -a col <<< "$hdr"
declare -A names=() v=();  
while IFS=, read n c t; do names[$n]=1; v["$n,$c"]="$t"; done < test.in
for n in "${!names[@]}"; do echo "$n,${v[$n,${col[1]}]},${v[$n,${col[2]}]},${v[$n,${col[3]}]}"; done

The output -

Name,LAST_VULN_AGENT_SCAN,LAST_VULN_NONCRED_SCAN,LAST_VULN_CRED_SCAN
target3.domain,2022/12/07 03:14:49,2022/12/07 00:08:43,2022/12/07 03:03:52
target2.domain,2022/12/07 03:14:30,2022/12/07 00:08:43,2022/12/07 04:59:06
target1.domain,2022/12/07 03:14:49,2022/12/07 00:08:43,2022/12/07 04:59:06

And as mentioned, you could always add sort.
If that last line were

for n in "${!names[@]}"; do echo "$n,${v[$n,${col[1]}]},${v[$n,${col[2]}]},${v[$n,${col[3]}]}"; done | sort

then the output becomes

Name,LAST_VULN_AGENT_SCAN,LAST_VULN_NONCRED_SCAN,LAST_VULN_CRED_SCAN
target1.domain,2022/12/07 03:14:49,2022/12/07 00:08:43,2022/12/07 04:59:06
target2.domain,2022/12/07 03:14:30,2022/12/07 00:08:43,2022/12/07 04:59:06
target3.domain,2022/12/07 03:14:49,2022/12/07 00:08:43,2022/12/07 03:03:52

But awk is still smarter.