I have a tab delimited file like this:
chr1 104517 105076 abc 148
chr1 127781 128051 def 89
chr1 186884 186981 xyz 97
chr1 127781 128051 def 55
chr1 890934 891105 abc 50
chr1 104517 105076 abc 24
chr1 890934 891105 xyz 19
First, for every values in column 4 I wanted sum of the values in column 5. Like
abc 222
def 144
xyz 116
I did it with this code:
awk -F'\t' '{ SUM[$4] = $5 } END { for (j in SUM) print j, SUM[j] }' filename
Now I want to do this separately for every unique combination of first three columns. For example, in case of above input file, I want this output:
chr1 104517 105076 abc 172
chr1 127781 128051 def 144
chr1 186884 186981 xyz 97
chr1 890934 891105 abc 50 xyz 19
Can someone please tell me the way to do this in bash script?
Thank you
CodePudding user response:
I'd turn to perl
instead of awk
for its better support for complex data structures:
$ perl -M5.020 -lane '
our $data;
$data->{$F[0]}{$F[1]}{$F[2]}{$F[3]} = $F[4];
END {
for my $c1 (sort keys %$data) {
for my $c2 (sort { $a <=> $b } keys %{$data->{$c1}}) {
for my $c3 (sort { $a <=> $b } keys %{$data->{$c1}{$c2}}) {
my $rest = $data->{$c1}{$c2}{$c3};
print join("\t", $c1, $c2, $c3, %$rest{sort keys %$rest});
}
}
}
}' input.tsv
chr1 104517 105076 abc 172
chr1 127781 128051 def 144
chr1 186884 186981 xyz 97
chr1 890934 891105 abc 50 xyz 19
Basically, builds a 4-dimensional hash table using the first four columns of each line as keys, with the sum of the fifth column as the final value. Then walks the levels of the table in sorted order and prints the result.