Home > database >  Getting sum of values in a particular column with some conditions
Getting sum of values in a particular column with some conditions

Time:12-05

I have a tab delimited file like this:

chr1    104517  105076  abc    148
chr1    127781  128051  def    89
chr1    186884  186981  xyz    97
chr1    127781  128051  def    55
chr1    890934  891105  abc    50
chr1    104517  105076  abc    24
chr1    890934  891105  xyz    19

First, for every values in column 4 I wanted sum of the values in column 5. Like

abc 222
def 144 
xyz 116

I did it with this code:

awk -F'\t' '{ SUM[$4]  = $5 } END { for (j in SUM) print j, SUM[j] }' filename

Now I want to do this separately for every unique combination of first three columns. For example, in case of above input file, I want this output:

chr1    104517  105076  abc  172
chr1    127781  128051  def  144
chr1    186884  186981  xyz  97
chr1    890934  891105  abc  50  xyz  19

Can someone please tell me the way to do this in bash script?

Thank you

CodePudding user response:

I'd turn to perl instead of awk for its better support for complex data structures:

$ perl -M5.020 -lane '
  our $data;
  $data->{$F[0]}{$F[1]}{$F[2]}{$F[3]}  = $F[4];
  END {
    for my $c1 (sort keys %$data) {
      for my $c2 (sort { $a <=> $b } keys %{$data->{$c1}}) {
        for my $c3 (sort { $a <=> $b } keys %{$data->{$c1}{$c2}}) {
          my $rest = $data->{$c1}{$c2}{$c3};
          print join("\t", $c1, $c2, $c3, %$rest{sort keys %$rest});
        }
      }
    }
  }' input.tsv
chr1    104517  105076  abc     172
chr1    127781  128051  def     144
chr1    186884  186981  xyz     97
chr1    890934  891105  abc     50      xyz     19

Basically, builds a 4-dimensional hash table using the first four columns of each line as keys, with the sum of the fifth column as the final value. Then walks the levels of the table in sorted order and prints the result.

  • Related