Using bash comm command on columns but returning the entire line

Time:11-15

I have two files, each with two columns and sorted only by the second column, such as:

File 1: 

176 AAATC
6   CCGTG
80  TTTCG

File 2:

20 AAATC
77 CTTTT
50 TTTTT

I would like to use the comm command with options -13 and -23 to produce two files, each listing the lines unique to one of the input files together with their counts, while comparing only the second column (i.e. the strings). What I have tried so far is:

comm -23 <(cut -d$'\t' -f2 file1.txt) <(cut -d$'\t' -f2 file2.txt)

But this outputs only the strings, without the numbers:

 CCGTG
 TTTCG

What I want is:

 6  CCGTG
 80 TTTCG

Any suggestions?

Thanks!

CodePudding user response:

You can use join instead of comm:

join -1 2 -2 2 File1 File2 -a 1 -o 1.1,1.2,2.2

This also outputs the matching lines, but you can remove them by appending

| grep -v '[ACTG] [ACTG]'

Explanation:

  • -1 2 use the second column in file 1 for joining;
  • -2 2 similarly, use the second column in file 2;
  • -a 1 show also non-matching lines from file 1 - these are the ones you want in the end;
  • -o specifies the output format, here we want columns 1 and 2 from file 1 and column 2 from file 2 (this is just arbitrary, you can use column 1 as well, but the second command would be different: | grep -v '[ACTG] [0-9]').
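
As a possible alternative to filtering with grep, join also has a -v option that prints only the unpairable lines of one file. The sketch below assumes space-separated input already sorted on column 2; the names file1.txt, file2.txt, only_in_file1.txt and only_in_file2.txt are placeholders, not names from the question. LC_ALL=C is set so that join's idea of sort order matches plain byte order.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
cd "$(mktemp -d)"

# Sample data from the question, already sorted on column 2.
printf '176 AAATC\n6 CCGTG\n80 TTTCG\n' > file1.txt
printf '20 AAATC\n77 CTTTT\n50 TTTTT\n' > file2.txt

# -v 1: print only the lines of file 1 whose column 2 has no match in file 2.
LC_ALL=C join -1 2 -2 2 -v 1 -o 1.1,1.2 file1.txt file2.txt > only_in_file1.txt

# -v 2: the same for file 2.
LC_ALL=C join -1 2 -2 2 -v 2 -o 2.1,2.2 file1.txt file2.txt > only_in_file2.txt

cat only_in_file1.txt
cat only_in_file2.txt
```

If the real files are tab-delimited, adding -t "$(printf '\t')" to each join call should make it split (and print) on tabs instead of blanks.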

CodePudding user response:

comm is not the right tool for this job. join will work, but you would need to run it twice (once per output file) and then filter the results with another command (eg, grep).

One awk idea that requires a single pass through each input file:

awk 'BEGIN {FS=OFS="\t"}

FNR==NR  { f1[$2]=$1; next }           # save 1st file entries

$2 in f1 { delete f1[$2]; next }       # 2nd file: if $2 in f1[] then delete f1[] entry and skip this line else ..
         { f2[$2]=$1 }                 # save 2nd file entries

END      { # at this point:
           # f1[] contains rows where field #2 only exists in the 1st file
           # f2[] contains rows where field #2 only exists in the 2nd file

           PROCINFO["sorted_in"]="@ind_str_asc"
           for (i in f1) print f1[i],i > "file-23"
           for (i in f2) print f2[i],i > "file-13"
         }
' file1 file2

NOTE: the PROCINFO["sorted_in"] line requires GNU awk. Without it, the order of writes to the output files is not guaranteed, and OP would need to add more (awk) code to maintain the ordering or use another OS-level utility (eg, sort) to sort the final files.

This generates:

$ cat file-23
6       CCGTG
80      TTTCG

$ cat file-13
77      CTTTT
50      TTTTT
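
For an awk without PROCINFO["sorted_in"], one way to follow the sort suggestion in the NOTE above is to pipe each print through sort from inside awk. This is only a sketch: it assumes tab-delimited input, reuses the file names from the script above, and the variables cmd23 and cmd13 are introduced here purely for illustration.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
cd "$(mktemp -d)"

# Tab-delimited copies of the question's sample data.
printf '176\tAAATC\n6\tCCGTG\n80\tTTTCG\n' > file1
printf '20\tAAATC\n77\tCTTTT\n50\tTTTTT\n' > file2

awk 'BEGIN {FS=OFS="\t"}
FNR==NR  { f1[$2]=$1; next }        # save 1st file entries
$2 in f1 { delete f1[$2]; next }    # key is in both files: drop it
         { f2[$2]=$1 }              # key is only in the 2nd file
END      { # let sort restore the column-2 ordering of each output file
           cmd23 = "LC_ALL=C sort -k2,2 > file-23"
           cmd13 = "LC_ALL=C sort -k2,2 > file-13"
           for (i in f1) print f1[i], i | cmd23
           for (i in f2) print f2[i], i | cmd13
           close(cmd23); close(cmd13)
         }
' file1 file2

cat file-23
cat file-13
```

Note that with this approach an output file is only created when there is at least one line to write to it.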