I have two files, each with two columns and sorted only by the second column, such as:
File 1:
176 AAATC
6 CCGTG
80 TTTCG
File 2:
20 AAATC
77 CTTTT
50 TTTTT
I would like to use the comm command with options -13 and -23 to produce two files containing the lines unique to each input file, together with their count numbers, while comparing only the second column (i.e. the strings). What I have tried so far is:
comm -23 <(cut -d$'\t' -f2 file1.txt) <(cut -d$'\t' -f2 file2.txt)
But this outputs only the strings, without the numbers:
CCGTG
TTTCG
While what I want would be:
6 CCGTG
80 TTTCG
Any suggestion?
Thanks!
CodePudding user response:
You can use join instead of comm:
join -1 2 -2 2 -a 1 -o 1.1,1.2,2.2 File1 File2
It will output the matching lines, too, but you can remove them with
| grep -v '[ACTG] [ACTG]'
Explanation:
-1 2: use the second column in file 1 for joining;
-2 2: similarly, use the second column in file 2;
-a 1: also show non-matching lines from file 1 (these are the ones you want in the end);
-o: specifies the output format; here we want columns 1 and 2 from file 1 and column 2 from file 2 (this is arbitrary; you could use column 1 of file 2 instead, but the second command would then be | grep -v '[ACTG] [0-9]').
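If the matching lines are not needed at all, join's -v option may be a shorter route: -v 1 prints only the lines of file 1 that have no partner in file 2, so no grep step is required. The following is a minimal sketch, not the answer's own command. It assumes the sample data from the question is tab-separated, and the temporary directory and the names only1.txt and only2.txt are hypothetical.

```shell
# Sample data from the question, written to hypothetical temp files (tab-separated).
tmp=$(mktemp -d)
printf '176\tAAATC\n6\tCCGTG\n80\tTTTCG\n' > "$tmp/file1.txt"
printf '20\tAAATC\n77\tCTTTT\n50\tTTTTT\n' > "$tmp/file2.txt"

# -v 1: print only lines of file 1 whose second column has no match in file 2.
join -1 2 -2 2 -v 1 -o 1.1,1.2 "$tmp/file1.txt" "$tmp/file2.txt" > "$tmp/only1.txt"
# -v 2: the same for file 2.
join -1 2 -2 2 -v 2 -o 2.1,2.2 "$tmp/file1.txt" "$tmp/file2.txt" > "$tmp/only2.txt"

cat "$tmp/only1.txt"
cat "$tmp/only2.txt"
```

With the default field separator, join writes its output fields separated by a single space, even when the input is tab-separated. Both inputs must already be sorted on the join column, which is the case here.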
CodePudding user response:
comm is not the right tool for this job, and while join will work you also need to look at running join twice and then further filtering the results with some other command (eg, grep).
One awk idea that requires a single pass through each input file:
awk 'BEGIN {FS=OFS="\t"}
FNR==NR  { f1[$2]=$1; next }        # save 1st file entries
$2 in f1 { delete f1[$2]; next }    # 2nd file: if $2 in f1[] then delete f1[] entry and skip this line else ..
         { f2[$2]=$1 }              # save 2nd file entries
END { # at this point:
      # f1[] contains rows where field #2 only exists in the 1st file
      # f2[] contains rows where field #2 only exists in the 2nd file
      PROCINFO["sorted_in"]="@ind_str_asc"
      for (i in f1) print f1[i],i > "file-23"
      for (i in f2) print f2[i],i > "file-13"
    }
' file1 file2
NOTE: the PROCINFO["sorted_in"] line requires GNU awk; without this line we cannot guarantee the order of writes to the final output files, and OP would then need to add more (awk) code to maintain the ordering or use another OS-level utility (eg, sort) to sort the final files.
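For systems without GNU awk, the sort-based fallback mentioned in the NOTE could look roughly like this. It is a sketch under assumptions rather than part of the original answer: the input is taken to be tab-separated, the temporary directory is hypothetical, and the output names are passed in through the made-up variables out1 and out2.

```shell
# Sample data from the question, in hypothetical temp files (tab-separated).
tmp=$(mktemp -d)
printf '176\tAAATC\n6\tCCGTG\n80\tTTTCG\n' > "$tmp/file1"
printf '20\tAAATC\n77\tCTTTT\n50\tTTTTT\n' > "$tmp/file2"

# Same logic as the answer, minus the gawk-only PROCINFO line.
awk -v out1="$tmp/file-23" -v out2="$tmp/file-13" '
BEGIN    { FS=OFS="\t"
           printf "" > out1; printf "" > out2 }   # create both outputs even if empty
FNR==NR  { f1[$2]=$1; next }                      # save 1st file entries
$2 in f1 { delete f1[$2]; next }                  # key present in both files: drop it
         { f2[$2]=$1 }                            # key only in the 2nd file
END      { for (i in f1) print f1[i], i > out1    # written in unspecified order
           for (i in f2) print f2[i], i > out2 }
' "$tmp/file1" "$tmp/file2"

# Restore the ordering by the second column; sort -o may name an input file.
tab=$(printf '\t')
for f in "$tmp/file-23" "$tmp/file-13"; do
    LC_ALL=C sort -t "$tab" -k2,2 -o "$f" "$f"
done
```

LC_ALL=C is used here so that the ordering is plain byte order, which does not depend on the current locale.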
This generates:
$ cat file-23
6 CCGTG
80 TTTCG
$ cat file-13
77 CTTTT
50 TTTTT