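I'm trying to use awk to find the common lines between two files and save them to a .txt file. The first file (file1.cls) holds the first two clusters shown below and the second (file2.cls) the remaining three: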
>CL1 1
lcu_1 lcu_2 lcu_3
>CL2 1
lcu_6 lcu_4 lcu_8
>CL1 1
ler_1 lcu_2 ler_3
>CL2 1
lcu_1 lcu_2 lcu_3
>CL3 1
lcu_6 lcu_4 lcu_8
Expected output with the two common "CL's":
>CL1 1
lcu_1 lcu_2 lcu_3
>CL2 1
lcu_6 lcu_4 lcu_8
The code I'm using:
awk 'FNR==NR {a[$1]; next} $1 in a' file1.cls file2.cls > out.txt
Actual output:
CL1 1
CL2 1
Does anyone know how to solve this?
CodePudding user response:
With awk, you can use > as the record separator, so that each header line and the lines that follow it form a single record. The output is a bit messed up though:
$ awk 'BEGIN {RS = ORS = ">"} NR == FNR {clu[$1]; next} $1 in clu' file2.cls file1.cls
>CL1 1
lcu_1 lcu_2 lcu_3
>CL2 1
lcu_6 lcu_4 lcu_8
>⏎
My shell outputs ⏎ to indicate no trailing newline.
Cleaning up the output:
awk '
BEGIN {RS = ">"}
NR == FNR {clu[$1]; next}
length($1) && $1 in clu {gsub(/^\n|\n$/, ""); print ">" $0}
' file2.cls file1.cls
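To try this end to end, here is a minimal sketch that recreates the two sample files from the question and saves the result to out.txt. The scratch directory is an assumption for illustration, and the file contents are copied from the question's example, so adjust the paths for real data.

```shell
#!/bin/sh
# Work in a scratch directory so no existing file is overwritten (assumption).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample inputs copied from the question.
printf '%s\n' '>CL1 1' 'lcu_1 lcu_2 lcu_3' \
              '>CL2 1' 'lcu_6 lcu_4 lcu_8' > file1.cls
printf '%s\n' '>CL1 1' 'ler_1 lcu_2 ler_3' \
              '>CL2 1' 'lcu_1 lcu_2 lcu_3' \
              '>CL3 1' 'lcu_6 lcu_4 lcu_8' > file2.cls

# file2.cls is read first to collect the cluster names (first field of each
# ">"-separated record); the file1.cls records with a known name are printed.
awk '
BEGIN {RS = ">"}
NR == FNR {clu[$1]; next}
length($1) && $1 in clu {gsub(/^\n|\n$/, ""); print ">" $0}
' file2.cls file1.cls > out.txt

cat out.txt
```

Because file1.cls is the second argument, the records written to out.txt are the ones from file1.cls. Swapping the two arguments would print the matching records from file2.cls instead.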
CodePudding user response:
The line lcu_1 lcu_2 lcu_3 appears before >CL2 1 in your first file but after it in the second. I shall assume you don't care about ordering.
You don't specify what should happen if a file contains duplicate/identical lines. I shall assume only one copy is required.
The Unix utility comm finds common/distinct lines in two sorted files:
comm -12 <(sort -u file1.cls) <(sort -u file2.cls)
giving:
>CL1 1
>CL2 1
lcu_1 lcu_2 lcu_3
lcu_6 lcu_4 lcu_8
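If ordering does matter, or process substitution (the <(...) syntax, which plain sh lacks) is not available, a rough awk alternative is sketched below. It is not the comm approach itself: it compares whole lines, keeps the common ones in the order they appear in the second file argument, and prints each only once. The scratch directory and the common.txt name are assumptions for illustration; the sample contents are copied from the question.

```shell
#!/bin/sh
# Work in a scratch directory so no existing file is overwritten (assumption).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample inputs copied from the question.
printf '%s\n' '>CL1 1' 'lcu_1 lcu_2 lcu_3' \
              '>CL2 1' 'lcu_6 lcu_4 lcu_8' > file1.cls
printf '%s\n' '>CL1 1' 'ler_1 lcu_2 ler_3' \
              '>CL2 1' 'lcu_1 lcu_2 lcu_3' \
              '>CL3 1' 'lcu_6 lcu_4 lcu_8' > file2.cls

# First pass (file1.cls): remember every whole line.
# Second pass (file2.cls): print a line if it was seen in file1.cls
# and has not already been printed.
awk 'NR == FNR {seen[$0]; next} ($0 in seen) && !printed[$0]++' \
    file1.cls file2.cls > common.txt

cat common.txt
```

No sorting is needed here, at the cost of holding every line of the first file in memory.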