awk to compare files

Time:07-12

I'm trying to use awk to find the common lines between two files and save them as a .txt file. The two files are as follows:

file1.cls:

>CL1 1
lcu_1 lcu_2 lcu_3
>CL2 1
lcu_6 lcu_4 lcu_8

file2.cls:

>CL1 1
ler_1 lcu_2 ler_3
>CL2 1
lcu_1 lcu_2 lcu_3
>CL3 1
lcu_6 lcu_4 lcu_8

Expected output with the two common "CL's":

>CL1 1
lcu_1 lcu_2 lcu_3
>CL2 1
lcu_6 lcu_4 lcu_8

The code I'm using:

awk 'FNR==NR {a[$1]; next} $1 in a' file1.cls file2.cls > out.txt

Actual output:

CL1 1
CL2 1

Does anyone know how to solve this?

CodePudding user response:

With awk, you can use > as the record separator, so that each cluster (header line plus member line) is one record. The output is a bit messed up, though:

$ awk 'BEGIN {RS = ORS = ">"} NR == FNR {clu[$1]; next} $1 in clu' file2.cls file1.cls
>CL1 1
lcu_1 lcu_2 lcu_3
>CL2 1
lcu_6 lcu_4 lcu_8
>⏎

My shell outputs ⏎ to indicate that there is no trailing newline.

Cleaning up the output:

awk '
    BEGIN {RS = ">"}
    NR == FNR {clu[$1]; next}
    length($1) && $1 in clu {gsub(/^\n|\n$/, ""); print ">" $0}
' file2.cls file1.cls
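
For anyone who wants to try this end to end, here is a minimal sketch that recreates the sample data and runs the cleaned-up command. It assumes the first four lines shown in the question are file1.cls and the remaining six are file2.cls; out.txt is the output name taken from the question.

```shell
# Recreate the two sample files (assumed split: 4 lines in file1.cls, 6 in file2.cls).
printf '%s\n' '>CL1 1' 'lcu_1 lcu_2 lcu_3' '>CL2 1' 'lcu_6 lcu_4 lcu_8' > file1.cls
printf '%s\n' '>CL1 1' 'ler_1 lcu_2 ler_3' '>CL2 1' 'lcu_1 lcu_2 lcu_3' \
              '>CL3 1' 'lcu_6 lcu_4 lcu_8' > file2.cls

# file2.cls is read first to collect its cluster names (CL1, CL2, CL3).
# Records of file1.cls whose name was collected are then printed.
# With RS=">" the text before the first ">" is an empty record;
# the length($1) test skips it.
awk '
    BEGIN {RS = ">"}
    NR == FNR {clu[$1]; next}
    length($1) && $1 in clu {gsub(/^\n|\n$/, ""); print ">" $0}
' file2.cls file1.cls > out.txt

cat out.txt
```

Note that the match is on the cluster name only (the first field after >), so the member lines printed are always the ones from file1.cls, whatever file2.cls lists under the same name.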

CodePudding user response:

The line lcu_1 lcu_2 lcu_3 appears before >CL2 1 in your first file but after it in the second. I shall assume you don't care about ordering.

You don't specify what should happen if a file contains duplicate/identical lines. I shall assume only one copy is required.

The Unix utility comm finds the common or distinct lines of two sorted files:

comm -12 <(sort -u file1) <(sort -u file2)

giving:

>CL1 1
>CL2 1
lcu_1 lcu_2 lcu_3
lcu_6 lcu_4 lcu_8
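
The <(...) process substitution above needs a shell such as bash, zsh or ksh. As a rough sketch for a plain POSIX sh, the same idea can be written with temporary files. The names file1.sorted, file2.sorted and common.txt are made up for this example, and the sample data is assumed to be split as 4 lines in file1.cls and 6 lines in file2.cls.

```shell
# Recreate the two sample files (assumed split: 4 lines in file1.cls, 6 in file2.cls).
printf '%s\n' '>CL1 1' 'lcu_1 lcu_2 lcu_3' '>CL2 1' 'lcu_6 lcu_4 lcu_8' > file1.cls
printf '%s\n' '>CL1 1' 'ler_1 lcu_2 ler_3' '>CL2 1' 'lcu_1 lcu_2 lcu_3' \
              '>CL3 1' 'lcu_6 lcu_4 lcu_8' > file2.cls

# sort and comm must agree on the collation order; the C locale makes it byte-wise.
export LC_ALL=C

# Sort each file and drop duplicate lines (hypothetical temporary file names).
sort -u file1.cls > file1.sorted
sort -u file2.cls > file2.sorted

# -1 drops lines found only in the first file, -2 drops lines found only in
# the second, so only the lines present in both remain.
comm -12 file1.sorted file2.sorted > common.txt

cat common.txt
```

Because comm compares whole lines independently, a member line counts as common even when it sits under different headers in the two files, as lcu_1 lcu_2 lcu_3 does here.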