awk remove duplicates based on two columns and custom duplication rule

I want to process a semicolon-separated CSV input file like the following:

a;b
b;c
b;a
c;d
x;y
d;c

and remove duplicate lines according to the following rule: a;b and b;a are considered duplicates, so both should be removed; the same rule applies to c;d and d;c, which should also be removed.

I tried to process the file twice and use the condition NR==FNR to tell which pass it is (first or second), but I can't figure out how to implement the test for the duplication rule I defined above.

Please help me.

CodePudding user response:

Would you please try the following:

awk -F';' '
NR==FNR {                                       # 1st pass
    if (seen[$1$2]++ || seen[$2$1]++) {         # if "ab" or "ba" has already been seen
        dupe[$1";"$2]++; dupe[$2";"$1]++        # then mark both "a;b" and "b;a" as duplicates
    }
    next
}
! dupe[$0]                                      # 2nd pass: print the line unless it was marked as a duplicate
' file file

Output:

b;c
x;y
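
A brief note on the answer above: the file name is given twice on purpose. During the first read NR (the global record counter) equals FNR (the per-file counter), so NR==FNR selects the first pass; during the second read NR keeps growing while FNR restarts at 1, so only the final pattern runs. The same filtering can also be done in a single pass by building a canonical key for each pair, so that a;b and b;a map to the same array index. The sketch below is only an illustrative alternative, not part of the answer: it assumes plain POSIX awk and the same semicolon-separated input, the variable names key, count, line and keyof are arbitrary, it preserves the original line order, and an exact repeat of the same line would also count as a duplicate pair.

awk -F';' '
{
    key = ($1 < $2) ? $1 FS $2 : $2 FS $1   # canonical key: identical for "a;b" and "b;a"
    count[key]++                            # how often this unordered pair occurs
    line[NR] = $0                           # remember the line and its position
    keyof[NR] = key
}
END {
    for (i = 1; i <= NR; i++)               # replay the input in its original order
        if (count[keyof[i]] == 1)           # keep only pairs that never had a mirror
            print line[i]
}' file

With the sample input this should print b;c and x;y, in input order.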

CodePudding user response:

$ awk -F';' '{ks[$0]; a[$2 FS $1]++} END{for(k in ks) if(!a[k]) print k}' file

x;y
b;c
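
This one-liner remembers every input line as an index of ks and, for each line, marks its mirrored form $2 FS $1 in a; the END block then prints only the lines whose mirror was never seen. Because for (k in ks) walks the array in an unspecified order, the output order may differ from the input order, as shown above. If a deterministic order is preferred and GNU awk is available, one possible tweak (a sketch, not part of the original answer) is to request sorted traversal through PROCINFO:

awk -F';' '
{ ks[$0]; a[$2 FS $1]++ }                   # remember the line; mark its mirrored pair
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"  # GNU awk only: iterate indices in sorted string order
    for (k in ks)
        if (!a[k]) print k                  # print lines whose mirror never appeared
}' file

With the sample input this should print b;c followed by x;y.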