Home > Blockchain >  How to discard row based on a common column of two files using Linux cmd?
How to discard row based on a common column of two files using Linux cmd?

Time:10-29

file1.csv

col1,col2,col3
a,b,c
b,v,n
x,u,v
t,m,m

file2.csv

col1,col2,col3
p,m,n
a,z,i

col1 acts as a primary key in both files. If any value of col1 of file2.csv occurs in file1.csv, then this row will be discarded from file1.csv.

output file:

col1,col2,col3
b,v,n
x,u,v
t,m,m

Note: I'm interested in Unix based solution. please provide a solution sort or uniq cmd or anything else.

CodePudding user response:

There's a join command on Unix that does pretty much exactly what you want:

join -v1 -t, \
  <(tail  2 file1.txt | sort -k1 -t,) \
  <(tail  2 file2.txt | sort -k1 -t,)

For the sample files you've given, this is its output:

b,v,n
t,m,m
x,u,v

Command Breakdown

  • join -v1 -t,
    • -v1: display rows from the 1st file that are not pairable with lines in the 2nd file via the join column (default column used for joining is 1, but it's overridable via the -1 and -2 options)
    • -t,: use comma as a field/column separator
  • <(tail 2 file1.txt | sort -k1 -t,)
    • <( … ): the join command expects file names as arguments, so we use process substitution to create such temporary files from the output of the nested commands
    • tail 2 file1.txt: skip the header line
    • sort -k1 -t,: the join command expects sorted files
      • -t,: use comma as a field/column separator
      • -k1: sort by the 1st field
  • Related