Comparing two csv files with different lengths but output only the line where it matches the same value

Time:10-21

I've been trying to compare two CSV files using a simple shell script, but the code I'm using isn't doing its job. What I want to do is compare the two files using column 6 from first.csv and column 2 from second.csv, and when they match, output the matching line from first.csv. See the example below.

first.csv

1,0,3210820,0,536,7855712
1,0,3523340820,0,36,53712
1,0,321023423i420,0,336,0255712
1,0,321082234324,0,66324,027312

second.csv

14,7855712,Whie,Black
124,7855712,Green,Black
174,1197,Black,Orange
1284,98132197,Yellow,purple
35384,9811123197,purple,purple
13354,0981123131197,green,green
183434,0811912313127,white,green

Output should be from the first file:

1,0,3210820,0,536,7855712

I've been using the code below.

cat first.csv | while read line    
do    
  cat second.csv | grep $line > output_file    
done

Please help. Thank you.

CodePudding user response:

Your question is not entirely clear, but here is what I think you want:

cat first.csv | while read LINE; do
    VAL=$(echo "$LINE" | cut -d, -f6)
    grep -q "$VAL" second.csv && echo "$LINE"
done

The first line in the loop extracts the 6th field from the line and stores it in VAL. The next line checks quietly (-q) whether VAL occurs anywhere in second.csv and, if so, prints the line.

Note that grep will check for any occurrence in second.csv, not only in field 2. To check only against field 2, change it to:

    cut -d, -f2 second.csv | grep -q "$VAL" && echo "$LINE"
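One caveat: even after cutting out field 2, a plain grep still does a substring match, so a value like 1197 would also match 98132197. Adding -x forces grep to match the whole line, which makes the field-2 comparison exact. A sketch using the sample data from the question:

```shell
# Sample data from the question
cat > first.csv <<'EOF'
1,0,3210820,0,536,7855712
1,0,3523340820,0,36,53712
1,0,321023423i420,0,336,0255712
1,0,321082234324,0,66324,027312
EOF
cat > second.csv <<'EOF'
14,7855712,Whie,Black
124,7855712,Green,Black
174,1197,Black,Orange
EOF
# -x makes grep match the whole line, so after cutting field 2 out of
# second.csv, "1197" cannot match "98132197" or vice versa.
while IFS= read -r line; do
    val=$(printf '%s\n' "$line" | cut -d, -f6)
    cut -d, -f2 second.csv | grep -qx "$val" && printf '%s\n' "$line"
done < first.csv
```

With the sample data this prints only `1,0,3210820,0,536,7855712`, the one line of first.csv whose field 6 equals some field 2 of second.csv.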

Unrelated to your question: tasks like this can be solved much more efficiently in a language like Python, since the loop above rescans second.csv once for every line of first.csv.

CodePudding user response:

Well... If you have bash with process substitution, you can treat all the 2nd fields of second.csv (each with a $ appended to anchor the match at the end of the line) as a file of patterns. Then grep -f matches the 2nd column of second.csv against the end of each line in first.csv, doing what you intend.

You can use the <(process) form to present the 2nd fields as a file:

grep -f <(awk -F, '{print $2"$"}' second.csv) first.csv

Example Output

With the data you show in first.csv and second.csv you get:

1,0,3210820,0,536,7855712

Adding the "$" anchor as part of the 2nd field from second.csv should satisfy the match only in the 6th field (end of line) in first.csv.

The benefit here is that there is but a single call to grep and awk, rather than an additional subshell spawned per iteration. It doesn't matter with small files like your sample input, but with millions of lines we are talking hours (or days) of difference in processing time.
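Another single-pass alternative, if you want exact string equality on the fields rather than regex matching: awk can load every field 2 of second.csv into an array and then test field 6 of each first.csv line for membership. A sketch using the sample data from the question:

```shell
# Sample data from the question
cat > first.csv <<'EOF'
1,0,3210820,0,536,7855712
1,0,3523340820,0,36,53712
1,0,321023423i420,0,336,0255712
1,0,321082234324,0,66324,027312
EOF
cat > second.csv <<'EOF'
14,7855712,Whie,Black
174,1197,Black,Orange
1284,98132197,Yellow,purple
EOF
# First pass (NR==FNR is true only while reading second.csv): record each
# field 2 as an array key. Second pass: print every first.csv line whose
# field 6 is a recorded key. Comparison is exact string equality.
awk -F, 'NR==FNR { seen[$2]; next } $6 in seen' second.csv first.csv
```

Because awk compares whole strings, this also avoids the regex pitfalls of grep (no substring matches, and no metacharacters being interpreted in the data).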

CodePudding user response:

rq (https://github.com/fuyuncat/rquery/releases) can do this perfectly.

It can join and compare the contents of multiple files, much like SQL.

[ rquery]$ ./rq -q "p d/,/ | m @2 where @fileid=1 | s @raw | f @fileid=2 and @6=@r[1][1]" samples/second.csv samples/first.csv
1,0,3210820,0,536,7855712