How can I quickly filter a huge CSV file based on a list of IDs from another file using bash?


I have two CSV files. The first one is just a list of IDs (around 300k rows). Let's call it ids.csv:

id
1
3
7
...

The second one is a list of objects with an id (around 2 million rows). Let's call it data.csv:

id,date,state
1,2022-01-01,true
4,2022-01-03,false
...

I would like to build a third CSV file containing the rows from data.csv whose IDs appear in ids.csv. Not every ID from ids.csv will necessarily be present in data.csv.

I tried something like this:

while IFS=, read -r line
do
    # Scans all of data.csv once per ID, i.e. ~300k full passes over a 2M-row file
    awk -F',' -v id="$line" '$1==id' data.csv >> out.csv
done < ids.csv

It works, but execution takes forever. I tried splitting it into several parallel streams, but the script is so slow by itself that this did not help much.

Can you suggest a better or more efficient way to filter data.csv?

CodePudding user response:

Probably the most common answer on this forum in one way or another:

awk -F',' 'NR==FNR{ids[$1]; next} $1 in ids' ids.csv data.csv > out.csv

That will be an order of magnitude faster than your existing script.

Regarding why your original script is slow, see "Why is using a shell loop to process text considered bad practice?".
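
For readers less familiar with awk, here is a commented sketch of the same idea. It additionally assumes both files have a header row, as in the samples above, and keeps the data.csv header in the output instead of relying on the "id" header lines matching each other:

awk -F',' '
    NR==FNR {              # first file on the command line: ids.csv
        if (FNR > 1) {     # skip its "id" header line
            ids[$1]        # remember each id as an array key
        }
        next
    }
    FNR==1 || ($1 in ids)  # data.csv: print its header and every matching row
' ids.csv data.csv > out.csv

The whole thing reads each file exactly once and does a hash lookup per row, which is why it is so much faster than re-scanning data.csv for every ID.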

CodePudding user response:

Using csvgrep from the handy csvkit bundle of utilities:

$ csvgrep -c id -f ids.csv data.csv
id,date,state
1,2022-01-01,true

-f FILENAME works like plain grep's -f option: if the value in the given column matches any of the lines in that file, the record is printed.
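
To produce the third file the question asks for, just redirect the output (this assumes csvkit is installed, typically via pip install csvkit):

# Write the filtered rows, header included, to out.csv
csvgrep -c id -f ids.csv data.csv > out.csv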
