I have two CSV files. The first one is just a list of IDs (around 300k rows). Let's call it ids.csv:
id
1
3
7
...
The second one is a list of objects with an id (around 2 million rows). Let's call it data.csv:
id,date,state
1,2022-01-01,true
4,2022-01-03,false
...
I would like to build a third CSV file containing the rows from data.csv whose IDs appear in ids.csv. Not every ID from ids.csv will necessarily be present in data.csv.
I tried something like this:
while IFS=, read -r line
do
  awk -F',' -v id="$line" '$1==id' data.csv >> out.csv
done < ids.csv
It works, but execution takes something like forever. I tried splitting the work into several parallel streams, but the script is so slow on its own that it did not help much.
Can you suggest a better or more efficient way to filter data.csv?
CodePudding user response:
Probably the most common answer on this forum, in one form or another:
awk -F',' 'NR==FNR{ids[$1]; next} $1 in ids' ids.csv data.csv > out.csv
That will be an order of magnitude faster than your existing script.
Regarding why your original script is slow, see why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
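If the NR==FNR idiom is new to you, here is the same program spelled out with comments; save it as, say, filter.awk (the file name is just for illustration):

# While awk reads the first file (ids.csv), NR (the global record counter)
# equals FNR (the per-file record counter): store each id as a key in the
# ids array and skip the rest of the program for that line.
NR == FNR { ids[$1]; next }
# For each record of the second file (data.csv), this bare condition uses
# awk's default action, print, whenever field 1 is a key in ids. The header
# also passes through, because both files start with an "id" header line.
$1 in ids

Run it with the field separator set and the two input files in the same order:

awk -F',' -f filter.awk ids.csv data.csv > out.csv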
CodePudding user response:
Using csvgrep from the handy csvkit bundle of utilities:
$ csvgrep -c id -f ids.csv data.csv
id,date,state
1,2022-01-01,true
-f filename is like plain grep's -f argument; if any of the lines in it are a match for the given column, that record is printed.
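To build the third file as in the question, just redirect the output (same invocation as above):

$ csvgrep -c id -f ids.csv data.csv > out.csv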