I've been reading through other similar questions. I have this working, but it is very slow due to the size of the CSV I'm working with. Are there ways to make this more efficient?
My goal: I have an incredibly large CSV (>100 GB). I would like to take all of the unique values in a column, extract 10% of these, and then use that 10% to subsample the original CSV.
What I'm doing:
- I'm pulling all unique values from column 11 and writing those to a text file.
cat File1.csv | cut -f11 -d , | sort | uniq > uniqueValues.txt
- Next, I'm sampling a random 10% of the values in "uniqueValues.txt".
cat uniqueValues.txt | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .10) print $0 }' > uniqueValues.10pct.txt
- Next, I'm pulling the rows in File1.csv which have column 11 matching values from "uniqueValues.10pct.txt".
awk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
As far as I can tell, this seems to be working. Does this seem reasonable? Any suggestions on how to improve the efficiency?
CodePudding user response:
Any suggestions on how to improve the efficiency?
Avoid sort in the 1st step: since the 2nd and 3rd steps do not care about order, you might do your whole 1st step using a single awk command as follows:
awk 'BEGIN{FS=","}!arr[$11] {print $11}' File1.csv > uniqueValues.txt
Explanation: I inform GNU AWK that the field separator (FS) is a comma, then for each line I do arr[$11]++ to get the number of times the value in the 11th column has been seen so far and use ! to negate it, so 0 becomes true whilst 1 and greater become false. If this holds true, I print the 11th column.
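For instance, here is a minimal sketch of the idiom on a few made-up rows (the sample data is hypothetical; only column 11 matters):
printf 'a,b,c,d,e,f,g,h,i,j,X,k\na,b,c,d,e,f,g,h,i,j,Y,k\na,b,c,d,e,f,g,h,i,j,X,k\n' |
awk 'BEGIN{FS=","} !arr[$11]++ {print $11}'
# prints X then Y; the repeated X row is skipped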
Please test the above against your 1st step on your data and then select the one which is faster.
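For example, a rough way to run that comparison (a sketch only: the output file names here are illustrative, and the <( ) process substitution assumes bash):
time (cut -f11 -d, File1.csv | sort | uniq > uniqueValues.sort.txt)
time (awk 'BEGIN{FS=","} !arr[$11]++ {print $11}' File1.csv > uniqueValues.awk.txt)
# both should yield the same set of values, just in a different order
diff <(sort uniqueValues.awk.txt) uniqueValues.sort.txt && echo "same values"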
As for the 3rd step, you might attempt using a non-GNU AWK if you are allowed to install tools on your machine. For example, the author of the article¹ Don’t MAWK AWK – the fastest and most elegant big data munging language! found nawk faster than GNU AWK and mawk faster than nawk. After installing, prepare test data and measure times for
gawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
nawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
mawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
then use the one which proves to be fastest.
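A sketch of how that measurement could look, assuming nawk and mawk are installed and using a truncated copy of the CSV as test data (the slice size is arbitrary):
head -n 10000000 File1.csv > test.csv   # arbitrary test slice
for cmd in gawk nawk mawk; do
    echo "== $cmd =="
    time "$cmd" -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt test.csv > /dev/null
done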
¹be warned that the values shown pertain to the versions available in September 2009; you might get different times with the versions available in June 2022.
CodePudding user response:
You might find this to be faster (untested since no sample input/output provided):
cut -f11 -d',' File1.csv |
sort -u > uniqueValues.txt
numUnq=$(wc -l < uniqueValues.txt)
shuf -n "$(( numUnq / 10 ))" uniqueValues.txt |
awk -F',' 'NR==FNR{vals[$1]; next} $11 in vals' - File1.csv
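As a quick sanity check (assuming you redirect the output of the pipeline above to File1_subsample.csv as in the question), the number of distinct column 11 values in the subsample should match the number of keys passed to shuf:
cut -f11 -d',' File1_subsample.csv | sort -u | wc -l   # distinct keys actually present
echo "$(( numUnq / 10 ))"                              # keys requested from shuf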
You could try replacing that first cut | sort; numUnq=$(wc...)
with
numUnq=$(awk -F',' '!seen[$11]++{print $11 > "uniqueValues.txt"; cnt++} END{print cnt+0}' File1.csv)
to see if that's any faster, but I doubt it since cut, sort, and wc are all very fast, while awk has to do regexp-based field splitting and store all $11 values in memory (which can get slow as the array size increases due to how dynamic array allocation works).
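For completeness, the awk-based variant plugged into the rest of the pipeline would look something like this (same caveat as above: untested, and the final redirect to File1_subsample.csv is just mirroring the question):
numUnq=$(awk -F',' '!seen[$11]++{print $11 > "uniqueValues.txt"; cnt++} END{print cnt+0}' File1.csv)
shuf -n "$(( numUnq / 10 ))" uniqueValues.txt |
awk -F',' 'NR==FNR{vals[$1]; next} $11 in vals' - File1.csv > File1_subsample.csv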