I've been reading through other similar questions. I have this working, but it is very slow due to the size of the CSV I'm working with. Are there ways to make this more efficient?
My goal: I have an incredibly large CSV (>100 GB). I would like to take all of the unique values in a column, extract 10% of these, and then use that 10% to subsample the original CSV.
What I'm doing:
- I'm pulling all unique values from column 11 and writing those to a text file.
cat File1.csv | cut -f11 -d , | sort | uniq > uniqueValues.txt
- Next, I'm sampling a random 10% of the values in "uniqueValues.txt".
cat uniqueValues.txt | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .10) print $0 }' > uniqueValues.10pct.txt
- Next, I'm pulling the rows in File1.csv which have column 11 matching values from "uniqueValues.10pct.txt".
awk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
As far as I can tell, this seems to be working. Does this seem reasonable? Any suggestions on how to improve the efficiency?
CodePudding user response:
Any suggestions on how to improve the efficiency?
Avoid sort in the 1st step: since the 2nd and 3rd steps do not care about order, you might do your whole 1st step using a single awk command as follows:
awk 'BEGIN{FS=","}!arr[$11] {print $11}' File1.csv > uniqueValues.txt
Explanation: I inform GNU AWK that the field separator (FS) is a comma, then for each line I do arr[$11]++ to get the number of times the value in the 11th column has been seen so far and use ! to negate it, so 0 becomes true whilst 1 and greater become false. If this holds true, I print the 11th column.
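For instance, here is a minimal sketch of the idiom on a few made-up rows (the sample data is hypothetical; only column 11 matters):
printf 'a,b,c,d,e,f,g,h,i,j,X,k\na,b,c,d,e,f,g,h,i,j,Y,k\na,b,c,d,e,f,g,h,i,j,X,k\n' |
awk 'BEGIN{FS=","} !arr[$11]++ {print $11}'
# prints X then Y; the repeated X row is skipped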
Please test the above against your 1st step on your data and then select the one which is faster.
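For example, a rough way to run that comparison (a sketch only: the output file names here are illustrative, and the <( ) process substitution assumes bash):
time (cut -f11 -d, File1.csv | sort | uniq > uniqueValues.sort.txt)
time (awk 'BEGIN{FS=","} !arr[$11]++ {print $11}' File1.csv > uniqueValues.awk.txt)
# both should yield the same set of values, just in a different order
diff <(sort uniqueValues.awk.txt) uniqueValues.sort.txt && echo "same values"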
As for the 3rd step, you might attempt using a non-GNU AWK if you are allowed to install tools on your machine. For example, the author of the article¹ Don’t MAWK AWK – the fastest and most elegant big data munging language! found nawk faster than GNU AWK and mawk faster than nawk. After installing, prepare test data and measure times for
gawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
nawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
mawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
then use the one which proves to be fastest.
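A sketch of how that measurement could look, assuming nawk and mawk are installed and using a truncated copy of the CSV as test data (the slice size is arbitrary):
head -n 10000000 File1.csv > test.csv   # arbitrary test slice
for cmd in gawk nawk mawk; do
    echo "== $cmd =="
    time "$cmd" -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt test.csv > /dev/null
done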
¹be warned that the values shown pertain to the versions available in September 2009; you might get different times with the versions available in June 2022.
CodePudding user response:
You might find this to be faster (untested since no sample input/output provided):
cut -f11 -d',' File1.csv |
sort -u > uniqueValues.txt
numUnq=$(wc -l < uniqueValues.txt)
shuf -n "$(( numUnq / 10 ))" uniqueValues.txt |
awk -F',' 'NR==FNR{vals[$1]; next} $11 in vals' - File1.csv
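As a quick sanity check (assuming you redirect the output of the pipeline above to File1_subsample.csv as in the question), the number of distinct column 11 values in the subsample should match the number of keys passed to shuf:
cut -f11 -d',' File1_subsample.csv | sort -u | wc -l   # distinct keys actually present
echo "$(( numUnq / 10 ))"                              # keys requested from shuf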
You could try replacing that first cut | sort; numUnq=$(wc...)
with
numUnq=$(awk -F',' '!seen[$11]++{print $11 > "uniqueValues.txt"; cnt++} END{print cnt+0}' File1.csv)
to see if that's any faster, but I doubt it since cut, sort, and wc are all very fast, while awk has to do regexp-based field splitting and store all $11 values in memory (which can get slow as the array size increases due to how dynamic array allocation works).
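For completeness, the awk-based variant plugged into the rest of the pipeline would look something like this (same caveat as above: untested, and the final redirect to File1_subsample.csv is just mirroring the question):
numUnq=$(awk -F',' '!seen[$11]++{print $11 > "uniqueValues.txt"; cnt++} END{print cnt+0}' File1.csv)
shuf -n "$(( numUnq / 10 ))" uniqueValues.txt |
awk -F',' 'NR==FNR{vals[$1]; next} $11 in vals' - File1.csv > File1_subsample.csv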