I'm working on looping over hundreds of thousands of CSV files to generate more files from them. The requirement is to extract the previous 1 month, 3 months, 6 months, 1 year & 2 years of data from every file & generate new files from them.
I've written the below script which gets the job done but is super slow. This script will need to be run quite frequently, which makes my life cumbersome. Is there a better way to achieve the outcome I'm after, or possibly enhance the performance of this script please?
for k in *.csv; do
    sed -n '/'"$(date -d "2 year ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.2years.csv"
    sed -n '/'"$(date -d "1 year ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.1year.csv"
    sed -n '/'"$(date -d "6 month ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.6months.csv"
    sed -n '/'"$(date -d "3 month ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.3months.csv"
    sed -n '/'"$(date -d "1 month ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.1month.csv"
done
CodePudding user response:
You read each CSV five times. It would be better to read each CSV only once.
You extract the same data multiple times; all but one of the extractions are subsets of the others:
- 1 month ago is a subset of 3 months ago, 6 months ago, 1 year ago and 2 years ago.
- 3 months ago is a subset of 6 months ago, 1 year ago and 2 years ago.
- 6 months ago is a subset of 1 year ago and 2 years ago.
- 1 year ago is a subset of 2 years ago.
This means every line in "1year.csv" is also in "2years.csv", so it is sufficient to extract "1year.csv" from "2years.csv". You can cascade the different extractions with tee.
The following assumes that the contents of your files are ordered chronologically. (I simplified the quoting a bit.)
for k in *.csv; do
    sed -n "/$(date -d '2 year ago' '+%Y-%m')/,\$p" "${k}" |
    tee "temp_data_store/${k}.2years.csv" |
    sed -n "/$(date -d '1 year ago' '+%Y-%m')/,\$p" |
    tee "temp_data_store/${k}.1year.csv" |
    sed -n "/$(date -d '6 month ago' '+%Y-%m')/,\$p" |
    tee "temp_data_store/${k}.6months.csv" |
    sed -n "/$(date -d '3 month ago' '+%Y-%m')/,\$p" |
    tee "temp_data_store/${k}.3months.csv" |
    sed -n "/$(date -d '1 month ago' '+%Y-%m')/,\$p" > "temp_data_store/${k}.1month.csv"
done
CodePudding user response:
Current performance-related issues:
- reading each input file 5x times => we want to limit this to a single read per input file
- calling date 5x times (necessary) for each input file (unnecessary) => make the 5x date calls once, prior to the for k in *.csv loop [NOTE: the overhead for repeated date calls will pale in comparison to the repeated reads of the input files]
Assumptions:
- input files are comma-delimited (otherwise the proposed solution needs adjusting)
- the date to be tested is of the format YYYY-MM-...
- the date to be tested is the 1st field of the comma-delimited input file (otherwise the proposed solution needs adjusting)
- the output filename prefix is the input filename sans the .csv extension
Sample input:
$ cat input.csv
2019-09-01,line 1
2019-10-01,line 2
2019-10-03,line 3
2019-12-01,line 4
2020-05-01,line 5
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
We only need to do the date calculations once, so we'll do this in bash prior to OP's for k in *.csv loop:
# date as of writing this answer: 2021-10-12
$ yr2=$(date -d "2 year ago" '+%Y-%m')
$ yr1=$(date -d "1 year ago" '+%Y-%m')
$ mon6=$(date -d "6 month ago" '+%Y-%m')
$ mon3=$(date -d "3 month ago" '+%Y-%m')
$ mon1=$(date -d "1 month ago" '+%Y-%m')
$ typeset -p yr2 yr1 mon6 mon3 mon1
declare -- yr2="2019-10"
declare -- yr1="2020-10"
declare -- mon6="2021-04"
declare -- mon3="2021-07"
declare -- mon1="2021-09"
One awk idea (replaces all of the sed calls in OP's current for k in *.csv loop):
# determine prefix to be used for output files ...
$ k=input.csv
$ prefix="${k%.csv}"          # strip the .csv suffix (suffix-only, unlike ${k//.csv/})
$ echo "${prefix}"
input
awk -v yr2="${yr2}" \
-v yr1="${yr1}" \
-v mon6="${mon6}" \
-v mon3="${mon3}" \
-v mon1="${mon1}" \
-v prefix="${prefix}" \
-F ',' ' # define input field delimiter as comma
{ split($1,arr,"-") # date to be compared is in field #1
testdate=arr[1] "-" arr[2]
      if ( testdate >= yr2 )  print $0 > (prefix ".2years.csv")
      if ( testdate >= yr1 )  print $0 > (prefix ".1year.csv")
      if ( testdate >= mon6 ) print $0 > (prefix ".6months.csv")
      if ( testdate >= mon3 ) print $0 > (prefix ".3months.csv")
      if ( testdate >= mon1 ) print $0 > (prefix ".1month.csv")
}
' "${k}"
NOTE: awk can dynamically process the input filename to determine the filename prefix (see the FILENAME variable) but would still need to know the target directory name (assuming the output is written to a different directory from where the input file resides).
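A minimal sketch of that FILENAME idea, assuming the temp_data_store target directory from the question; the demo script just copies each line through, where the real version would apply the date tests above:

```shell
mkdir -p temp_data_store
printf '2021-09-01,line 13\n2021-10-11,line 15\n' > sample.csv

awk -v outdir="temp_data_store" -F ',' '
  FNR == 1 {                       # runs once per input file
      prefix = FILENAME            # e.g. "sample.csv"
      sub(/.*\//, "", prefix)      # drop any leading directory part
      sub(/\.csv$/, "", prefix)    # strip the .csv suffix
  }
  { print $0 > (outdir "/" prefix ".demo.csv") }   # demo: pass every line through
' sample.csv
```

With this, the per-file prefix handling in bash goes away entirely; only the output directory still has to be passed in via -v.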
This generates the following files:
for f in "${prefix}".*.csv
do
echo "############# ${f}"
cat "${f}"
echo ""
done
############# input.2years.csv
2019-10-01,line 2
2019-10-03,line 3
2019-12-01,line 4
2020-05-01,line 5
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.1year.csv
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.6months.csv
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.3months.csv
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.1month.csv
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
Additional performance improvements:
- [especially for largish files] read from one filesystem, write to a 2nd/different filesystem; even better would be a separate filesystem for each of the 5x different output files - would require a minor tweak to the awk solution
- process X number of input files in parallel, eg, the awk code could be placed in a bash function and then called via <function_name> <input_file> &; this can be done via bash loop controls, parallel, xargs, etc
- if running parallel operations, limit the number of parallel operations based primarily on disk subsystem throughput, ie, how many concurrent reads/writes the disk subsystem can handle before slowing down due to read/write contention
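A sketch of the parallel variant using plain background jobs and wait; process_one is a hypothetical stand-in for the awk extraction above (here it just counts lines so the example is self-contained):

```shell
# Hypothetical per-file worker; the real version would run the awk extraction.
process_one() {
    wc -l < "$1" | tr -d ' \n' > "temp_data_store/$(basename "$1").count"
}

mkdir -p temp_data_store
printf 'a\nb\n' > p1.csv
printf 'c\n'    > p2.csv

for f in p1.csv p2.csv; do
    process_one "$f" &     # one background job per input file
done
wait                       # block until all jobs have finished
```

For hundreds of thousands of files you would not launch one job per file unbounded; cap concurrency (eg, xargs -P N or parallel -j N) at whatever the disk subsystem sustains without contention.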