Print sum of Nth column at the header of file with existing rows bash-CodePudding

I have an input file with billions of records and a header. Header consists of meta info, total number of rows and sum of the sixth column. I am splitting the file into small sizes, due to which my header record must be updated as the sum of sixth column and total rows is changed.

This is the sample record

filename: testFile.text

00|STMT|08-09-2022 13:24:56||5|13.10|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP

Expected:

filename: testFile_1.text

00|STMT|08-09-2022 13:24:56||3|6.10|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP

filename: testFile_2.text

00|STMT|08-09-2022 13:24:56||2|7.00|SHA2
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP

I am able to split the file and calculate the sum but unable to replace the value in header part. This is the script I have made

#!/bin/bash

splitRowCount=$1
transactionColumn=$2

filename=$(basename -- "$3")
extension="${filename##*.}"
nameWithoutExt="${filename%.*}"

echo "splitRowCount: $splitRowCount"
echo "transactionColumn: $transactionColumn"


awk 'NR == 1 { head = $0 } NR % '$splitRowCount' == 2 { filename = "'$nameWithoutExt'_" int((NR-1)/'$splitRowCount') 1 ".'$extension'"; print head > filename } NR != 1 { print >> filename }' $filename


 ls *.txt | while read line
 do
  firstLine=$(head -n 1 $line);
  awk -F '|'  'NR !=1 {sum  = '$transactionColumn'}END {print sum} '   $line
 done

CodePudding user response：

Here's an awk solution for splitting the original file into files of n records. The idea is to accumulate the records until the given count is reached then generate a file with the updated header and the accumulated records:

n=3
file=./testFile.text

awk -v numRecords="$n"  '
    BEGIN {
        FS = OFS = "|"

        if ( match(ARGV[1],/[^\/]\.[^\/]*$/) ) {
            filePrefix = substr(ARGV[1],1,RSTART)
            fileSuffix = substr(ARGV[1],RSTART 1)
        } else {
            filePrefix = ARGV[1]
            fileSuffix = ""
        }

        if (getline headerStr <= 0)
            exit 1
        split(headerStr, headerArr)
    }
    (NR-2) % numRecords == 0 && recordsCount {
        outfile = filePrefix "_"   filesCount fileSuffix
        print headerArr[1],headerArr[2],headerArr[3],headerArr[4],recordsCount,recordsSum,headerArr[7] > outfile
        printf("%s", records) > outfile
        close(outfile)
        
        records = ""
        recordsCount = recordsSum = 0
    }
    {
        records = records $0 ORS
        recordsCount  
        recordsSum  = $6
    }
    END {
        if (recordsCount) {
            outfile = filePrefix "_"   filesCount fileSuffix
            print headerArr[1],headerArr[2],headerArr[3],headerArr[4],recordsCount,recordsSum,headerArr[7] > outfile
            printf("%s", records) > outfile
            close(outfile)
        }
    }
' "$file"

With the given sample you'll get:

testFile_1.text

00|STMT|08-09-2022 13:24:56||3|6.1|SHA2
10|000047290|8ddcf4b2356dfa7f326ca8004a9bdb6096330fc4f3b842a971deaf660a395f65|18-01-2020|12:36:57|3.10|00004729018-01-20201|APP
10|000052736|cce280392023b23df2a00ace4b82db8eb61c112bb14509fb273c523550059317|07-02-2017|16:27:49|2.00|00005273607-02-20171|APP
10|000070355|f2e86d2731d32f9ce960a0f5883e9b688c7e57ab9c2ead86057f98426407d87a|17-07-2019|20:25:02|1.00|00007035517-07-20192|APP

testFile_2.text

00|STMT|08-09-2022 13:24:56||2|7|SHA2
10|000070355|54c1fc2667e160a11ae1dbf54d3ba993475cd33d6ececdd555fb5c07e64a241b|17-07-2019|20:25:02|5.00|00007035517-07-20192|APP
10|000072420|f5dac143082631a1693e0fb5429d3a185abcf3c47b091be2f30cd50b5cf4be11|14-06-2021|20:52:21|2.00|00007242014-06-20212|APP

CodePudding user response：

With your shown samples please try following awk code(Written and tested in GNU awk). Here I have defined awk variables named fileInitials which contains your output file's initial name eg: testFile then extension which contains output file's extension eg: .txt here. Then comes lines which will be your value on how many lines you want to have in a output file.

You need not to run shell awk code, this could be done in a single awk like shown following.

awk -v count="1" -v fileInitials="testFile" -v extension=".txt" -v lines="3" '
BEGIN { FS=OFS="|" }
FNR==1{
  match($0,/^([^|]*\|[^|]*\|[^|]*\|[^|]*\|[^|]*)\|[^|]*(.*)/,arr)
  header1=arr[1]
  header2=arr[2]
  outputFile=(fileInitials count extension)
  next
}
{
  if(prev!=count){
    print (header1,sum header2 ORS val) > (outputFile)
    close(outputFile)
    outputFile=(fileInitials count extension)
    sum=0
    val=""
  }
  sum =$6
  val=(val?val ORS:"") $0
  prev=count
  count=(  countline%lines==0?  count:count)
}
END{
  if(count && val){
     print (header1,sum header2 ORS val) > (outputFile)
     close(outputFile)
  }
}
'   Input_file