I wish to calculate the standard deviation for a range of files named "res_NUMBER.cs", each formatted as a CSV. Example data:
1,M,CA,54.9130
1,M,CA,54.9531
1,M,CA,54.8845
1,M,CA,54.7517
1,M,CA,54.8425
1,M,CA,55.2648
1,M,CA,55.0876
I have calculated the mean using
#!/bin/bash
files=`ls res*.cs`
for f in $files; do
echo "$f"
echo " "
#Count number of lines N
lines=`cat $f | wc -l`
#Sum Total of the fourth column
sum=`cat $f | awk -F "," '{print $4}' | paste -sd+ - | bc`
#Mean
mean=`echo "scale=5 ; $sum / $lines" | bc`
echo "$mean"
echo " "
done
I would like to calculate the standard deviation for each file. I understand that the standard deviation formula is
S.D. = sqrt((1/N) * sum of (value - mean)^2)
but I am unsure how to implement this in my script.
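I imagine it would need a second pass over the file in the same awk/bc style, something like the following placed just before the done, but I have not verified it:
#Sum of squared deviations from the mean
sq=`cat $f | awk -F "," -v m="$mean" '{d=$4-m; s+=d*d} END{printf "%.10f", s}'`
#Standard deviation (population form, matching the formula above)
sd=`echo "scale=5 ; sqrt($sq / $lines)" | bc`
echo "$sd"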
CodePudding user response:
I would recommend using Miller for this task.
This computes the mean of each file, then the standard deviation of those per-file means:
for f in res*.cs
do
mlr --csv -N stats1 -f '4' -a 'mean' "$f"
done |
mlr --csv -N stats1 -f '1' -a 'stddev'
If instead you want the standard deviation over the records of all the files combined, it would be:
mlr --csv -N stats1 -f '4' -a 'stddev' res*.cs
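If what you are after is the mean and standard deviation reported per file, as in the question, stats1 also accepts a comma-separated list of accumulators; a sketch (note that Miller's stddev accumulator is the sample, n-1, form rather than the population form in the question's formula):
for f in res*.cs
do
printf '%s,' "$f"
mlr --csv -N stats1 -f '4' -a 'mean,stddev' "$f"
done
The printf prefixes each output line with the filename, and the two accumulator values follow on the same line.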
CodePudding user response:
awk is powerful enough to calculate the mean of one file easily:
$ awk -F, '{sum+=$4} END{print sum/NR}' file
to add the standard deviation (note that your formula is the population form, not the sample form; that is what I replicate here):
$ awk -F, '{sum+=$4; ss+=$4^2} END{print m=sum/NR, sqrt(ss/NR-m^2)}' file
54.9567 0.15778
this uses the identity stddev = sqrt(Var(x)) = sqrt(E(x^2) - E(x)^2), which has worse numerical accuracy (it squares the raw values rather than their differences from the mean) but works fine when the values stay within a modest range; a numerically stabler one-pass alternative is sketched after the loop below.
The simplest way to handle all the files is then to run this in a for loop:
for f in res*.cs
do
awk -F, '{sum+=$4; ss+=$4^2}
END {print FILENAME;
print "mean:", m=sum/NR, "stddev:", sqrt(ss/NR-m^2)}' "$f"
done
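If the accuracy of the E(x^2) - E(x)^2 shortcut is a concern, a one-pass running update (Welford's method) accumulates squared deviations from the current mean instead; a sketch for a single file, keeping the same population (divide by NR) convention:
awk -F, '{
delta = $4 - mean             # deviation from the running mean
mean += delta / NR            # update the running mean
m2   += delta * ($4 - mean)   # accumulate squared deviations
}
END{print "mean:", mean, "stddev:", sqrt(m2/NR)}' file
This can be dropped into the same for loop in place of the one-liner if the values ever get large enough for the squared sums to lose precision.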