I wish to calculate the standard deviation for a range of files named "res_NUMBER.cs", each formatted as a CSV. Example data:
1,M,CA,54.9130
1,M,CA,54.9531
1,M,CA,54.8845
1,M,CA,54.7517
1,M,CA,54.8425
1,M,CA,55.2648
1,M,CA,55.0876
I have calculated the mean using
#!/bin/bash
files=`ls res*.cs`
for f in $files; do
echo "$f"
echo " "
#Count number of lines N
lines=`cat $f | wc -l`
#Sum Total of the fourth column
sum=`cat $f | awk -F "," '{print $4}' | paste -sd+ - | bc`
#Mean
mean=`echo "scale=5 ; $sum / $lines" | bc`
echo "$mean"
echo " "
done
I would like to calculate the standard deviation for each file. I understand that the standard deviation formula is
S.D. = sqrt((1/N) * sum of (value - mean)^2)
but I am unsure how to implement this in my script.
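I imagine it would need a second pass over the file in the same awk/bc style, something like the following placed just before the done, but I have not verified it:
#Sum of squared deviations from the mean
sq=`cat $f | awk -F "," -v m="$mean" '{d=$4-m; s+=d*d} END{printf "%.10f", s}'`
#Standard deviation (population form, matching the formula above)
sd=`echo "scale=5 ; sqrt($sq / $lines)" | bc`
echo "$sd"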
CodePudding user response:
I would recommend using Miller for this task.
This computes the mean of each file, then the standard deviation of those per-file means:
for f in res*.cs
do
mlr --csv -N stats1 -f '4' -a 'mean' "$f"
done |
mlr --csv -N stats1 -f '1' -a 'stddev'
If instead you want the standard deviation over the records of all the files combined, it would be:
mlr --csv -N stats1 -f '4' -a 'stddev' res*.cs
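If what you are after is the mean and standard deviation reported per file, as in the question, stats1 also accepts a comma-separated list of accumulators; a sketch (note that Miller's stddev accumulator is the sample, n-1, form rather than the population form in the question's formula):
for f in res*.cs
do
printf '%s,' "$f"
mlr --csv -N stats1 -f '4' -a 'mean,stddev' "$f"
done
The printf prefixes each output line with the filename, and the two accumulator values follow on the same line.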
CodePudding user response:
awk is powerful enough to calculate the mean of one file easily:
$ awk -F, '{sum+=$4} END{print sum/NR}' file
to add the standard deviation (note that your formula is the population form, not the sample form; that is what I replicate here):
$ awk -F, '{sum+=$4; ss+=$4^2} END{print m=sum/NR, sqrt(ss/NR-m^2)}' file
54.9567 0.15778
this uses the identity stddev = sqrt(Var(x)) = sqrt(E(x^2) - E(x)^2), which has worse numerical accuracy (it squares the raw values rather than their differences from the mean) but works fine when the values stay within a modest range; a numerically stabler one-pass alternative is sketched after the loop below.
The simplest way to handle all the files is then to run this in a for loop:
for f in res*.cs
do
awk -F, '{sum+=$4; ss+=$4^2}
END {print FILENAME;
print "mean:", m=sum/NR, "stddev:", sqrt(ss/NR-m^2)}' "$f"
done
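If the accuracy of the E(x^2) - E(x)^2 shortcut is a concern, a one-pass running update (Welford's method) accumulates squared deviations from the current mean instead; a sketch for a single file, keeping the same population (divide by NR) convention:
awk -F, '{
delta = $4 - mean             # deviation from the running mean
mean += delta / NR            # update the running mean
m2   += delta * ($4 - mean)   # accumulate squared deviations
}
END{print "mean:", mean, "stddev:", sqrt(m2/NR)}' file
This can be dropped into the same for loop in place of the one-liner if the values ever get large enough for the squared sums to lose precision.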