I need to make an awk script that allows me to calculate the standard deviation and the mean of the variable "Population" by Continent. There are different continents.
My script is as follows:
BEGIN {
    FS = ","
    Continent["Europe"];Continent["Africa"];Continent["Asia"];Continent["Latin America and the Caribbean"];Continent["Oceania"]
}
FNR>1 {
    if ($4!="" && $11!="") {
        found
        n[$4]
        wx[$4] = $3
        wxx[$4] = $3 * $3
    }
}
END {
    print "Continent,Mean,Deviation"
    for (i in Continent) {
        if (n[i] > 0) {
            avg[i] = wx[i] / n[i]
            var = wxx[i] / n[i] - avg[i] * avg[i]
            if (var >= 0)
                std[i] = sqrt(var)
            else
                std[i] = 0
            printf ("%s,%.2f,%.2f%\n", Continent,avg[i],std[i])
        }
    }
}
A sample of my dataset:
Country,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Total Cases per 1 Mil.pop,Total Deaths per 1 Mil.pop,Death percentage,Survival Percentage,No infected Percentage
Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.31,0.42,99.56
Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.28,9.41,90.47
Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.59,0.57,99.41
My desired output:
Continent Mean Deviation
Africa 42847108 3298802049
Asia 1938293848 23984033
Europe 190319838 12020492
However, when I run the code as gawk -f script.awk dataset.csv, I get the error:
fatal: attempt to use array `Continent' in a scalar context
How can this be solved?
CodePudding user response:
GNU AWK cannot print an array as a whole, which leads to
fatal: attempt to use array `Continent' in a scalar context
You need to use the array key currently being processed, which is i in your case. That is, replace
printf ("%s,%.2f,%.2f%\n", Continent,avg[i],std[i])
with
printf ("%s,%.2f,%.2f%\n",i,avg[i],std[i])
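To see the difference in isolation, here is a minimal sketch, assuming any POSIX-style awk on the PATH. The function name list_continents is made up for illustration; the loop prints the key i, which is a scalar, rather than the array itself.

```shell
# Minimal reproduction: iterate over the keys of an awk array.
list_continents() {
    awk 'BEGIN {
        # Referencing an element is enough to create its key.
        Continent["Europe"]; Continent["Africa"]; Continent["Asia"]
        for (i in Continent)
            printf("%s\n", i)   # OK: i is a scalar key
        # Passing Continent itself to printf here is what triggers the fatal error.
    }' | sort   # for-in order is unspecified, so sort for stable output
}

list_continents
```

The order in which for (i in Continent) visits keys is not guaranteed, hence the sort.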
Regarding your desired output: if you want fixed-width columns, you can put a number after the % sign to right-justify, or - followed by a number after % to left-justify. Consider the following simple example. Let the content of file.txt be
Able 1000
Baker 150
Charlie 200
and say you want to turn it into fixed-width format, then you might do
awk '{printf "%-10s%7.2f\n", $1, $2}' file.txt
and get output
Able      1000.00
Baker      150.00
Charlie    200.00
If you want to know more, consult "Modifiers for printf Formats" in the GNU Awk manual.
(tested in GNU Awk 5.0.1)
CodePudding user response:
It's important to use singular names for scalars and plural names for arrays. Your array should be named Continents[] (plural), as it contains multiple continent names. Then for (i in Continent) becomes for (Continent in Continents), and the rest becomes obvious.
Try this (untested):
BEGIN {
    FS = ","
    split("Europe,Africa,Asia,Latin America and the Caribbean,Oceania", tmp, ",")
    for (i in tmp) {
        Continent = tmp[i]
        Continents[Continent]
    }
}
FNR>1 {
    Continent = $4
    if (Continent != "" && $11 != "") {
        cnts[Continent]++               # number of rows seen for this continent
        wxs[Continent] += $3            # running sum of Population
        wxxs[Continent] += $3 * $3      # running sum of Population squared
    }
}
END {
    print "Continent,Mean,Deviation"
    for (Continent in cnts) {
        cnt = cnts[Continent]
        wx = wxs[Continent]
        wxx = wxxs[Continent]
        avg = wx / cnt
        var = wxx / cnt - avg * avg
        std = (var >= 0 ? sqrt(var) : 0)
        printf ("%s,%.2f,%.2f%%\n", Continent, avg, std)
    }
}
You don't need the cnt, wx, and wxx variables, but I'm just showing the clarity and simplicity you get if you use plurals for array names and singulars for array content scalars.
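Since the script above is untested, here is a small self-contained sketch of the same running-sum approach on toy numbers rather than the real dataset. The function name group_stats, the column layout (name,value,group) and the input rows are all made up for illustration. It computes the population standard deviation (dividing by n, not n-1) as sqrt(E[x^2] - (E[x])^2).

```shell
# Toy check of mean and population standard deviation per group.
# Input columns (hypothetical): name,value,group
group_stats() {
    awk -F, '
    NR > 1 && $3 != "" {
        n[$3]++                 # row count per group
        sum[$3] += $2           # running sum of values
        sumsq[$3] += $2 * $2    # running sum of squared values
    }
    END {
        for (g in n) {
            mean = sum[g] / n[g]
            var = sumsq[g] / n[g] - mean * mean   # E[x^2] - (E[x])^2
            sd = (var > 0 ? sqrt(var) : 0)        # guard against tiny negative rounding
            printf("%s,%.2f,%.2f\n", g, mean, sd)
        }
    }' "$@" | sort
}

printf 'name,value,group\na,2,North\nb,4,North\nc,10,South\n' | group_stats
```

For the toy input this should print North,3.00,1.00 and South,10.00,0.00. With very large values such as country populations, the sum of squares can lose floating-point precision, so treat the result as approximate.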