I have the following dataset, and I would like to implement, in an awk file, an iteration (awk or for) that checks it line by line, so that after executing it in the following way:
gawk -f file.awk dataset.csv
it gives me a file with the records without duplicates and with the float in the last column rounded to two decimals. Below I attach a sample of my dataset; as you can see, there should be only one record per country.
Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.313743132
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.275057509
Algeria,Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.587215976
Andorra,Andorra,AND,77481,Europe,40024,153,516565,1975,0.382270638
Angola,Angola,AGO,34654212,Africa,99194,1900,2862,55,1.915438434
Anguilla,Anguilla,AIA,15237,Latin America and the Caribbean,2700,9,177200,591,0.333333333
Antigua and Barbuda,Antigua and Barbuda,ATG,99348,Latin America and the Caribbean,7493,135,75422,1359,1.801681569
Argentina,Argentina,ARG,45921761,Latin America and the Caribbean,9041124,128065,196881,2789,1.416472111
Armenia,Armenia,ARM,2972939,Asia,422574,8617,142140,2898,2.039169471
Since my level is not advanced, I don't mind if the code is long, as it will help me familiarise myself with the steps the code goes through.
awk '{a[$1]++}END{for (i in a)if (a[i]>1)print i;}' file
I found that this command could help with such functionality, but it is a shell one-liner and not an awk script.
Finally, as guidance, the following would be the desired output for the sample posted (there are no duplicates in my sample):
Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.31
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.27
Algeria,Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.58
Andorra,Andorra,AND,77481,Europe,40024,153,516565,1975,0.38
Angola,Angola,AGO,34654212,Africa,99194,1900,2862,55,1.91
Anguilla,Anguilla,AIA,15237,Latin America and the Caribbean,2700,9,177200,591,0.33
Antigua and Barbuda,Antigua and Barbuda,ATG,99348,Latin America and the Caribbean,7493,135,75422,1359,1.80
Argentina,Argentina,ARG,45921761,Latin America and the Caribbean,9041124,128065,196881,2789,1.41
Armenia,Armenia,ARM,2972939,Asia,422574,8617,142140,2898,2.03
Thank you in advance for your help.
CodePudding user response:
Your original code:

awk '{a[$1]++}END{for (i in a)if (a[i]>1)print i;}' file

has the test inverted: a[i]>1 should be a[i]==1 to only print unique lines.
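With that change (and keying on the whole line, since you want whole records compared), the corrected one-liner would look like this, as a sketch with file standing in for your dataset:

awk '{ a[$0]++ } END { for (i in a) if (a[i] == 1) print i }' file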
Two ways to reduce n to 2 decimal places are:

n = substr(n, 1, match(n, /[.]/) + 2)   # truncates
n = sprintf("%0.2f", n)                 # rounds

Note that your desired output truncates rather than rounds (2.587215976 becomes 2.58, not 2.59), so the substr form is the one that matches your sample.
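To see the difference between the two, here is a quick check you can run at a shell prompt against a value from your sample:

echo 2.587215976 | gawk '{ print substr($1, 1, match($1, /[.]/) + 2), sprintf("%0.2f", $1) }'

which prints 2.58 2.59, showing the one-digit difference between the two approaches.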
So your script could be:

BEGIN { FS = OFS = "," }                          # delimit columns by comma
                                                  # (the csv must not have embedded commas)
NR == 1 { print; next }                           # print the header unchanged
{ $10 = substr($10, 1, match($10, /[.]/) + 2) }   # truncate column 10 to 2 decimals;
                                                  # assigning to $10 rebuilds $0 using OFS
{ a[$0]++ }                                       # count occurrences; keying on $0 means
                                                  # the entire line must match to count
                                                  # as a duplicate
END { for (i in a) if (a[i] == 1) print i }       # print only the unique lines
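Two caveats about that script: for (i in a) visits the saved lines in an unspecified order, so the output may not preserve the input order, and a[i]==1 drops every copy of a duplicated line rather than keeping one. If what you actually want is one record per line, keeping the first occurrence in input order, a common awk idiom is this variant (a sketch under the same no-embedded-commas assumption):

BEGIN { FS = OFS = "," }                          # delimit columns by comma
NR == 1 { print; next }                           # print the header unchanged
{ $10 = substr($10, 1, match($10, /[.]/) + 2) }   # truncate column 10 to 2 decimals
!seen[$0]++                                       # true only the first time a line is
                                                  # seen, so the default action prints
                                                  # each distinct line once, in order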
Given your comment about data cleansing, it would probably be better to use a two-pass approach: use your original code to alert you to erroneous input, then truncate in a separate pass.
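A minimal sketch of that two-pass idea, assuming the file name from your question:

# pass 1: report any country that appears more than once, so you can clean the data
gawk -F, 'NR > 1 { a[$1]++ } END { for (i in a) if (a[i] > 1) print i }' dataset.csv

# pass 2: once the input is clean, truncate column 10 and print every line
gawk 'BEGIN { FS = OFS = "," } NR == 1 { print; next } { $10 = substr($10, 1, match($10, /[.]/) + 2) } 1' dataset.csv

The trailing 1 in the second pass is a pattern that is always true, so awk applies its default action of printing the (rebuilt) line.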
Note that if only a single column changes, a line no longer counts as a duplicate and will pass through as apparently valid input. These two lines are different:
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.313743132
Afghanistan,Afghanistan,AFG,40462106,Asia,177827,7671,4395,190,4.313743132
I suppose you wanted to detect this, so your sanity checks need to be more sophisticated.
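One way to make them more sophisticated is to key on the country but remember the whole line, and warn whenever the same country reappears with different data, for example (a sketch; the message and its destination are placeholders):

gawk -F, 'NR > 1 && ($1 in seen) && seen[$1] != $0 { print "conflicting rows for " $1 > "/dev/stderr" }
          NR > 1 { seen[$1] = $0 }' dataset.csv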