create an awk file to filter out unduplicated lines of a dataset


I have the following dataset and I would like to write an awk file that checks it line by line (with awk patterns or a for loop), so that after executing it in the following way:

gawk -f file.awk dataset.csv

it gives me a file with the records without duplicates and with the float in the last column rounded to two decimals. Below I attach a sample of my dataset; as you can see, there should be only one record per country.

Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.313743132
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.275057509
Algeria,Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.587215976
Andorra,Andorra,AND,77481,Europe,40024,153,516565,1975,0.382270638
Angola,Angola,AGO,34654212,Africa,99194,1900,2862,55,1.915438434
Anguilla,Anguilla,AIA,15237,Latin America and the Caribbean,2700,9,177200,591,0.333333333
Antigua and Barbuda,Antigua and Barbuda,ATG,99348,Latin America and the Caribbean,7493,135,75422,1359,1.801681569
Argentina,Argentina,ARG,45921761,Latin America and the Caribbean,9041124,128065,196881,2789,1.416472111
Armenia,Armenia,ARM,2972939,Asia,422574,8617,142140,2898,2.039169471

Since my level is not advanced, I don't mind if the code is long, so that I can familiarise myself with the steps it goes through.


awk '{a[$1]++}END{for (i in a)if (a[i]>1)print i;}' file

I found that this command could help with such functionality, but it is a shell one-liner rather than an awk script.
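If I understand correctly, to use it with gawk -f the program body would simply move into the file without the surrounding quotes, something like this (my untested guess):

# file.awk -- the one-liner body moved into a file
{ a[$1]++ }                                  # count occurrences keyed on field 1
END { for (i in a) if (a[i] > 1) print i }   # print keys that appear more than once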

Finally, as guidance, here is the desired output for the sample posted above (there are no duplicates in this sample):

Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.31
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.27
Algeria,Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.58
Andorra,Andorra,AND,77481,Europe,40024,153,516565,1975,0.38
Angola,Angola,AGO,34654212,Africa,99194,1900,2862,55,1.91
Anguilla,Anguilla,AIA,15237,Latin America and the Caribbean,2700,9,177200,591,0.33
Antigua and Barbuda,Antigua and Barbuda,ATG,99348,Latin America and the Caribbean,7493,135,75422,1359,1.80
Argentina,Argentina,ARG,45921761,Latin America and the Caribbean,9041124,128065,196881,2789,1.41
Armenia,Armenia,ARM,2972939,Asia,422574,8617,142140,2898,2.03

Thank you in advance for your help

CodePudding user response:

Your original code:

awk '{a[$1]++}END{for (i in a)if (a[i]>1)print i;}' file

has the test inverted: a[i]>1 should be a[i]==1 to only print unique lines.
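With the test flipped, and keying on the whole line rather than $1 (under the default whitespace FS, $1 would only be the first word of country names containing spaces), a corrected one-liner could look like this sketch:

awk '{ a[$0]++ } END { for (i in a) if (a[i] == 1) print i }' dataset.csv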

Two ways to reduce n to 2 decimal places are (the first truncates, the second rounds):

n = substr(n, 1, match(n, /[.]/) + 2)

n = sprintf("%0.2f",n)
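Note that they can give different results: substr truncates, sprintf rounds, and your sample output (e.g. 1.27 for Albania's 1.275057509) matches truncation. A quick way to see the difference on one sample value:

echo "1.275057509" | awk '{
    trunc = substr($1, 1, match($1, /[.]/) + 2)   # "1.27" -- cut after 2 decimals
    round = sprintf("%0.2f", $1)                  # "1.28" -- rounded
    print trunc, round
}'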

So your script could be:

BEGIN { FS=OFS="," } # delimit columns by comma
                     # csv must not have embedded commas

NR==1 {print; next} # print header

{ $10 = sprintf("%0.2f", $10) } # round column 10 to 2 decimals
                                # (assigning to $10 rebuilds $0 using OFS)

{ a[$0]++ } # using $0 means entire line must be unique

END { for (i in a) if (a[i]==1) print i } # print unique lines
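Be aware that for (i in a) visits keys in an arbitrary order, so the END loop may not preserve your input order. If what you actually want is to keep the first copy of each repeated line (rather than drop every copy) and keep the original order, a sketch like this would do it:

BEGIN { FS = OFS = "," }            # comma-delimited in and out
NR == 1 { print; next }             # print the header untouched
{ $10 = sprintf("%0.2f", $10) }     # round the last column; rebuilds $0 with OFS
!seen[$0]++                         # print a line only the first time it appears

Saved as file.awk, it runs the same way: gawk -f file.awk dataset.csv.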

Given your comment about data cleansing, it would probably be better to use a two-pass approach: use your original code to alert you to erroneous input, then truncate in a separate pass.
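A sketch of that two-pass workflow (the clean.csv name is just an example):

# pass 1: report any exact duplicate lines so the input can be fixed first
gawk '{ a[$0]++ } END { for (i in a) if (a[i] > 1) print "duplicate:", i }' dataset.csv

# pass 2: once the data is clean, only reformat column 10
gawk 'BEGIN { FS = OFS = "," } NR == 1 { print; next } { $10 = sprintf("%0.2f", $10) } 1' dataset.csv > clean.csv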

Note that if a single column differs between two rows, each row still looks like valid, unique input. For example, these lines are different:

Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.313743132
Afghanistan,Afghanistan,AFG,40462106,Asia,177827,7671,4395,190,4.313743132

I suppose you wanted to detect this, so your sanity checks need to be more sophisticated.
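For example, a sketch keyed on the ISO code column (assuming column 3 always holds the ISO 3166-1 alpha-3 code) that flags two rows for the same country with different contents:

BEGIN { FS = "," }
NR == 1 { next }                                  # skip the header
($3 in seen) && seen[$3] != $0 {                  # same ISO code, different row
    print "conflicting rows for " $3 > "/dev/stderr"
}
{ seen[$3] = $0 }                                 # remember the first row per ISO code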
