I am working on the following dataset and I would like to create an awk script to make an arithmetic operation between with other columns and add the result of each record in a new column that could be called "Survival Percentage" with only 2 decimals.
The operation would be the following:
((Total Cases - Total Deaths)/Population)*100
Below you can see a sample of the dataset:
Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.313743132
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.275057509
Algeria,Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.587215976
Andorra,Andorra,AND,77481,Europe,40024,153,516565,1975,0.382270638
Here is an example of the desired output after running as: gawk -F file.awk dataset.csv
Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage, Survival Percentage
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.313743132,0.42
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.275057509,9.41
Algeria,Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.587215976,0.57
Andorra,Andorra,AND,77481,Europe,40024,153,516565,1975,0.382270638,51.54
The code I have tried to implement is as follows but it doesn't even run and it is a shell script and not an awk script as I wish.
awk 'BEGIN { FS=OFS="," } NR == 1 { $11="new" } NR > 1 { $11=(($6-$7)/$4)*100 }1' dataset.csv
From comments: After eliminating ^M as you told me, I have detected that there are rows in the "population" column that should be numerical and there is a string. Do you have any idea to discard the records that meet this condition using also awk and then perform the operation of my code? Any idea?
CodePudding user response:
Considering your samples and shown attempts here is the awk
code which you can try. Which removes Control M characters(which we found out in comments) and this also checks that your 6th, 7th and 4th columns should be integers/floats etc and nothing else, try it out once.
awk '
BEGIN { FS=OFS="," }
{ gsub(/\r/,"") }
FNR==1 { $11="new" }
FNR > 1 && ($6 0=$6 && $7 0=$7 && $4 0=$4){
$11=(($6-$7)/$4)*100
}
1' dataset.csv
OR to get values till 2 decimal points try something like(make use of sprintf
function to save values to $11
itself:
awk '
BEGIN { FS=OFS="," }
{ gsub(/\r/,"") }
FNR==1 { $11="new" }
FNR > 1 && ($6 0=$6 && $7 0=$7 && $4 0=$4){
$11=sprintf("%0.2f",(($6-$7)/$4)*100)
}
1' dataset.csv