Home > other >  Create an awk script to generate a new column in a dataset?
Create an awk script to generate a new column in a dataset?

Time:05-21

I am working on the following dataset and I would like to create an awk script to make an arithmetic operation between with other columns and add the result of each record in a new column that could be called "Survival Percentage" with only 2 decimals.

The operation would be the following:

((Total Cases - Total Deaths)/Population)*100

Below you can see a sample of the dataset:

Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.313743132
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.275057509
Algeria,Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.587215976
Andorra,Andorra,AND,77481,Europe,40024,153,516565,1975,0.382270638

Here is an example of the desired output after running as: gawk -F file.awk dataset.csv

Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage, Survival Percentage
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.313743132,0.42
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.275057509,9.41
Algeria,Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.587215976,0.57
Andorra,Andorra,AND,77481,Europe,40024,153,516565,1975,0.382270638,51.54

The code I have tried to implement is as follows but it doesn't even run and it is a shell script and not an awk script as I wish.

awk  'BEGIN { FS=OFS="," } NR == 1 { $11="new" } NR > 1 { $11=(($6-$7)/$4)*100 }1' dataset.csv

From comments: After eliminating ^M as you told me, I have detected that there are rows in the "population" column that should be numerical and there is a string. Do you have any idea to discard the records that meet this condition using also awk and then perform the operation of my code? Any idea?

CodePudding user response:

Considering your samples and shown attempts here is the awk code which you can try. Which removes Control M characters(which we found out in comments) and this also checks that your 6th, 7th and 4th columns should be integers/floats etc and nothing else, try it out once.

awk '
BEGIN  { FS=OFS="," }
{ gsub(/\r/,"") }
FNR==1 { $11="new"  }
FNR > 1 && ($6 0=$6 && $7 0=$7 && $4 0=$4){
  $11=(($6-$7)/$4)*100
}
1' dataset.csv

OR to get values till 2 decimal points try something like(make use of sprintf function to save values to $11 itself:

awk '
BEGIN  { FS=OFS="," }
{ gsub(/\r/,"") }
FNR==1 { $11="new"  }
FNR > 1 && ($6 0=$6 && $7 0=$7 && $4 0=$4){
  $11=sprintf("%0.2f",(($6-$7)/$4)*100)
}
1' dataset.csv
  • Related