I am working with the following dataset (a sample is shown below) and would like to create a bash script that selects only the records meeting a set of conditions and collects them in another file.
1. A regex to check that the continent is Asia, Africa or Europe, discarding the rest.
2. A regex to check that "Death percentage" is greater than 0.50.
3. A regex to check that "Survival Percentage" is greater than 2.00.
It is important that it is a bash script that uses these regular expressions in if conditions.
Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage, Survival Percentage
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.31,0.42
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.27,9.41
Algeria,Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.58,0.57
Andorra,Andorra,AND,77481,Europe,40024,153,516565,1975,0.38,51.54
As these records are written to the new file, the difference between "Survival Percentage" and "Death percentage" must be computed and stored in a new variable called "Dif.porc.pts", which holds the difference in percentage points as an absolute value.
The code I have come up with is below, but I have less experience in bash than in other languages.
read
while IFS=, read _ _ _ Continent _ _ _ _ _ Death Percentatge Survival Percentatge; do
if [[Continent ~ /Africa|Asia|Europe/) && (Death Percentage ~ /[0].[5-9][0-9] &&(Survival
Percentage ~ /[2-9].[0-9][0-9]]]
diff.porc.pts=$($Survical Percentatge/$Death Percentatge)|sed 's/-//'
paste -sd > new_file.txt
fi
cat new_file.txt
I also attach a sample of what the desired output should look like.
Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage, Survival Percentage, diff.porc.pts
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.31,0.42,3.89
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.27,9.41,8.14
Algeria,Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.58,0.57,2.01
Andorra,Andorra,AND,77481,Europe,40024,153,516565,1975,0.38,51.54,51.16
I would be grateful if you could help me to complete it.
Thank you in advance
CodePudding user response:
- You cannot include whitespace in bash variable names.
- You've misspelled Percentage as Percentatge.
- You've miscounted the column position of Continent.
- The regex operator in bash is =~, not ~.
- You should not enclose the regex in slashes.
- You will need to use bc or another external command for arithmetic on decimal numbers.
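As a minimal sketch of the last three points, the snippet below tests a value with =~ (pattern unquoted, no slashes) and subtracts two decimals with an external command. It assumes awk as the "other external command" in place of bc; the sample values come from the Albania row and the variable names are only illustrative.

```shell
#!/bin/bash
# Sample values from the Albania row; variable names are illustrative only.
continent="Europe"
death="1.27"
survival="9.41"

# =~ takes an unquoted extended regex, with no surrounding slashes.
if [[ $continent =~ ^(Africa|Asia|Europe)$ ]]; then
    match="yes"
else
    match="no"
fi

# Bash arithmetic is integer-only, so the subtraction is delegated to awk.
# The sign is flipped when negative to give the absolute value.
diff=$(awk -v s="$survival" -v d="$death" \
    'BEGIN { x = s - d; if (x < 0) x = -x; printf "%.2f", x }')

echo "$match $diff"
```

With these values the script should print "yes 8.14".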
Then would you please try the following:
#!/bin/bash
nr=0                                        # line counter, used to detect the header
while IFS= read -r line; do
    if (( nr++ == 0 )); then                # header line
        echo "$line,diff.porc.pts"
    else                                    # body
        # Continent is the 5th field; the two percentages are the last two fields
        IFS=, read -r _ _ _ _ Continent _ _ _ _ pDeath pSurvival <<< "$line"
        if [[ $Continent =~ ^(Africa|Asia|Europe)$ && $pDeath =~ ^(0\.[5-9]|[1-9]) && $pSurvival =~ ^([2-9]\.|[1-9][0-9]) ]]; then
            diff=$(echo "$pSurvival - $pDeath" | bc)
            diff=${diff#-}                  # drop a leading minus sign (absolute value)
            echo "$line,$diff"
        fi
    fi
done < input_file.txt > new_file.txt
Output:
Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage, Survival Percentage,diff.porc.pts
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.27,9.41,8.14
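It looks like only the Albania record meets all three conditions, contrary to the desired output shown in the question: Afghanistan and Algeria have a survival percentage below 2.00, and Andorra has a death percentage below 0.50.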
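To see why the other rows are dropped, here is a small sketch that applies the same three patterns to the four sample rows and reports which conditions each one passes. The report variable and the yes/no flags are introduced here purely for illustration.

```shell
#!/bin/bash
# Report, for each sample row, which of the three conditions it satisfies.
report=""
while IFS=, read -r country _ _ _ continent _ _ _ _ pDeath pSurvival; do
    c=no; d=no; s=no
    [[ $continent =~ ^(Africa|Asia|Europe)$ ]] && c=yes
    [[ $pDeath =~ ^(0\.[5-9]|[1-9]) ]] && d=yes
    [[ $pSurvival =~ ^([2-9]\.|[1-9][0-9]) ]] && s=yes
    report+="$country continent=$c death=$d survival=$s"$'\n'
done <<'EOF'
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.31,0.42
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.27,9.41
Algeria,Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.58,0.57
Andorra,Andorra,AND,77481,Europe,40024,153,516565,1975,0.38,51.54
EOF
printf '%s' "$report"
```

Note that these patterns also accept exactly 0.50 and 2.00, so at the boundaries they behave as "greater than or equal" rather than strictly "greater than".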