Create a subset in bash script with variables that meet a condition

Time:05-24

I am working on the following dataset (a sample is shown below). I would like to write a bash script that selects only the records that meet a set of conditions and collects those records in another file.

1. A regex to check that the continent is Asia, Africa or Europe, discarding the rest.

2. A regex to check that "Death percentage" is greater than 0.50.

3. A regex to check that "Survival Percentage" is greater than 2.00.

It is important that the solution is a bash script that uses these regular expressions in if conditions.

Sample of the data:

Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage, Survival Percentage
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.31,0.42
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.27,9.41
Algeria,Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.58,0.57
Andorra,Andorra,AND,77481,Europe,40024,153,516565,1975,0.38,51.54

As these records are written to the new file, a new variable called "diff.porc.pts" must be added to each one: the difference between "Survival Percentage" and "Death percentage", in absolute value.

The code I have come up with is below, but I have less experience with bash than with other languages.


read
while IFS=, read _ _ _ Continent _ _ _ _ _ Death Percentatge Survival Percentatge; do
     if [[Continent ~ /Africa|Asia|Europe/) && (Death Percentage ~ /[0].[5-9][0-9] &&(Survival 
     Percentage ~ /[2-9].[0-9][0-9]]]
          diff.porc.pts=$($Survical Percentatge/$Death Percentatge)|sed 's/-//'
          paste -sd > new_file.txt
     fi
cat new_file.txt

I also attach a sample of what the desired output should look like.

Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage, Survival Percentage, diff.porc.pts
Afghanistan,Afghanistan,AFG,40462186,Asia,177827,7671,4395,190,4.31,0.42,3.89
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.27,9.41,8.14
Algeria,Algeria,DZA,45236699,Africa,265691,6874,5873,152,2.58,0.57,2.01
Andorra,Andorra,AND,77481,Europe,40024,153,516565,1975,0.38,51.54,51.16

I would be grateful if you could help me to complete it.

Thank you in advance

CodePudding user response:

  • You cannot include whitespace in bash variable names.
  • You've misspelled Percentage as Percentatge.
  • You've miscounted the column position of Continent.
  • The regex operator in bash is =~, not ~.
  • You should not enclose the regex in slashes.
  • You will need to use bc or another external command for arithmetic on decimal numbers.
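
As a minimal sketch of the regex points above: a pattern can be kept in a variable and matched with =~ inside [[ ]], with no slashes around it. The names continent_re and check_continent are only illustrative.

```bash
#!/bin/bash
# Sketch only: continent_re and check_continent are illustrative names.

continent_re='^(Africa|Asia|Europe)$'   # no slashes around the pattern

check_continent() {
    # =~ is the regex match operator inside [[ ]]. The pattern variable is
    # left unquoted so that it is treated as a regex, not a literal string.
    if [[ $1 =~ $continent_re ]]; then
        echo "keep"
    else
        echo "discard"
    fi
}

check_continent "Europe"     # prints: keep
check_continent "Oceania"    # prints: discard
```

Anchoring the pattern with ^ and $ makes the whole field match, so a value that merely contains "Asia" is not accepted.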

Then would you please try the following:

#!/bin/bash

while read -r line; do
    if (( nr++ == 0 )); then            # header line (nr counts lines read so far)
        echo "$line,diff.porc.pts"
    else                                # body
        IFS=, read _ _ _ _ Continent _ _ _ _ pDeath pSurvival <<< "$line"
        if [[ $Continent =~ ^(Africa|Asia|Europe)$ && $pDeath =~ ^(0\.[5-9]|[1-9]) && $pSurvival =~ ^([2-9]\.|[1-9][0-9]) ]]; then
            diff=$(echo "$pSurvival - $pDeath" | bc)
            echo "$line,$diff"
        fi
    fi
done < input_file.txt > new_file.txt

Output:

Country,Other names,ISO 3166-1 alpha-3 CODE,Population,Continent,Total Cases,Total Deaths,Tot Cases//1M pop,Tot Deaths/1M pop,Death percentage, Survival Percentage,diff.porc.pts
Albania,Albania,ALB,2872296,Europe,273870,3492,95349,1216,1.27,9.41,8.14

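It looks like only the Albania record meets the conditions, contrary to the desired output shown in the question: Afghanistan and Algeria have a survival percentage below 2.00, and Andorra has a death percentage below 0.50.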
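
The question also asks for the difference in absolute value, whereas subtracting pDeath from pSurvival gives a signed result. Here is a minimal sketch of one way to drop the sign. It assumes awk is available and uses it in place of bc. The helper name abs_diff is hypothetical.

```bash
#!/bin/bash
# Sketch only: abs_diff is a hypothetical helper, and awk is swapped in for bc.

abs_diff() {
    # awk does the decimal arithmetic, negates a negative result,
    # and prints the value with two decimal places.
    awk -v a="$1" -v b="$2" 'BEGIN { d = a - b; if (d < 0) d = -d; printf "%.2f\n", d }'
}

abs_diff 0.42 4.31     # Afghanistan row (survival, death): prints 3.89
abs_diff 9.41 1.27     # Albania row (survival, death): prints 8.14
```

Inside the loop it could be called as diff=$(abs_diff "$pSurvival" "$pDeath"). The printf format also keeps the leading zero for results below 1.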
