Home > database >  Count occurrences in a csv with Bash
Count occurrences in a csv with Bash

Time:03-19

I have to create a script that given a country and a sport you get the number of medalists and medals won after reading a csv file.

The csv is called "athletes.csv" and have this header

id|name|nationality|sex|date_of_birth|height|weight|sport|gold|silver|bronze|info

when you call the script you have to add the nationality and sport as parameters.

The script i have created is this one:

#!/bin/bash
participants=0
medals=0
while IFS=, read -ra array
do
    if [[ "${array[2]}" == $1 && "${array[7]}" == $2 ]]
    then
        participants=$participants  
        medals=$(($medals ${array[8]} ${array[9]} ${array[10]))
    fi
done < athletes.csv
echo $participants
echo $medals

where array[3] is the nationality, array[8] is the sport and array[9] to [11] are the number of medals won.

When i run the script with the correct paramters I get 0 participants and 0 medals. Could you help me to understand what I'm doing wrong? Note I cannot use awk nor grep

Thanks in advance

CodePudding user response:

Here's a fixed version of your code:

#!/bin/bash

participants=0
medals=0
{
    # get rid of the header
    read

    # process the records
    while IFS=',' read -ra array
    do
        if [[ "${array[2]}" == $1 && "${array[7]}" == $2 ]]
        then
            (( participants   ))
            medals=$(( medals   array[8]   array[9]   array[10] ))
        fi
    done
} < athletes.csv

echo "$participants" "$medals"

remark: As $1 and $2 are unquoted they are subject to glob matching in the context that they are used. for example, for showing the total number of medals won by the US you could do:

./script.sh '[Uu][Ss]' '*'

Now I have to say, doing text processing with pure shell isn't exactly a good practice; there exists dedicated tools for that. Here's an example with awk:

awk -v FS=',' -v country="$1" -v sport="$2" '
    BEGIN {
        participants = medals = 0
    }
    NR == 1 { next }
    $3 == country && $8 == sport {
        participants  
        medals  = $9   $10   $11
    }
    END { print participants, medals }
' athletes.csv

Now the last potential problem: the CSV format can be tricky to read and might need a real CSV parser. There are some awk libraries for that but IMO it's simpler to use a CSV-aware tool that provides the functionalities that you need. Here's an example with Miller:

mlr --icsv filter -s country="$1" -s sport="$2" '
    begin {
        @participants = 0;
        @medals = 0;
    }
    $nationality == @country && $sport == @sport {
        @participants  = 1;
        @medals  = $gold   $silver   $bronze;
    }
    false;
    end { print @participants, @medals; }
' athletes.csv

CodePudding user response:

Try this:

#! /bin/bash -p

nation_arg=$1
sport_arg=$2

declare -i participants=0
declare -i medals=0
declare -i line_num=0

while IFS=, read -r _ _ nation _ _ _ _ sport ngold nsilver nbronze _; do
    ((   line_num == 1 )) && continue   # Skip the header
    [[ $nation == "$nation_arg" && $sport == "$sport_arg" ]] || continue
    participants =1
    medals =ngold nsilver nbronze
done <athletes.csv
declare -p participants
declare -p medals
  • The code uses named variables instead of numbered positional parameters and array indexes to try to improve readability and maintainability.
  • Using declare -i means that strings assigned to the declared variables are treated as arithmetic expressions. That reduces clutter by avoiding the need for $(( ... )).
  • The code assumes that the field separator in the CSV file is ,, not | as in the header. If the separator is really |, replace IFS=, with IFS='|'.
  •  Tags:  
  • bash
  • Related