Most repeated string based on two arguments strings csv file. BASH-CodePudding

First exercise with bash, this is taking a lot of time...

I'm trying to create a script where, giving 2 arguments (height,weight) on athletes.csv returns number of coincidence on both values and predominant nationality based on that. And if that was not enough, if predominance is equal for 2 countries, then echo predominance with lowest id.

Also i can't use awk, grep, sed or csvkit.

Here is csv header:

id,name,nationality,sex,date_of_birth,height,weight,sport,gold,silver,bronze,info
736041664,A Jesus Garcia,ESP,male,1969-10-17,1.72,64,athletics,0,0,0,
532037425,A Lam Shin,KOR,female,1986-09-23,1.68,56,fencing,0,0,0,
435962603,Aaron Brown,CAN,male,1992-05-27,1.98,79,athletics,0,0,1,
521041435,Aaron Cook,MDA,male,1991-01-02,1.83,80,taekwondo,0,0,0,
33922579,Aaron Gate,NZL,male,1990-11-26,1.81,71,cycling,0,0,0,
173071782,Aaron Royle,AUS,male,1990-01-26,1.80,67,triathlon,0,0,0,
266237702,Aaron Russell,USA,male,1993-06-04,2.05,98,volleyball,0,0,1,

Until now:

count=0

while IFS=, read -a id _ nation _ _ height weight _ _ _ _; do

    if (( $height == "$2" )) && (( "$weight" == $3 )) ; then

        ((count  ))
    fi

done < athletes.csv

echo "$count"

I have seen a similiar problem. But can't find the way to return the most common nationality (strings).

Looking for something similar to:

Count, Predominant_nationality 1.85 130
8460, BRA

Should i try to work the hole exercise with arrays instead trying with lopps? Probably i could do it indexes, but looks like arrays are 1d here?

Any help would be a blessing

CodePudding user response：

It's a problem of sorting and counting that can be resolved with Linux standard text utilities

csv='athletes.csv'
crit='1\.85,90'

echo "Count Predominant_nationality $crit"
# Get fields from csv and sort on filtered fields 2,3
cut -d ',' -f 1,3,6,7 "$csv" | grep "$crit" | sort -t ',' -k2,3 | tr ',' ' ' | \
# Count unique skipping first field, get first
uniq -f 1 -c | sort -n -k1,1nr -k2n | head -n1 | tr -s ' ' | \
# print result
cut -d ' ' -f 2,4 --output-delimiter='    '

Result

Count Predominant_nationality 1.85,90
2    BRA

CodePudding user response：

A few issues with current code:

read -a says to read values into an array but what you really want is to read the values into individual variables
read -r is typical in this type of scenario (-r disables backslashes as escapes)
the if (( ... )) constucts are typically used for integer comparisons and since heights are non-integer (eg, 1.85) it's probably best to stick with string comparisons (especially since we're only interested in equality matches)

Setup; instead of downloading the link/data file I'll add 4x bogus lines to OP's sample input, making sure all 4x lines match OP's sample search parameters (1.85 and 130):

$ cat athletes.csv
id,name,nationality,sex,date_of_birth,height,weight,sport,gold,silver,bronze,info
736041664,A Jesus Garcia,ESP,male,1969-10-17,1.72,64,athletics,0,0,0,
532037425,A Lam Shin,KOR,female,1986-09-23,1.68,56,fencing,0,0,0,
435962603,Aaron Brown,CAN,male,1992-05-27,1.98,79,athletics,0,0,1,
521041435,Aaron Cook,MDA,male,1991-01-02,1.83,80,taekwondo,0,0,0,
33922579,Aaron Gate,NZL,male,1990-11-26,1.81,71,cycling,0,0,0,
173071782,Aaron Royle,AUS,male,1990-01-26,1.80,67,triathlon,0,0,0,
266237702,Aaron Russell,USA,male,1993-06-04,2.05,98,volleyball,0,0,1,
134,Aaron XX1,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
127,Aaron XX2,CAD,male,1993-06-04,1.85,130,volleyball,0,0,1,
34,Aaron XX3,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
27,Aaron XX4,CAD,male,1993-06-04,1.85,130,volleyball,0,0,1,

One bash idea:

arg1="1.85"
arg2="130"

maxid=99999999999

unset counts ids maxcount
declare -A counts ids
maxcount=0

while IFS=, read -r id _ nation _ _ height weight _
do
    if [[ "${height}" == "${arg1}" && "${weight}" == "${arg2}" ]]
    then
        (( counts[${nation}]   ))

        # keep track of overall max count

        [[ "${counts[${nation}]}" -gt "${maxcount}"  ]] && maxcount="${counts[${nation}]}"

        # keep track of min(id) for each nation

        [[ "${id}" -lt "${ids[${nation}]:-${maxid}}" ]] && ids[${nation}]="${id}"
    fi
done < athletes.csv

Alternatively, since it looks like our search patterns are together and can only occur in one location within a line, we can use grep to filter out only the matching rows:

$ grep ",${arg1},${arg2}," athletes.csv
134,Aaron XX1,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
127,Aaron XX2,CAD,male,1993-06-04,1.85,130,volleyball,0,0,1,
34,Aaron XX3,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
27,Aaron XX4,CAD,male,1993-06-04,1.85,130,volleyball,0,0,1,

We can then feed this result to the while/read loop and eliminte the need to test the height/weight variables, eg:

while IFS=, read -r id _ nation _
do
   (( counts[${nation}]   ))
   [[ "${counts[${nation}]}" -gt "${maxcount}"  ]] && maxcount="${counts[${nation}]}"
   [[ "${id}" -lt "${ids[${nation}]:-${maxid}}" ]] && ids[${nation}]="${id}"
done < <(grep ",${arg1},${arg2}," athletes.csv)

At this point both of these while/read loops produce:

$ typeset -p counts ids maxcount
declare -A counts=([USA]="2" [CAD]="2" )
declare -A ids=([USA]="34" [CAD]="27" )
declare -- maxcount="2"

From here OP can loop through the list of nations ("${!counts[@]}") looking for counts that are equal to maxcount and when found then apply an additional check to see if the nation has the lowest id (ids[]) seen so far in the loop. At the end of the loop OP should have the country a) count that is equal to maxcount and b) with the lowest id.

CodePudding user response：

You can try rq (https://github.com/fuyuncat/rquery/releases)
counta(;1) is doing counting, mina(;id) returns the smallest id, f height=@h and weight=@w filters the records with given arguments, e @2=@3 trim @1, @4, @h, @w matches the minimum id and display the result.

[ rquery]$ ./rq -n -v "h:1.85;w:130" -q "p d/,/\"\"/ | s counta(;1) ,mina(;id),id, nationality | f height=@h and weight=@w | e @2=@3 trim @1, @4, @h, @w " samples/athletes.csv
2       HON     1.85    130