First exercise with bash, this is taking a lot of time...
I'm trying to create a script where, giving 2 arguments (height,weight) on athletes.csv returns number of coincidence on both values and predominant nationality based on that. And if that was not enough, if predominance is equal for 2 countries, then echo
predominance with lowest id.
Also i can't use awk, grep, sed or csvkit.
Here is csv header:
id,name,nationality,sex,date_of_birth,height,weight,sport,gold,silver,bronze,info
736041664,A Jesus Garcia,ESP,male,1969-10-17,1.72,64,athletics,0,0,0,
532037425,A Lam Shin,KOR,female,1986-09-23,1.68,56,fencing,0,0,0,
435962603,Aaron Brown,CAN,male,1992-05-27,1.98,79,athletics,0,0,1,
521041435,Aaron Cook,MDA,male,1991-01-02,1.83,80,taekwondo,0,0,0,
33922579,Aaron Gate,NZL,male,1990-11-26,1.81,71,cycling,0,0,0,
173071782,Aaron Royle,AUS,male,1990-01-26,1.80,67,triathlon,0,0,0,
266237702,Aaron Russell,USA,male,1993-06-04,2.05,98,volleyball,0,0,1,
Until now:
count=0
while IFS=, read -a id _ nation _ _ height weight _ _ _ _; do
if (( $height == "$2" )) && (( "$weight" == $3 )) ; then
((count ))
fi
done < athletes.csv
echo "$count"
I have seen a similiar problem. But can't find the way to return the most common nationality (strings).
Looking for something similar to:
Count, Predominant_nationality 1.85 130
8460, BRA
Should i try to work the hole exercise with arrays instead trying with lopps? Probably i could do it indexes, but looks like arrays are 1d here?
Any help would be a blessing
CodePudding user response:
It's a problem of sorting and counting that can be resolved with Linux standard text utilities
csv='athletes.csv'
crit='1\.85,90'
echo "Count Predominant_nationality $crit"
# Get fields from csv and sort on filtered fields 2,3
cut -d ',' -f 1,3,6,7 "$csv" | grep "$crit" | sort -t ',' -k2,3 | tr ',' ' ' | \
# Count unique skipping first field, get first
uniq -f 1 -c | sort -n -k1,1nr -k2n | head -n1 | tr -s ' ' | \
# print result
cut -d ' ' -f 2,4 --output-delimiter=' '
Result
Count Predominant_nationality 1.85,90
2 BRA
CodePudding user response:
A few issues with current code:
read -a
says to read values into an array but what you really want is to read the values into individual variablesread -r
is typical in this type of scenario (-r
disables backslashes as escapes)- the
if (( ... ))
constucts are typically used for integer comparisons and since heights are non-integer (eg,1.85
) it's probably best to stick with string comparisons (especially since we're only interested in equality matches)
Setup; instead of downloading the link/data file I'll add 4x bogus lines to OP's sample input, making sure all 4x lines match OP's sample search parameters (1.85
and 130
):
$ cat athletes.csv
id,name,nationality,sex,date_of_birth,height,weight,sport,gold,silver,bronze,info
736041664,A Jesus Garcia,ESP,male,1969-10-17,1.72,64,athletics,0,0,0,
532037425,A Lam Shin,KOR,female,1986-09-23,1.68,56,fencing,0,0,0,
435962603,Aaron Brown,CAN,male,1992-05-27,1.98,79,athletics,0,0,1,
521041435,Aaron Cook,MDA,male,1991-01-02,1.83,80,taekwondo,0,0,0,
33922579,Aaron Gate,NZL,male,1990-11-26,1.81,71,cycling,0,0,0,
173071782,Aaron Royle,AUS,male,1990-01-26,1.80,67,triathlon,0,0,0,
266237702,Aaron Russell,USA,male,1993-06-04,2.05,98,volleyball,0,0,1,
134,Aaron XX1,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
127,Aaron XX2,CAD,male,1993-06-04,1.85,130,volleyball,0,0,1,
34,Aaron XX3,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
27,Aaron XX4,CAD,male,1993-06-04,1.85,130,volleyball,0,0,1,
One bash
idea:
arg1="1.85"
arg2="130"
maxid=99999999999
unset counts ids maxcount
declare -A counts ids
maxcount=0
while IFS=, read -r id _ nation _ _ height weight _
do
if [[ "${height}" == "${arg1}" && "${weight}" == "${arg2}" ]]
then
(( counts[${nation}] ))
# keep track of overall max count
[[ "${counts[${nation}]}" -gt "${maxcount}" ]] && maxcount="${counts[${nation}]}"
# keep track of min(id) for each nation
[[ "${id}" -lt "${ids[${nation}]:-${maxid}}" ]] && ids[${nation}]="${id}"
fi
done < athletes.csv
Alternatively, since it looks like our search patterns are together and can only occur in one location within a line, we can use grep
to filter out only the matching rows:
$ grep ",${arg1},${arg2}," athletes.csv
134,Aaron XX1,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
127,Aaron XX2,CAD,male,1993-06-04,1.85,130,volleyball,0,0,1,
34,Aaron XX3,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
27,Aaron XX4,CAD,male,1993-06-04,1.85,130,volleyball,0,0,1,
We can then feed this result to the while/read
loop and eliminte the need to test the height/weight
variables, eg:
while IFS=, read -r id _ nation _
do
(( counts[${nation}] ))
[[ "${counts[${nation}]}" -gt "${maxcount}" ]] && maxcount="${counts[${nation}]}"
[[ "${id}" -lt "${ids[${nation}]:-${maxid}}" ]] && ids[${nation}]="${id}"
done < <(grep ",${arg1},${arg2}," athletes.csv)
At this point both of these while/read
loops produce:
$ typeset -p counts ids maxcount
declare -A counts=([USA]="2" [CAD]="2" )
declare -A ids=([USA]="34" [CAD]="27" )
declare -- maxcount="2"
From here OP can loop through the list of nations ("${!counts[@]}"
) looking for counts that are equal to maxcount
and when found then apply an additional check to see if the nation has the lowest id (ids[]
) seen so far in the loop. At the end of the loop OP should have the country a) count that is equal to maxcount
and b) with the lowest id.
CodePudding user response:
You can try rq
(https://github.com/fuyuncat/rquery/releases)
counta(;1)
is doing counting, mina(;id)
returns the smallest id, f height=@h and weight=@w
filters the records with given arguments, e @2=@3 trim @1, @4, @h, @w
matches the minimum id and display the result.
[ rquery]$ ./rq -n -v "h:1.85;w:130" -q "p d/,/\"\"/ | s counta(;1) ,mina(;id),id, nationality | f height=@h and weight=@w | e @2=@3 trim @1, @4, @h, @w " samples/athletes.csv
2 HON 1.85 130