I would like to catch the median value from second column of this CSV file grouped by Continent in a script awk file:
Continent, Ratio
Asia,4.313743132
Europe,1.275057509
Africa,2.587215976
Europe,0.382270638
Africa,1.915438434
Latin America and the Caribbean,1.801681569
Latin America and the Caribbean,1.416472111
Asia,2.039169471
Latin America and the Caribbean,0.622595518
I tried but it doesn't work because I didn't sort CSV and I didn't even group by continent:
BEGIN {
FS = ","
Continent["Africa"];Continent["Asia"];Continent["Europe"];Continent["Latin America and the Caribbean"];
}
FNR>1 {
if ($1!="" && $2!="") {
a[i ]=$2
}
}END {
for (i in Continent) {
x=int((i 1)/2); if (x < (i 1)/2) print (a[x-1] a[x])/2; else print a[x-1];
}
}
How ever, I expect to order first of all the csv by Continent and Numeric (desc or asc, doesn't matter):
Continent, Ratio
Africa,2.587215976
Africa,1.915438434
Asia,4.313743132
Asia,2.039169471
Europe,1.275057509
Europe,0.382270638
Latin America and the Caribbean,1.801681569
Latin America and the Caribbean,1.416472111
Latin America and the Caribbean,0.622595518
And finally getting the median by Continent:
Continent, Median
Africa 2,2513
Asia 3,1764
Europe 0,8286
Latin America and the Caribbean, 1.4164
CodePudding user response:
Using gnu-awk
you may use this solution:
cat med.awk
function med(d) {
return gensub(/\./, ",", "1", sprintf("%.4f",
(n%2 ? d[(n 1)/2] : (d[n/2] d[n/2 1]) / 2)))
}
BEGIN {
FS = ","
OFS = "\t"
}
NR > 1 {
a[$0] = $2
}
END {
PROCINFO["sorted_in"] = "@ind_num_asc" # this is for sorting
print "Continent", "Median
for (i in a) {
k = gensub(/,.*/, "", "1", i)
if (k != pk) {
if (pk)
print pk, med(d)
n = 0
delete d
}
d[ n] = a[i]
pk = k
}
print pk, med(d)
}
Then use it as:
awk -f med.awk file | column -t -s $'\t'
Continent Median
Africa 2,2513
Asia 3,1765
Europe 0,8287
Latin America and the Caribbean 1,4165
CodePudding user response:
Just for fun, you could write this in a single gawk
script, but in coordination with sort
, using a two-way pipe (or coprocess):
median.awk
#!/usr/bin/env gawk -f
function median(values, _, len) {
len = length(values)
if (len % 2 == 1) {
return values[(len 1) / 2]
} else {
return (values[int(len / 2)] values[int(len / 2) 1]) / 2
}
}
BEGIN {
# Set (Input) Field and Output Field Separators.
FS = OFS = ","
# The `sort` process to which we'll send data.
#
# -t, → set command as the input separator
# -k1,1 → 1st sort key starts at the 1st field and ends at the 1st field
# -k2,2n → 2nd sort key starts at the 2nd field and ends at the 2nd field
# and use numeric comparisons.
sorted = "LC_ALL=C sort -t, -k1,1 -k2,2n"
}
NR == 1 {
next # skip first line
}
{
# Pipe incoming lines to a sorting coprocess.
# See: https://www.gnu.org/software/gawk/manual/html_node/Two_002dway-I_002fO.html
print $0 |& sorted
}
END {
# Close the writing end of the `sorted` pipe.
close(sorted, "to")
while (1) {
# Read sorted data.
more = sorted |& getline
# Print median when there's no more data or when we're switching between regions.
if (!more || (region && region != $1)) {
print region, sprintf("%.6f", median(vals))
vals_i = 0
delete vals
}
# Clean up when there's no more data.
if (!more) {
close(sorted)
break
}
region = $1
vals[ vals_i] = $2
}
}
Result
$ ./median.awk data.txt
Africa,2.251327
Asia,3.176456
Europe,0.828664
Latin America and the Caribbean,1.416472