If I have a set of data that has repeating name values but with different variations per repeating value, how can I sort by the top of each of those repeating values? Hopefully that made sense, but I hope to demonstrate what I mean further below.
Take for example this set of data in a tab separated csv file
Ranking ID Year Make Model Total
1 128 2010 Infiniti G37 128
2 124 2015 Jeep Wrangler 124
3 15 014 Audi S4 120
4 113 2012 Acura Tsx sportwagon 116
5 83 2014 Honda Accord 112
6 112 2008 Acura TL 110
7 65 2009 Honda Fit 106
8 91 2010 Mitsu Lancer 102
9 50 2015 Acura TLX 102
10 31 2007 Honda Fit 102
11 216 2007 Chrystler 300 96
12 126 2010 Volkswagen Eos 92
13 13 2016 Honda Civic 1.5t 92
If you look in the Make column, you can see names like Acura and Honda repeat, with differences in the Model and Total column. Assume that there's 200 or so rows of this in the csv file. How can I sort the file so that the items are grouped by Make with only three of the highest in value under the Total column being displayed by each Make?
Expected output below
Ranking ID Year Make Model Total
1 113 2012 Acura Tsx sportwagon 116
2 112 2008 Acura TL 110
3 50 2015 Acura TLX 102
4 83 2014 Honda Accord 112
5 31 2007 Honda Fit 102
6 13 2016 Honda Civic 1.5t 92
...
Here is my awk code so far, I can't get past this part to even attempt grouping the makes by total column
BEGIN {
FS = OFS = "\t";
}
FNR == 1 {
print;
next;
}
FNR > 1 {
a[NR] = $4;
}
END {
PROCINFO["sorted_in"] = "@val_str_desc"
for(i = 1; i < FN-1; i ) {
print a[i];
}
}
Currently, my code reads the text file, prints the headers (column titles) and then stops there, it doesn't go on to print out the rest of the data in alphabetical order. Any ideas?
CodePudding user response:
The following assumes bash (if you don't use bash replace $'\t'
by a quoted real tab character) and GNU coreutils. It also assumes that you want to sort alphabetically by Make
column first, then numerically in decreasing order by Total
, and finally keep at most the first 3 of each Make
entries.
Sorting is a job for sort
, head
and tail
can be used to isolate the header line, and awk
can be used to keep maximum 3 of each Make
, and re-number the first column:
$ head -n1 data.tsv; tail -n 2 data.tsv | sort -t$'\t' -k4,4 -k6,6rn |
awk -F'\t' -vOFS='\t' '$4==p {n =1} $4!=p {n=1;p=$4} {$1= r} n<=3'
Ranking ID Year Make Model Total
1 113 2012 Acura Tsx sportwagon 116
2 112 2008 Acura TL 110
3 50 2015 Acura TLX 102
4 15 014 Audi S4 120
5 216 2007 Chrystler 300 96
6 83 2014 Honda Accord 112
7 65 2009 Honda Fit 106
8 31 2007 Honda Fit 102
10 128 2010 Infiniti G37 128
11 124 2015 Jeep Wrangler 124
12 91 2010 Mitsu Lancer 102
13 126 2010 Volkswagen Eos 92
Note that this is different from your expected output: Make
is sorted in alphabetic order (Audi
comes after Acura
, not Honda
) and only the 3 largest Total
are kept (112, 106, 102
for Honda
, not 112, 102, 92
).
If you use GNU awk
, and your input file is small enough to fit in memory, you can also do all this with just awk
, thanks to its multidimensional arrays and its asorti
function, that sorts arrays based on indices:
$ awk -F'\t' -vOFS='\t' 'NR==1 {print; next} {l[$4][$6][$0]}
END {
PROCINFO["sorted_in"] = "@ind_str_asc"
for(m in l) {
n = asorti(l[m], t, "@ind_num_desc"); n = (n>3) ? 3 : n
for(i=1; i<=n; i ) for(s in l[m][t[i]]) {$0 = s; $1 = r; print}
}
}' data.tsv
Ranking ID Year Make Model Total
1 113 2012 Acura Tsx sportwagon 116
2 112 2008 Acura TL 110
3 50 2015 Acura TLX 102
4 15 014 Audi S4 120
5 216 2007 Chrystler 300 96
6 83 2014 Honda Accord 112
7 65 2009 Honda Fit 106
8 31 2007 Honda Fit 102
9 128 2010 Infiniti G37 128
10 124 2015 Jeep Wrangler 124
11 91 2010 Mitsu Lancer 102
12 126 2010 Volkswagen Eos 92
CodePudding user response:
Do you want to do the whole operation in awk? Would this suffice?
cat file.txt
Ranking ID Year Make Model Total
1 128 2010 Infiniti G37 128
2 124 2015 Jeep Wrangler 124
3 15 014 Audi S4 120
4 113 2012 Acura Tsx sportwagon 116
5 83 2014 Honda Accord 112
6 112 2008 Acura TL 110
7 65 2009 Honda Fit 106
8 91 2010 Mitsu Lancer 102
9 50 2015 Acura TLX 102
10 31 2007 Honda Fit 102
11 216 2007 Chrystler 300 96
12 126 2010 Volkswagen Eos 92
13 13 2016 Honda Civic 1.5t 92
awk -F"\t" '{ n[$4]; if(n[$4] < 4) print $0}' file.txt > tmp && (head -1 tmp; tail -n 2 tmp | sort -t$'\t' -k4,4 -k6,6nr)
Ranking ID Year Make Model Total
4 113 2012 Acura Tsx sportwagon 116
6 112 2008 Acura TL 110
9 50 2015 Acura TLX 102
3 15 014 Audi S4 120
11 216 2007 Chrystler 300 96
5 83 2014 Honda Accord 112
7 65 2009 Honda Fit 106
10 31 2007 Honda Fit 102
1 128 2010 Infiniti G37 128
2 124 2015 Jeep Wrangler 124
8 91 2010 Mitsu Lancer 102
12 126 2010 Volkswagen Eos 92