I'm currently working on a script for automating the creation of a pivot table from raw data file and I'm quiet happy with my current work.
But I would like to add another step after creating the pivot table. I need to import the data into excel and currently I would need to group values manually in excel, so it would be great to have this grouping action within my script.
My current pivot table looks kinda like this (the original is too big for posting it here):
10 20 30 40 50
0,2 1 0 0 1 2
1,4 0 2 1 0 0
2,0 2 0 1 0 1
2,3 0 0 2 2 0
3,6 3 0 1 1 0
4,1 1 0 2 1 0
The result should look like this:
10 20 30 40 50
0-2 3 2 2 1 3
2-4 3 0 3 3 0
> 4 1 0 2 1 0
So, the values in column 1 need to be grouped and their values in the table need to be added to each other.
My original table has field seperators (;) and the groups I need are 0-2, 2-4, 4-6, 6-8, 8-10 and > 10.
So, like 0.2, 1.3, 1.9, 2.0 need to be grouped to "0-2".
2.2, 3.1, 3.6 to "2-4" and so on.
The values 2.0, 4.0, 6.0, 8.0 should be in the previous group, like 2.0 in "0-2", while the first value of "2-4" needs to be 2.1.
I can change the decimal seperator to dot if needed, it's just for excel which needs a comma when importing the table.
Any ideas how to manage this? Maybe using awk
as I did in my script to create this pivot?
Thanks in advance, Jannomag
CodePudding user response:
$ cat tst.awk
BEGIN { beg=0; range=2; end=beg range; max=4 }
NR==1 { print; next }
$1 > end {
prt()
delete sum
beg = end
end = range
}
{
for (i=1; i<=NF; i ) {
sum[i] = $i
}
}
END { prt() }
function prt() {
if ( beg == max ) {
printf "> %d", beg
}
else {
printf "%d-%d", beg, end
}
for (i=2; i<=NF; i ) {
printf "=", sum[i]
}
print ""
}
$ awk -f tst.awk file
10 20 30 40 50
0-2 3 2 2 1 3
2-4 3 0 3 3 0
> 4 1 0 2 1 0
There may be an issue if you have values in your input greater than 4 but since those don't exist in the example you provided I wasn't able to test with it and I assume you can figure out how to handle it yourself given the above as a starting point.
CodePudding user response:
Assumptions/understandings:
- objective is to collapse the input table into a series of summations based on what
range
the 1st column's value falls within - the 1st column's value is a float/real number; my locale uses the period as the decimal point so I'll convert commas to periods (I'll also assume that the real input file may contain floats/reals in other columns so I'll also convert commas to periods ... and hope there are no commas in use as the 1000's delimiter)
- the 1st column's value falls within the
range X-Y
ifvalue > X
andvalue <= Y
- ranges of interest:
0-2
,2-4
,4-6
, etc - within a given
range
we sum the values in each of the columns #2 - #N - if an intermediate range does not have any rows from the input then we print said range with all 0's (eg, all input data falls within ranges
0-2
or4-6
=> we need to generate range2-4
and set all sums=0) - input data may not be sorted by the 1st column so we'll need to accumlate the range/sums in memory (eg, an array)
- all lines contain the same number of columns
Sample data:
$ cat pivot.dat
10 20 30 40 50
0,2 1 0 0 1 2
1,4 0 2 1 0 0
3,6 3 0 1 1 0
7,1 1 0 2 1 0
2,3 0 0 2 2 0
2,0 2 0 1 0 1
One GNU awk
idea:
awk '
FNR==1 { print; next }
{ gsub(/,/,".") # convert commas to periods for my locale
# remove the gsub() line if your locale can handle comma as the decimal point
endrange= int($1) - int($1)%2 ( $1==int($1) ? 0 : 2 ) # calculate upper range that $1 fits into
maxrange=(endrange > maxrange ? endrange : maxrange) # keep track of the max upper range that we have seen
for (col=2; col<=NF; col ) # loop through list of columns ...
sums[endrange][col] =$col # summing up the values
}
END { for (endrange=2; endrange<=maxrange; endrange=endrange 2) {
printf "%s-%s", (endrange-2), endrange
for (col=2; col<=NF; col )
printf "%s%s", OFS, sums[endrange][col] 0 # if this is an invalid array reference we can force the output==0 with the " 0"
print ""
}
}
' pivot.dat
NOTES:
- this requires
GNU awk
for multi-dimensional array support - OP has mentioned real input data actually uses the
;
as a field delimiter; OP can update the code with the appropriateFS
setting as needed - for simplicity I've used the default output field delimiter (
OFS=" "
); OP can update the code with an appropriateOFS
setting as needed - OP has mentioned the input data was generated from a separate
awk
script; OP can decide if they wish to maintain 2 sets ofawk
scripts or attempt to consolidate into a singleawk
script
This generates:
10 20 30 40 50
0-2 3 2 2 1 3
2-4 3 0 3 3 0
4-6 0 0 0 0 0 # no input data for this range so we autofill with all 0's
6-8 1 0 2 1 0