Home > Back-end >  Group range of values in a pivot table
Group range of values in a pivot table

Time:04-09

I'm currently working on a script for automating the creation of a pivot table from raw data file and I'm quiet happy with my current work.

But I would like to add another step after creating the pivot table. I need to import the data into excel and currently I would need to group values manually in excel, so it would be great to have this grouping action within my script.

My current pivot table looks kinda like this (the original is too big for posting it here):

    10 20 30 40 50
0,2  1  0  0  1  2
1,4  0  2  1  0  0
2,0  2  0  1  0  1
2,3  0  0  2  2  0
3,6  3  0  1  1  0
4,1  1  0  2  1  0

The result should look like this:

    10 20 30 40 50
0-2  3  2  2  1  3
2-4  3  0  3  3  0
> 4  1  0  2  1  0

So, the values in column 1 need to be grouped and their values in the table need to be added to each other.

My original table has field seperators (;) and the groups I need are 0-2, 2-4, 4-6, 6-8, 8-10 and > 10. So, like 0.2, 1.3, 1.9, 2.0 need to be grouped to "0-2".
2.2, 3.1, 3.6 to "2-4" and so on.

The values 2.0, 4.0, 6.0, 8.0 should be in the previous group, like 2.0 in "0-2", while the first value of "2-4" needs to be 2.1.

I can change the decimal seperator to dot if needed, it's just for excel which needs a comma when importing the table.

Any ideas how to manage this? Maybe using awk as I did in my script to create this pivot?

Thanks in advance, Jannomag

CodePudding user response:

$ cat tst.awk
BEGIN { beg=0; range=2; end=beg range; max=4 }
NR==1 { print; next }
$1 > end {
    prt()
    delete sum
    beg = end
    end  = range
}
{
    for (i=1; i<=NF; i  ) {
        sum[i]  = $i
    }
}
END { prt() }

function prt() {
    if ( beg == max ) {
        printf "> %d", beg
    }
    else {
        printf "%d-%d", beg, end
    }
    for (i=2; i<=NF; i  ) {
        printf "=", sum[i]
    }
    print ""
}

$ awk -f tst.awk file
    10 20 30 40 50
0-2  3  2  2  1  3
2-4  3  0  3  3  0
> 4  1  0  2  1  0

There may be an issue if you have values in your input greater than 4 but since those don't exist in the example you provided I wasn't able to test with it and I assume you can figure out how to handle it yourself given the above as a starting point.

CodePudding user response:

Assumptions/understandings:

  • objective is to collapse the input table into a series of summations based on what range the 1st column's value falls within
  • the 1st column's value is a float/real number; my locale uses the period as the decimal point so I'll convert commas to periods (I'll also assume that the real input file may contain floats/reals in other columns so I'll also convert commas to periods ... and hope there are no commas in use as the 1000's delimiter)
  • the 1st column's value falls within the range X-Y if value > X and value <= Y
  • ranges of interest: 0-2, 2-4, 4-6, etc
  • within a given range we sum the values in each of the columns #2 - #N
  • if an intermediate range does not have any rows from the input then we print said range with all 0's (eg, all input data falls within ranges 0-2 or 4-6 => we need to generate range 2-4 and set all sums=0)
  • input data may not be sorted by the 1st column so we'll need to accumlate the range/sums in memory (eg, an array)
  • all lines contain the same number of columns

Sample data:

$ cat pivot.dat
    10 20 30 40 50
0,2  1  0  0  1  2
1,4  0  2  1  0  0
3,6  3  0  1  1  0
7,1  1  0  2  1  0
2,3  0  0  2  2  0
2,0  2  0  1  0  1

One GNU awk idea:

awk '
FNR==1 { print; next }
       { gsub(/,/,".")                                                # convert commas to periods for my locale
                                                                      # remove the gsub() line if your locale can handle comma as the decimal point    

         endrange= int($1) - int($1)%2   ( $1==int($1) ? 0 : 2 )      # calculate upper range that $1 fits into
         maxrange=(endrange > maxrange ? endrange : maxrange)         # keep track of the max upper range that we have seen

         for (col=2; col<=NF; col  )                                  # loop through list of columns ...
             sums[endrange][col] =$col                                # summing up the values
       }

END    { for (endrange=2; endrange<=maxrange; endrange=endrange 2) {
             printf "%s-%s", (endrange-2), endrange
             for (col=2; col<=NF; col  )
                 printf "%s%s", OFS, sums[endrange][col] 0            # if this is an invalid array reference we can force the output==0 with the " 0"
             print ""
         }
       }
' pivot.dat

NOTES:

  • this requires GNU awk for multi-dimensional array support
  • OP has mentioned real input data actually uses the ; as a field delimiter; OP can update the code with the appropriate FS setting as needed
  • for simplicity I've used the default output field delimiter (OFS=" "); OP can update the code with an appropriate OFS setting as needed
  • OP has mentioned the input data was generated from a separate awk script; OP can decide if they wish to maintain 2 sets of awk scripts or attempt to consolidate into a single awk script

This generates:

    10 20 30 40 50
0-2 3 2 2 1 3
2-4 3 0 3 3 0
4-6 0 0 0 0 0                  # no input data for this range so we autofill with all 0's
6-8 1 0 2 1 0
  • Related