Home > Software engineering >  Converting data in rows to columns
Converting data in rows to columns

Time:12-05

My input tab-delimited file is like this:

13435    830169  830264  a    95   y    16
09433    835620  835672  x    46
30945    838405  838620  a    21   c    19
94853    850475  850660  y    15
04958    865700  865978  c    16   a    98

After first three columns, the file shows variable and its value in the next column. I need to change data structure so that after first three columns, there are columns for variables like this:

                         a    x    y    c   
13435    830169  830264  95        16
09433    835620  835672       46
30945    838405  838620  21             19
94853    850475  850660            15
04958    865700  865978  98             16

Is there any code to do this on linux? The file size is 7.6 MB and the total number of lines are around 450,000. The total number of variables are four.

Thank you

CodePudding user response:

In pure bash (requires bash 4.0 or newer):

#!/bin/bash

declare -A var

printf '\t\t\ta\tx\ty\tc\n'
while IFS=$'\t' read -ra fld; do
    var[a]=""  var[x]=""  var[y]=""  var[c]=""
    for ((i = 3; i < ${#fld[@]}; i  = 2)); do
        var["${fld[i]}"]=${fld[i   1]}
    done
    printf '%s\t' "${fld[@]:0:3}"
    printf '%s\t%s\t%s\t%s\n' "${var[a]}" "${var[x]}" "${var[y]}" "${var[c]}"
done < file

CodePudding user response:

If you know you have 4 variables a, x, y, c, and the file is formatted as a tab separated file and you want the exact format as output shown, you can simply use a "Brute Force" method where you check the contents of fields 4 and 6 for the variable name and output the value of field 5 or 7 formatted as shown using printf.

For example, knowing the variable names you can simply output the header row before then processing each records as follows:

awk -F"\t" '
  FNR==1 { 
    print "\t\t\t  a    x    y    c"
  }
  {
    printf "%-8s%8s%8s  ", $1, $2, $3
    
    if ($4=="a")
      printf "%-5s", $5
    else if ($6=="a")
      printf "%-5s", $7
    else
      printf "%-5s", " "
    
    if ($4=="x")
      printf "%-5s", $5
    else if ($6=="x")
      printf "%-5s", $7
    else
      printf "%-5s", " "
    
    if ($4=="y")
      printf "%-5s", $5
    else if ($6=="y")
      printf "%-5s", $7
    else
      printf "%-5s", " "
    
    if ($4=="c")
      printf "%-5s\n", $5
    else if ($6=="c")
      printf "%-5s\n", $7
    else
      print ""
  }
' tabfile

Example Use/Output

With your input in tabfile you would have:

$ awk -F"\t" '
>   FNR==1 {
>     print "\t\t\t  a    x    y    c"
>   }
>   {
>     printf "%-8s%8s%8s  ", $1, $2, $3
>
>     if ($4=="a")
>       printf "%-5s", $5
>     else if ($6=="a")
>       printf "%-5s", $7
>     else
>       printf "%-5s", " "
>
>     if ($4=="x")
>       printf "%-5s", $5
>     else if ($6=="x")
>       printf "%-5s", $7
>     else
>       printf "%-5s", " "
>
>     if ($4=="y")
>       printf "%-5s", $5
>     else if ($6=="y")
>       printf "%-5s", $7
>     else
>       printf "%-5s", " "
>
>     if ($4=="c")
>       printf "%-5s\n", $5
>     else if ($6=="c")
>       printf "%-5s\n", $7
>     else
>       print ""
>   }
> ' tabfile
                          a    x    y    c
13435     830169  830264  95        16
09433     835620  835672       46
30945     838405  838620  21             19
94853     850475  850660            15
04958     865700  865978  98             16

Which provides the desired output. This one-pass approach will also be quite efficient for 450,000 lines of input. Since this is a bit long for a command-line script, you can simply put it in an awk script and call it with the filename. Let me know if you have questions.

As A Script File

Using as a script file, simply put the contents in a file and make it executable, e.g.

#!/usr/bin/awk -f

BEGIN { FS="\t" }
FNR==1 { 
  print "\t\t\t  a    x    y    c"
}
{
  printf "%-8s%8s%8s  ", $1, $2, $3
  
  if ($4=="a")
    printf "%-5s", $5
  else if ($6=="a")
    printf "%-5s", $7
  else
    printf "%-5s", " "
  
  if ($4=="x")
    printf "%-5s", $5
  else if ($6=="x")
    printf "%-5s", $7
  else
    printf "%-5s", " "
  
  if ($4=="y")
    printf "%-5s", $5
  else if ($6=="y")
    printf "%-5s", $7
  else
    printf "%-5s", " "
  
  if ($4=="c")
    printf "%-5s\n", $5
  else if ($6=="c")
    printf "%-5s\n", $7
  else
    print ""
}

Saved as awkscript you would chmod x awkscript and then run:

$ ./awkscript tabfile
                          a    x    y    c
13435     830169  830264  95        16
09433     835620  835672       46
30945     838405  838620  21             19
94853     850475  850660            15
04958     865700  865978  98             16

CodePudding user response:

input="\
13435   830169  830264  a   95  y   16
09433   835620  835672  x   46
30945   838405  838620  a   21  c   19
94853   850475  850660  y   15
04958   865700  865978  c   16  a   98
"

with awk:

printf '\t\t\ta\tx\ty\tc\n'
echo -n "$input" |
awk -v vars='a x y c' '
  BEGIN {NV = split(vars,V)}
  {
     s = $1 "\t" $2 "\t" $3;
     delete a;
     for(i = 4; i < NF; i = i 2) a[$i] = $(i 1);
     for(i = 1; i <= NV; i  ) s = s "\t" a[V[i]];
     print s
  }
'

with ruby:

printf '\t\t\ta\tx\ty\tc\n'
echo -n "$input" |
vars='a x y c' ruby -ane '
    BEGIN{v = ENV["vars"].split};
    h = Hash[*$F[3..-1]];
    puts $F[0..2].concat(v.map{|v| h[v]}).join("\t")
'

outputs:

            a   x   y   c
13435   830169  830264  95      16  
09433   835620  835672      46      
30945   838405  838620  21          19
94853   850475  850660          15  
04958   865700  865978  98          16

CodePudding user response:

In perl:

$ perl -lane '
    BEGIN { print join("\t", "", "", "", "a", "x", "y", "c"); }
    my %vars = @F[3..$#F];
    print join("\t", @F[0..2], @vars{qw/a x y c/});
  ' input.tsv
                        a       x       y       c
13435   830169  830264  95              16
09433   835620  835672          46
30945   838405  838620  21                      19
94853   850475  850660                  15
04958   865700  865978  98                      16

The fourth column and all following are taken as key/value pairs for a hash table, and then the variable values that are present in it are extracted in the right order, along with the first three columns. Makes heavy use of slices.

  • Related