My input tab-delimited file is like this:
13435 830169 830264 a 95 y 16
09433 835620 835672 x 46
30945 838405 838620 a 21 c 19
94853 850475 850660 y 15
04958 865700 865978 c 16 a 98
After first three columns, the file shows variable and its value in the next column. I need to change data structure so that after first three columns, there are columns for variables like this:
a x y c
13435 830169 830264 95 16
09433 835620 835672 46
30945 838405 838620 21 19
94853 850475 850660 15
04958 865700 865978 98 16
Is there any code to do this on linux? The file size is 7.6 MB and the total number of lines are around 450,000. The total number of variables are four.
Thank you
CodePudding user response:
In pure bash (requires bash 4.0 or newer):
#!/bin/bash
declare -A var
printf '\t\t\ta\tx\ty\tc\n'
while IFS=$'\t' read -ra fld; do
var[a]="" var[x]="" var[y]="" var[c]=""
for ((i = 3; i < ${#fld[@]}; i = 2)); do
var["${fld[i]}"]=${fld[i 1]}
done
printf '%s\t' "${fld[@]:0:3}"
printf '%s\t%s\t%s\t%s\n' "${var[a]}" "${var[x]}" "${var[y]}" "${var[c]}"
done < file
CodePudding user response:
If you know you have 4 variables a
, x
, y
, c
, and the file is formatted as a tab separated file and you want the exact format as output shown, you can simply use a "Brute Force" method where you check the contents of fields 4
and 6
for the variable name and output the value of field 5
or 7
formatted as shown using printf
.
For example, knowing the variable names you can simply output the header row before then processing each records as follows:
awk -F"\t" '
FNR==1 {
print "\t\t\t a x y c"
}
{
printf "%-8s%8s%8s ", $1, $2, $3
if ($4=="a")
printf "%-5s", $5
else if ($6=="a")
printf "%-5s", $7
else
printf "%-5s", " "
if ($4=="x")
printf "%-5s", $5
else if ($6=="x")
printf "%-5s", $7
else
printf "%-5s", " "
if ($4=="y")
printf "%-5s", $5
else if ($6=="y")
printf "%-5s", $7
else
printf "%-5s", " "
if ($4=="c")
printf "%-5s\n", $5
else if ($6=="c")
printf "%-5s\n", $7
else
print ""
}
' tabfile
Example Use/Output
With your input in tabfile
you would have:
$ awk -F"\t" '
> FNR==1 {
> print "\t\t\t a x y c"
> }
> {
> printf "%-8s%8s%8s ", $1, $2, $3
>
> if ($4=="a")
> printf "%-5s", $5
> else if ($6=="a")
> printf "%-5s", $7
> else
> printf "%-5s", " "
>
> if ($4=="x")
> printf "%-5s", $5
> else if ($6=="x")
> printf "%-5s", $7
> else
> printf "%-5s", " "
>
> if ($4=="y")
> printf "%-5s", $5
> else if ($6=="y")
> printf "%-5s", $7
> else
> printf "%-5s", " "
>
> if ($4=="c")
> printf "%-5s\n", $5
> else if ($6=="c")
> printf "%-5s\n", $7
> else
> print ""
> }
> ' tabfile
a x y c
13435 830169 830264 95 16
09433 835620 835672 46
30945 838405 838620 21 19
94853 850475 850660 15
04958 865700 865978 98 16
Which provides the desired output. This one-pass approach will also be quite efficient for 450,000 lines of input. Since this is a bit long for a command-line script, you can simply put it in an awk script and call it with the filename. Let me know if you have questions.
As A Script File
Using as a script file, simply put the contents in a file and make it executable, e.g.
#!/usr/bin/awk -f
BEGIN { FS="\t" }
FNR==1 {
print "\t\t\t a x y c"
}
{
printf "%-8s%8s%8s ", $1, $2, $3
if ($4=="a")
printf "%-5s", $5
else if ($6=="a")
printf "%-5s", $7
else
printf "%-5s", " "
if ($4=="x")
printf "%-5s", $5
else if ($6=="x")
printf "%-5s", $7
else
printf "%-5s", " "
if ($4=="y")
printf "%-5s", $5
else if ($6=="y")
printf "%-5s", $7
else
printf "%-5s", " "
if ($4=="c")
printf "%-5s\n", $5
else if ($6=="c")
printf "%-5s\n", $7
else
print ""
}
Saved as awkscript
you would chmod x awkscript
and then run:
$ ./awkscript tabfile
a x y c
13435 830169 830264 95 16
09433 835620 835672 46
30945 838405 838620 21 19
94853 850475 850660 15
04958 865700 865978 98 16
CodePudding user response:
input="\
13435 830169 830264 a 95 y 16
09433 835620 835672 x 46
30945 838405 838620 a 21 c 19
94853 850475 850660 y 15
04958 865700 865978 c 16 a 98
"
with awk
:
printf '\t\t\ta\tx\ty\tc\n'
echo -n "$input" |
awk -v vars='a x y c' '
BEGIN {NV = split(vars,V)}
{
s = $1 "\t" $2 "\t" $3;
delete a;
for(i = 4; i < NF; i = i 2) a[$i] = $(i 1);
for(i = 1; i <= NV; i ) s = s "\t" a[V[i]];
print s
}
'
with ruby:
printf '\t\t\ta\tx\ty\tc\n'
echo -n "$input" |
vars='a x y c' ruby -ane '
BEGIN{v = ENV["vars"].split};
h = Hash[*$F[3..-1]];
puts $F[0..2].concat(v.map{|v| h[v]}).join("\t")
'
outputs:
a x y c
13435 830169 830264 95 16
09433 835620 835672 46
30945 838405 838620 21 19
94853 850475 850660 15
04958 865700 865978 98 16
CodePudding user response:
In perl
:
$ perl -lane '
BEGIN { print join("\t", "", "", "", "a", "x", "y", "c"); }
my %vars = @F[3..$#F];
print join("\t", @F[0..2], @vars{qw/a x y c/});
' input.tsv
a x y c
13435 830169 830264 95 16
09433 835620 835672 46
30945 838405 838620 21 19
94853 850475 850660 15
04958 865700 865978 98 16
The fourth column and all following are taken as key/value pairs for a hash table, and then the variable values that are present in it are extracted in the right order, along with the first three columns. Makes heavy use of slices.