Home > Blockchain >  Awk output as input parameter in pipe
Awk output as input parameter in pipe

Time:12-08

I'm new to shell scripting and trying to figure out why my code won't work. I am trying to use the output from an awk statement as an input parameter in a pipe. I think this is pretty simple so hopefully someone can help me out!

I am starting with a very simple file called sample.bed that looks like this:

chr1 35031657 35037706
chr1 67544575 67550598
chr1 71979382 71985425
chr1 81404889 81410942
chr1 84518073 84524089
chr1 86214203 86220231

I have two variables, $P and $G, where $P is the path to the sample.bed file and $G is just set to "hg38".

The first thing I do is check the number of columns in the file. Since it is 3, and since $G is set to "hg38", the variable named $twobit is set.

num_col=$(cat $P | awk '{print NF; exit}')

if [ $num_col==3 ]
then
echo "3 column bed file, no directional considerations for sequences"
    if (( $G=="hg38" ))
    then
        twobit="https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.2bit"
    fi
fi

So far so good.

twoBitToFa is a tool that accepts 4-column files, the 4th column being a unique identifier than is a combination of the first 3 columns. I attempt to create that necessary 4th column with awk '{print $0" "$1":"$2"-"$3}' $P, which shows up in my terminal correctly:

chr1 35031657 35037706 chr1:35031657-35037706
chr1 67544575 67550598 chr1:67544575-67550598
chr1 71979382 71985425 chr1:71979382-71985425
chr1 81404889 81410942 chr1:81404889-81410942
chr1 84518073 84524089 chr1:84518073-84524089
chr1 86214203 86220231 chr1:86214203-86220231

And here's where I am stuck. I want to use this printed output as input for the next step that uses the twoBitToFa tool. This tool needs the 4-column file as input and the $twobit URL to output a .fa file.

I have tried to save the awk output to a variable, but when I do that I lose the data table structure and everything prints on one line, like so:

new_dat=$(awk '{print $0" "$1":"$2"-"$3}' $P)
echo $new_dat

chr1 35031657 35037706 chr1:35031657-35037706 chr1 67544575 67550598 chr1:67544575-67550598 chr1 71979382 71985425 chr1:71979382-71985425 chr1 81404889 81410942 chr1:81404889-81410942 chr1 84518073 84524089 chr1:84518073-84524089 chr1 86214203 86220231 chr1:86214203-86220231

I tried using passing this new_dat variable on anyway, like so, but get an error that says "string option -bed must have a value"

twoBitToFa -bed $new_dat -udcDir=. $twobit stdout > $P.fa

I tried skipping saving the new variable and using the pipe operator to pass the data to twoBitToFa, which does sort of work, because it does create the .fa file, but it's trying to extract all of chr1, I guess because the sequence ranges aren't being passed properly.

awk '{print $0" "$1":"$2"-"$3}' $P | twoBitToFa -udcDir=. $twobit stdout > $P.fa

So, I think I have issues with printing and/or with how I am attempting to save my new_dat variable. Is "print" the wrong thing to be using with awk?

I don't want to write the 4-column file on my machine using > , since this would create many unnecessary intermediate files down the line. Ideally, my whole script would look something like this:

num_col=$(cat $P | awk '{print NF; exit}')

if [ $num_col==3 ]
then
echo "3 column bed file, no directional considerations for sequences"
    if (( $G=="hg38" ))
    then
        twobit="https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.2bit"
    fi
awk '{print $0" "$1":"$2"-"$3}' $P | twoBitToFa -??? -udcDir=. $twobit stdout > $P.fa
fi

Where the ??? is somehow the output from the awk statement. Thanks in advance for any help and suggestions!

CodePudding user response:

Using -bed=stdin seems to work:

p=./sample.bed

awk '{print $0, $1":"$2"-"$3}' "$p" |
twoBitToFa -bed=stdin -udcDir=. "$twobit" stdout > "${p%.bed}".fa
  • Related