Label the lines that do and do not have a result on the next line

I have a list like this:

#chrom  start   end seq
#chrom  start   end seq
#chrom  start   end seq
chr1    214435102   214435132   AAACCGGTCAGCTT...
chr1    214435135   214435165   TCAATGGACTGTTC...
#chrom  start   end seq 
chr1    214873901   214873931   CCAAATCCCTCAGG...

As you can see, some of the '#chrom' headers have results (the 3rd and 4th) and some do not (the 1st and 2nd).

What I am trying to do is read each line that starts with '#chrom' and then look at the line after it. If the next line also starts with '#chrom', print 0; if it starts with something else, print 1. This should be done for every line that starts with '#chrom', without skipping any. In other words, I am trying to label the headers that have sequences. I am guessing there is an easier way to do it, but what I have so far is two lines of code:

awk '/#chrom/{getline; print}' raw.txt > nextLine.txt
awk '$1 == "#chrom" { print "0" } $1 != "#chrom" { print "1" }' nextLine.txt > labeled.txt

Expected output in labeled.txt:

0
0
1
1

I think the second line of code works well. However, the number of '#chrom' lines in raw.txt does not match the number of lines in nextLine.txt. If you could help me with that, I would appreciate it.

Thank you

CodePudding user response：

This should do it:

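awk 'BEGIN { chrom = 0 }                 # chrom==1 means the previous line was a "#chrom" header
{
   if ($1 == "#chrom") {
      if (chrom == 1) print 0            # header followed by another header
      else chrom = 1
   } else {
      if (chrom == 1) print 1            # header followed by a data line
      chrom = 0
   }
}' raw.txt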
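
As for why the counts in the question do not match: a likely cause is that getline consumes the following line, so when that line is itself a '#chrom' header it is never tested against /#chrom/ and gets no entry of its own. A minimal sketch that reproduces this, assuming a small sample file (the name sample.txt and its placeholder sequences are made up for illustration):

```shell
# Build a small sample resembling the input in the question (hypothetical data).
cat > sample.txt <<'EOF'
#chrom  start   end seq
#chrom  start   end seq
#chrom  start   end seq
chr1    1   31  AAAC
chr1    35  65  TCAA
#chrom  start   end seq
chr1    101 131 CCAA
EOF

# Number of header lines in the input.
headers=$(grep -c '^#chrom' sample.txt)

# Number of lines emitted by the getline approach from the question.
# getline swallows the next record, so a header that directly follows
# another header never triggers the /#chrom/ rule itself.
emitted=$(awk '/#chrom/{getline; print}' sample.txt | wc -l | tr -d ' ')

echo "headers=$headers emitted=$emitted"
```

With this sample it reports 4 headers but only 3 emitted lines, which is the kind of mismatch described in the question. Tracking the previous line instead of reading ahead avoids the problem.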

CodePudding user response：

One awk idea:

awk '
{ if (prev=="#chrom")                 # for 1st line of input prev==""
     print ($1 == "#chrom" ? 0 : 1)   # use ternary operator to determine output
  prev=$1
}
' raw.txt

or as a one-liner:

awk '{if (prev=="#chrom") print ($1 == "#chrom" ? 0 : 1); prev=$1}' raw.txt

This generates:

0
0
1
1

CodePudding user response：

As in life, in software it's much easier to do things based on what HAS happened than on what WILL happen. So don't write requirements based on what the NEXT line of input will be; write them based on what the PREVIOUS line of input was. You'll find it much easier to figure out the matching code, and that code will be simpler than anything that tries to determine the next line of input.

$ cat tst.awk
($1 == "#chrom") && (NR > 1) {
    print ( prev == "#chrom" ? 0 : 1 )
}
{ prev = $1 }
END {
    print ( prev == "#chrom" ? 0 : 1 )
}

$ awk -f tst.awk file
0
0
1
1
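
A quick check of the edge case where the file ends with a '#chrom' header, which is what the END rule above appears to cover. This is only a sketch: it inlines the same rules rather than reading tst.awk, and the file name trailing.txt and its contents are invented for illustration.

```shell
# Hypothetical input whose last line is a header with nothing after it.
cat > trailing.txt <<'EOF'
#chrom  start   end seq
chr1    1   31  AAAC
#chrom  start   end seq
EOF

# Same rules as tst.awk: each header after the first labels the header
# before it, and the END rule labels the final header.
labels=$(awk '
($1 == "#chrom") && (NR > 1) { print (prev == "#chrom" ? 0 : 1) }
{ prev = $1 }
END { print (prev == "#chrom" ? 0 : 1) }
' trailing.txt | paste -sd ' ' -)

echo "$labels"
```

Here the first header is labeled 1 and the trailing header is labeled 0, so every '#chrom' line gets a label. Approaches that only print while processing the line after a header would emit nothing for that last header.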