Add prefix from another line to each string

Time:12-06

I have a file:

# Sequence Data: seqnum=1;seqlen=3142;seqhdr="GUT_GENOME148255_1"
>1_59_1276_-
>2_1339_1842_-
>3_1873_2436_-
>4_2470_2928_-
# Sequence Data: seqnum=2;seqlen=3085;seqhdr="GUT_GENOME148255_2"
>1_3_266_-
>2_256_1038_-
# Sequence Data: seqnum=3;seqlen=3050;seqhdr="GUT_GENOME148255_3"
>1_22_1062_-
>2_1072_1746_-
>3_1767_3017_-
# Sequence Data: seqnum=4;seqlen=2934;seqhdr="GUT_GENOME148255_4"
>1_83_436_-
>2_438_1430_-
>3_1432_1872_-
>4_1986_2933_ 

I'd like to prefix every line starting with > with the seqnum value from the header line above it, so that it looks like this:

# Sequence Data: seqnum=1;seqlen=3142;seqhdr="GUT_GENOME148255_1"
>1-1_59_1276_-
>1-2_1339_1842_-
>1-3_1873_2436_-
>1-4_2470_2928_-
# Sequence Data: seqnum=2;seqlen=3085;seqhdr="GUT_GENOME148255_2"
>2-1_3_266_-
>2-2_256_1038_-
# Sequence Data: seqnum=3;seqlen=3050;seqhdr="GUT_GENOME148255_3"
>3-1_22_1062_-
>3-2_1072_1746_-
>3-3_1767_3017_-
# Sequence Data: seqnum=4;seqlen=2934;seqhdr="GUT_GENOME148255_4"
>4-1_83_436_-
>4-2_438_1430_-
>4-3_1432_1872_-
>4-4_1986_2933_ 

The procedure should run over the entire file. I suspect this could be done with awk, but my attempts so far have failed.

CodePudding user response:

This might work for you (GNU sed):

sed -E '/seqnum=/h;/^>/G;s/^>(.*)\n[^0-9]+([0-9]+).*/>\2-\1/' file

Make a copy of the line containing seqnum in the hold space.

For every line beginning with >, append that copy, then use pattern matching and back references to format the line as required.
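The two steps above can be tried on a small input. This is a minimal sketch, assuming GNU sed; the function name prefix_with_seqnum and the sample seqnum value 7 are made up for illustration.

```shell
# Hypothetical wrapper around the sed script, one command per line.
prefix_with_seqnum() {
  sed -E '
    # Header line: copy it into the hold space.
    /seqnum=/h
    # Line starting with ">": append a newline plus the held header.
    /^>/G
    # Pattern space is now ">rest\n# Sequence Data: seqnum=N;...".
    # \1 is the text after ">", \2 is the first run of digits in the header.
    s/^>(.*)\n[^0-9]+([0-9]+).*/>\2-\1/
  ' "$@"
}

printf '%s\n' \
  '# Sequence Data: seqnum=7;seqlen=3142;seqhdr="GUT_GENOME148255_1"' \
  '>1_59_1276_-' \
  '>2_1339_1842_-' | prefix_with_seqnum
```

The header line passes through unchanged and the two following lines come out as >7-1_59_1276_- and >7-2_1339_1842_-. Note that the first run of digits on the header line is taken as the number, so this relies on nothing numeric appearing before seqnum=.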

CodePudding user response:

awk '
  /^# Sequence Data/ {n++}        # count the header lines seen so far
  /^>/ {sub(/>/, ">" n "-")}      # insert the count and a dash after >
  1                               # print every line
' file

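Self-explanatory, I think: n counts the header lines seen so far, which matches seqnum as long as the values start at 1 and increase by one per header.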

CodePudding user response:

If you are trying to extract the actual seqnum= value and can't guarantee that the numbers increase monotonically, try

awk '/^# Sequence Data:/ {
    s=$0; sub(/.*seqnum=/, "", s); s += 0 }
  /^>/ { sub(/^>/, ">" s "-") } 1' file

The addition of 0 to s forces the value to be a number, which also trims off the non-numeric tail from the value we originally capture ($0 is the entire current input line).
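The numeric coercion can be checked on its own. This is a small sketch; the helper name leading_number is hypothetical, and the input string is what remains of the first sample header after everything up to seqnum= has been removed.

```shell
# Hypothetical helper: print the leading number of a string using awk's
# string-to-number conversion (a string with no leading number becomes 0).
leading_number() {
  awk -v raw="$1" 'BEGIN { s = raw; s += 0; print s }'
}

leading_number '1;seqlen=3142;seqhdr="GUT_GENOME148255_1"'   # prints 1
leading_number 'no digits here'                              # prints 0
```

The second call shows the failure mode: if a header line has no number after seqnum=, the prefix silently becomes 0 rather than raising an error.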
