I have a file:
# Sequence Data: seqnum=1;seqlen=3142;seqhdr="GUT_GENOME148255_1"
>1_59_1276_-
>2_1339_1842_-
>3_1873_2436_-
>4_2470_2928_-
# Sequence Data: seqnum=2;seqlen=3085;seqhdr="GUT_GENOME148255_2"
>1_3_266_-
>2_256_1038_-
# Sequence Data: seqnum=3;seqlen=3050;seqhdr="GUT_GENOME148255_3"
>1_22_1062_-
>2_1072_1746_-
>3_1767_3017_-
# Sequence Data: seqnum=4;seqlen=2934;seqhdr="GUT_GENOME148255_4"
>1_83_436_-
>2_438_1430_-
>3_1432_1872_-
>4_1986_2933_
I'd like to add the number corresponding to seqnum to every row starting with >, so that it would look like this:
# Sequence Data: seqnum=1;seqlen=3142;seqhdr="GUT_GENOME148255_1"
>1-1_59_1276_-
>1-2_1339_1842_-
>1-3_1873_2436_-
>1-4_2470_2928_-
# Sequence Data: seqnum=2;seqlen=3085;seqhdr="GUT_GENOME148255_2"
>2-1_3_266_-
>2-2_256_1038_-
# Sequence Data: seqnum=3;seqlen=3050;seqhdr="GUT_GENOME148255_3"
>3-1_22_1062_-
>3-2_1072_1746_-
>3-3_1767_3017_-
# Sequence Data: seqnum=4;seqlen=2934;seqhdr="GUT_GENOME148255_4"
>4-1_83_436_-
>4-2_438_1430_-
>4-3_1432_1872_-
>4-4_1986_2933_
The procedure should go over the entire file. I suspect this could be done with awk, but my attempts so far have not worked.
CodePudding user response:
This might work for you (GNU sed):
sed -E '/seqnum=/h;/^>/G;s/^>(.*)\n[^0-9]+([0-9]+).*/>\2-\1/' file
Make a copy of the line containing seqnum in the hold space. For every line beginning with >, append the copy and, using pattern matching and back references, format as required.
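To make the append step easier to follow, here is a small sketch that stops right after G so you can see what the substitution receives, next to the full command. The function names show_after_append and add_seqnum_sed and the two-line sample are only for illustration, and I'm assuming GNU sed as above.

```shell
# Hypothetical helper: run only the copy (h) and append (G) steps,
# so the pattern space is printed before any substitution happens.
show_after_append() {
  sed -E '/seqnum=/h;/^>/G'
}

# The full command from this answer, reading standard input.
add_seqnum_sed() {
  sed -E '/seqnum=/h;/^>/G;s/^>(.*)\n[^0-9]+([0-9]+).*/>\2-\1/'
}

# Two-line sample taken from the question.
sample='# Sequence Data: seqnum=2;seqlen=3085;seqhdr="GUT_GENOME148255_2"
>1_3_266_-'

# The > line is followed by a copy of the header line.
printf '%s\n' "$sample" | show_after_append

# The > line becomes >2-1_3_266_- and the header is left untouched.
printf '%s\n' "$sample" | add_seqnum_sed
```

The regexp skips the leading non-digits of the appended header and captures the first run of digits, which is the seqnum value as long as nothing before seqnum= contains a digit.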
CodePudding user response:
awk '
/^# Sequence Data/ {n++}        # count the header lines seen so far
/^>/ {sub(/>/, ">" n "-")}      # prefix the current count to each > line
1                               # print every line
' file
Self-explanatory I think.
CodePudding user response:
If you are trying to extract the seqnum= value and can't guarantee that the numbers increase monotonically, try
awk '/^# Sequence Data:/ {
s=$0; sub(/.*seqnum=/, "", s); s += 0 }
/^>/ { sub(/^>/, ">" s "-") } 1' file
The addition of 0 to s forces the value to be a number, which also trims off the non-numeric tail from the value we originally captured ($0 is the entire current input line).
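As a quick check of the extraction, here is the same program wrapped in a function and fed headers whose seqnum values are out of order. The wrapper name add_seqnum_awk and the made-up headers X and Y are assumptions for the sketch, not part of the original file.

```shell
# Hypothetical wrapper around the awk program; reads standard input.
add_seqnum_awk() {
  awk '
    /^# Sequence Data:/ {
      s = $0
      sub(/.*seqnum=/, "", s)   # s now starts with the number, e.g. 7;seqlen=...
      s += 0                    # numeric coercion keeps only the leading digits
    }
    /^>/ { sub(/^>/, ">" s "-") }
    1
  '
}

# Headers with seqnum 7 and then 3, so a simple counter would give 1 and 2.
printf '%s\n' \
  '# Sequence Data: seqnum=7;seqlen=10;seqhdr="X"' \
  '>1_3_266_-' \
  '# Sequence Data: seqnum=3;seqlen=10;seqhdr="Y"' \
  '>1_5_9_-' | add_seqnum_awk
```

The > lines should come out prefixed with 7- and 3-, taken from the headers themselves rather than from their position in the file.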