I'm working with ~1800 whole genome sequences of SARS-CoV-2 and I want to keep only the "EPI_ISL_NC045512" pattern, which is between two "|". This would be my string:
>New|hCoV-19/Belize/BZ-CML-TCMC-BZ002-0820/2020|EPI_ISL_NC045512|2020-08-12NC045512
actcacgcagtataattaataactaattactgtcgttgacaggacacgagtaactcgtctatcttctgcaggctgcttacggtttcgtccgtg
I also need to keep the ">". I tried (>)(. )([EPI. ])(. ), but it didn't work.
CodePudding user response:
A simple one could be this one: \|(EPI([A-Z0-9_]+))\|
Assuming only A-Z, 0-9 and _ appear in your pattern, the result is in group 1 (the outer parentheses). The pipes must be escaped as \| because an unescaped | means alternation.
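As a rough illustration of the group 1 idea, here is a minimal Python sketch. It assumes the pattern is meant to have escaped pipes and a + quantifier, and it uses the header line from the question; the variable names are only illustrative.

```python
import re

# Header line copied from the question.
header = ">New|hCoV-19/Belize/BZ-CML-TCMC-BZ002-0820/2020|EPI_ISL_NC045512|2020-08-12NC045512"

# Escaped pipes match literal "|"; "+" requires at least one ID character.
pattern = re.compile(r"\|(EPI([A-Z0-9_]+))\|")

match = pattern.search(header)
# Group 1 is the outer group: the whole ID starting with "EPI".
accession = match.group(1) if match else None
print(accession)
```

If the ID can contain lowercase letters or other characters, the character class would need to be widened.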
CodePudding user response:
You could use 2 capture groups if you want to keep > in one group and EPI_ISL_NC045512 in another:
(>)[^>]*\|(EPI[^|]*)\|
(>) : Capture > in group 1
[^>]*\| : Optionally match any char except > and then match |
(EPI[^|]*) : Capture EPI followed by any char except | in group 2
\| : Match |
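To show how the two groups could be used to rewrite the header lines, here is a hedged Python sketch. The trailing .* is an addition to the pattern so that the rest of the header is consumed by the replacement, and simplify_header is a hypothetical helper name, not something from the answer.

```python
import re

# The two-group pattern, with ".*" appended (an assumption) to consume
# everything after the ID so the replacement drops it.
HEADER_RE = re.compile(r"(>)[^>]*\|(EPI[^|]*)\|.*")


def simplify_header(line: str) -> str:
    """Return '>EPI_...' for a matching header; leave other lines unchanged."""
    return HEADER_RE.sub(r"\1\2", line)


# Record copied from the question: one header line, one sequence line.
fasta = [
    ">New|hCoV-19/Belize/BZ-CML-TCMC-BZ002-0820/2020|EPI_ISL_NC045512|2020-08-12NC045512",
    "actcacgcagtataattaataactaattactgtcgttgacaggacacgagtaactcgtctatcttctgcaggctgcttacggtttcgtccgtg",
]

cleaned = [simplify_header(line) for line in fasta]
print("\n".join(cleaned))
```

Sequence lines contain no >, so the substitution leaves them as they are; only headers that contain an |EPI...| field are shortened.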