I need to keep anything between a certain pattern

Time:11-09

I'm working with ~1800 whole genome sequences of SARS-CoV-2 and I want to keep only the "EPI_ISL_NC045512" pattern, which is between two "|". This would be my string:

>New|hCoV-19/Belize/BZ-CML-TCMC-BZ002-0820/2020|EPI_ISL_NC045512|2020-08-12NC045512
actcacgcagtataattaataactaattactgtcgttgacaggacacgagtaactcgtctatcttctgcaggctgcttacggtttcgtccgtg

I also need to keep the ">". I tried (>)(. )([EPI. ])(. ) but it didn't work.

CodePudding user response:

A simple one could be this: \|(EPI([A-Z0-9_]+))\|

The pipes have to be escaped, because an unescaped | means alternation in a regex. Assuming your ID contains only A-Z, 0-9 and _, the result is in group 1 (the outer pair of parentheses).
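As a minimal sketch of how that could be used, here is the same idea in Python, with the pipes escaped and a + quantifier after the character class. The choice of Python and the names ID_PATTERN, header and epi_id are only for illustration; the header string is the one from the question.

```python
import re

# Assumption: the ID contains only A-Z, 0-9 and underscores.
# The pipes are escaped so they match literal "|" characters.
ID_PATTERN = re.compile(r"\|(EPI[A-Z0-9_]+)\|")

header = ">New|hCoV-19/Belize/BZ-CML-TCMC-BZ002-0820/2020|EPI_ISL_NC045512|2020-08-12NC045512"

match = ID_PATTERN.search(header)
# Group 1 holds the ID without the surrounding pipes.
epi_id = match.group(1) if match else None
print(epi_id)  # EPI_ISL_NC045512
```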

CodePudding user response:

You could use 2 capture groups if you want to keep > in one group and EPI_ISL_NC045512 in another:

(>)[^>]*\|(EPI[^|]*)\|
  • (>) Capture > in group 1
  • [^>]*\| Optionally match any char except > and then match |
  • (EPI[^|]*) Capture EPI followed by any char except | in group 2
  • \| Match |
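To turn each header into just ">EPI_ISL_NC045512", one option is to replace the match with the two groups. The sketch below assumes Python and a FASTA-style input where header lines start with ">". It also appends .* to the pattern so the text after the last | is dropped as well, which is an addition to the pattern above. The function name shorten_headers is hypothetical.

```python
import re

# Pattern from the answer, plus a trailing ".*" (an addition) so the rest
# of the header line after the closing "|" is consumed too.
HEADER_PATTERN = re.compile(r"(>)[^>]*\|(EPI[^|]*)\|.*")

def shorten_headers(lines):
    """Yield lines with headers reduced to '>' plus the EPI ID."""
    for line in lines:
        line = line.rstrip("\n")
        # Only header lines are rewritten; sequence lines pass through.
        if line.startswith(">"):
            line = HEADER_PATTERN.sub(r"\1\2", line)
        yield line

records = [
    ">New|hCoV-19/Belize/BZ-CML-TCMC-BZ002-0820/2020|EPI_ISL_NC045512|2020-08-12NC045512",
    "actcacgcagtataattaataactaattactgtcgttgacaggacacgagtaactcgtctatcttctgcaggctgcttacggtttcgtccgtg",
]

for out in shorten_headers(records):
    print(out)
```

Processing the file line by line matters here, because [^>]* and [^|]* also match newlines and could otherwise run into the sequence data. Headers that do not contain an EPI field are left unchanged.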
