Extracting two strings from between two characters. Why doesn't my regex match and how can I im

Time:05-28

I'm learning about regular expressions and I want to extract a string from a text that has the following characteristics:

  • It always begins with the letter C, in either lowercase or uppercase, which is then followed by a number of hexadecimal characters (meaning it can contain the letters A to F and the digits 1 to 9, with no zeros included).
  • After those hexadecimal characters comes a letter P, also either in lowercase or uppercase
  • And then some more hexadecimal characters (again, excluding 0).

Examples that would match would be:

c45AFP2
CAPF
c56Bp26
CA6C22pAAA

Examples that wouldn't match would be:

BCA6C22pAAA  # It doesn't begin with C
c56Bp  # There aren't any characters after P
c45AF0P2  # Contains a zero

I'm using Python and I want a regex to extract two strings: the one between the characters C and P, and the one that comes after P.

So far I've come up with this:

(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*

A breakdown would be:

(?<=\A[cC]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [cC] and that [cC] must be at the beginning of the string

[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times

(?<=[pP]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [pP]

[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times

But with the above regex I can't match any of the strings!

When I insert a | in between (?<=[cC])[a-fA-F1-9]* and (?<=[pP])[a-fA-F1-9]* it works.

Meaning the below regex works:

(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*

I know that | means that it should match at most one of the specified regex expressions. But it's non greedy and it returns the first match that it finds. The remaining expressions aren’t tested, right?

But using | means the string BCA6C22pAAA is a partial match to AAA since it comes after P, even though the first assertion isn't true, since it doesn't begin with a C.

That shouldn't be the case. I want it to only match if all conditions explained in the beginning are true.
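
To make the behaviour of the alternation concrete, here is a minimal Python sketch (the variable names are only illustrative). It runs the | version against the string that should be rejected:

```python
import re

# The alternation version from the question: either branch may match on its own.
alternation = re.compile(r"(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*")

# This string should be rejected because it does not begin with C.
matches = alternation.findall("BCA6C22pAAA")
print(matches)  # ['A6C22', 'AAA']
```

Each branch is tried on its own at every position, so a C or a P anywhere in the string is enough to produce a match. Nothing ties the two parts to each other or to the start of the string.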

Could someone explain to me why my first attempt doesn't produce the result I want? Also, how can I improve my regex?

I still need it to:

  • Not be a match if the string contains the number 0
  • Only be a match if ALL conditions are met

Thank you

CodePudding user response:

To match both groups before and after P or p:

(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))
  • (?<=^[Cc]) - Positive lookbehind. Requires an uppercase or lowercase C at the start of the line
  • [1-9a-fA-F]+ - Matches hexadecimal characters (excluding 0) one or more times
  • (?=[Pp] - Positive lookahead for an uppercase or lowercase P
  • ([1-9a-fA-F]+$) - Capture group for one or more hexadecimal characters following the P, up to the end of the line
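
As a rough sketch of how this pattern could be used from Python, assuming a + quantifier after both character classes (the bullets describe them as "one or more"): the helper name split_parts is made up for illustration. The first hex part is the overall match, and the second part comes from the capture group inside the lookahead.

```python
import re

# Lookaround pattern from this answer, with "+" after each character class.
pattern = re.compile(r"(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))")

def split_parts(text):
    """Return (before_p, after_p), or None when text does not fit the format."""
    m = pattern.search(text)
    if m is None:
        return None
    # group(0) is the part between C and P; group(1) is captured in the lookahead.
    return m.group(0), m.group(1)

print(split_parts("c45AFP2"))      # ('45AF', '2')
print(split_parts("CA6C22pAAA"))   # ('A6C22', 'AAA')
print(split_parts("BCA6C22pAAA"))  # None
print(split_parts("c56Bp"))        # None
```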

CodePudding user response:

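Your main problem is that you're using a lookbehind (?<=[pP]) for something that lies ahead, which will never work: at that position the preceding character is either the leading C or a character just consumed by [a-fA-F1-9]*, and neither can be a P. You need a lookahead (?=...).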
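
A small Python check of this point, using the regex from the question unchanged (the names broken and examples are only illustrative):

```python
import re

# First attempt from the question. The second lookbehind inspects the character
# just before the current position, which is the leading C or a hex character
# consumed by the first [a-fA-F1-9]*, so it can never be P.
broken = re.compile(r"(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*")

examples = ["c45AFP2", "CAPF", "c56Bp26", "CA6C22pAAA"]
results = [broken.search(s) for s in examples]
print(results)  # [None, None, None, None]
```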

Also, the final quantifier should be +, not *, because you require at least one trailing character after the p.

The final mistake is that you're not capturing anything, only matching. Put what you want to capture inside parentheses (capture groups), which also means you can remove all the lookarounds.

If you use the case insensitive flag, it makes the regex much smaller and easier to read.

A working regex that captures the 2 hex parts in groups 1 and 2 is:

(?i)^c([a-f1-9]*)p([a-f1-9]+)

See live demo.
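
For reference, a minimal Python sketch of this capturing approach, run against the examples from the question. It assumes the final quantifier is + (at least one character after the p); the names HEX_PARTS and extract are made up for illustration.

```python
import re

# Case-insensitive pattern; group 1 is the part between C and P, group 2 the part after P.
HEX_PARTS = re.compile(r"(?i)^c([a-f1-9]*)p([a-f1-9]+)")

def extract(text):
    """Return (before_p, after_p), or None when text does not match."""
    m = HEX_PARTS.search(text)
    return m.groups() if m else None

should_match = ["c45AFP2", "CAPF", "c56Bp26", "CA6C22pAAA"]
should_fail = ["BCA6C22pAAA", "c56Bp", "c45AF0P2"]
for s in should_match + should_fail:
    print(s, extract(s))

# Caveat: the pattern has no end anchor, so a 0 after a valid prefix is ignored.
print(extract("c45AFP20"))              # ('45AF', '2')
print(HEX_PARTS.fullmatch("c45AFP20"))  # None
```

If a trailing 0 (or any other character) after the second hex part must also cause a rejection, one option is to use fullmatch as in the last line, or to append $ to the pattern.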

Unless you need \A, prefer ^ over it: ^ is easier to read, and in multiline mode ^ matches at the start of every line, which is what many situations and tools expect, whereas \A only ever matches at the very start of the input. I've used ^.
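
The difference only shows up on multi-line input. A short Python sketch (the sample text is invented for illustration; in Python, ^ matches at every line start only when re.MULTILINE is set):

```python
import re

text = "c45AFP2\nCAPF"

# ^ with MULTILINE matches at the start of each line.
per_line = re.findall(r"^c([a-f1-9]*)p([a-f1-9]+)", text, flags=re.I | re.M)
print(per_line)  # [('45AF', '2'), ('A', 'F')]

# \A matches only at the very start of the whole input, even with MULTILINE.
start_only = re.findall(r"\Ac([a-f1-9]*)p([a-f1-9]+)", text, flags=re.I | re.M)
print(start_only)  # [('45AF', '2')]
```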
