Why is the conditional statement in my regex pattern getting rid of other matches when it shouldn't?

Time:07-11

Here is my text (it will be looking through other text as well, but this is what I am having trouble with):

<a href="/wiki/Basketball" title="Basketball">basketball</a>, the 

<li ><a href="//cs.wikipedia.org/wiki/" title="" lang="cs" hreflang="cs">esky</a><

<li ><a href="//da.wikipedia.org/wiki/" title="" lang="da" hreflang="da"><b>Dansk</b></a></li>

I'm trying to get 3 matches, each with 2 groups (shown here separated by a semicolon):

/wiki/Basketball;basketball
//cs.wikipedia.org/wiki/;esky
//da.wikipedia.org/wiki/;Dansk

With this pattern: (?<=<a href=")(.*?)".*?>([\w\s\./,0-9]*?)<, I can match the first two correctly. To try to also get the last match, I added in a conditional to check for the <b>: (?<=<a href=")(.*?)".*?>(<?)(?(2)b>)([\w\s\./,0-9]*?)<. This gets the last match correctly, but now the first two don't match.

Can you please explain why this happens and what the correct way to do this is?

CodePudding user response:

To be honest, I have trouble understanding conditionals myself. I asked a question about them, but didn't get an answer.
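On the conditional itself, here is one likely explanation, checked only against the three sample lines from the question. The group (<?) always takes part in the match, even when it matches the empty string, so (?(2)b>) always takes its "yes" branch and demands 'b>'. Writing (<)? instead makes the whole group optional: group 2 stays unset when there is no '<', and the conditional is skipped. The names text, always_set, optional_group and pairs below are made up for illustration.

```python
import re

# The three sample lines from the question, joined into one string.
text = "\n".join([
    '<a href="/wiki/Basketball" title="Basketball">basketball</a>, the ',
    '<li ><a href="//cs.wikipedia.org/wiki/" title="" lang="cs" hreflang="cs">esky</a><',
    '<li ><a href="//da.wikipedia.org/wiki/" title="" lang="da" hreflang="da"><b>Dansk</b></a></li>',
])

# Pattern from the question: (<?) is always set (possibly to ''),
# so the conditional (?(2)b>) always requires 'b>'.
always_set = re.compile(r'(?<=<a href=")(.*?)".*?>(<?)(?(2)b>)([\w\s\./,0-9]*?)<')

# Variant: (<)? leaves group 2 unset when no '<' follows,
# so the conditional only requires 'b>' after an actual '<'.
optional_group = re.compile(r'(?<=<a href=")(.*?)".*?>(<)?(?(2)b>)([\w\s\./,0-9]*?)<')

def pairs(pattern, s):
    """Return 'href;text' for every match, using groups 1 and 3."""
    return [f"{m.group(1)};{m.group(3)}" for m in pattern.finditer(s)]

print(pairs(always_set, text))      # ['//da.wikipedia.org/wiki/;Dansk']
print(pairs(optional_group, text))  # all three pairs
```

This only handles a single <b> wrapper, as in the question; other inline tags would need a different pattern or an HTML parser.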

I used a negated character class ([^...]) instead and did this:

re.findall('(?<=<a href=")(.*?)".*>([^>]+)<',string)

or

re.findall('(?<=<a href=")(.*?)".*(?<=>)([^>]+)(?=<)',string)

In both cases the second group matches a non-empty string that follows '>', does not contain '>', and precedes '<'. Because '.*' is greedy, it should match the last non-empty string between tags.

By adding '?' to the '.*' after the first group (making it lazy), the second group should match the first non-empty string between tags:

re.findall('(?<=<a href=")(.*?)".*?(?<=>)([^>]+)(?=<)',string)
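To see the difference between the greedy and the lazy version, here is a small sketch that runs both on the sample lines from the question, one line at a time, and on one extra line with two text runs inside the link. The extra line (nested) and the helper pairs are hypothetical, added only for illustration. Since [^>] also matches newlines, applying the patterns per line avoids matches that spill over into the next line.

```python
import re

# Sample lines from the question.
lines = [
    '<a href="/wiki/Basketball" title="Basketball">basketball</a>, the ',
    '<li ><a href="//cs.wikipedia.org/wiki/" title="" lang="cs" hreflang="cs">esky</a><',
    '<li ><a href="//da.wikipedia.org/wiki/" title="" lang="da" hreflang="da"><b>Dansk</b></a></li>',
]
# Hypothetical line with two text runs inside one link.
nested = '<a href="/x"><i>one</i><b>two</b></a>'

greedy = re.compile(r'(?<=<a href=")(.*?)".*>([^>]+)<')              # last text run
lazy = re.compile(r'(?<=<a href=")(.*?)".*?(?<=>)([^>]+)(?=<)')      # first text run

def pairs(pattern, line):
    """Return 'href;text' strings for every match in a single line."""
    return [f"{href};{text}" for href, text in pattern.findall(line)]

for line in lines:
    # Each sample link has a single text run, so both patterns agree here.
    print(pairs(greedy, line), pairs(lazy, line))

print(pairs(greedy, nested))  # ['/x;two']
print(pairs(lazy, nested))    # ['/x;one']
```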

Separately, the following snippet should catch all non-empty strings between tags:

re.findall('(?<=>)([^>]+)(?=<)',string)
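As a quick check of that last pattern, the sketch below applies it line by line to the sample text and to one made-up snippet (the variable snippet is hypothetical). It is run per line because [^>] also matches newlines, so on a multi-line string a match could run across a line break.

```python
import re

between_tags = re.compile(r'(?<=>)([^>]+)(?=<)')

# Sample lines from the question.
samples = [
    '<a href="/wiki/Basketball" title="Basketball">basketball</a>, the ',
    '<li ><a href="//cs.wikipedia.org/wiki/" title="" lang="cs" hreflang="cs">esky</a><',
    '<li ><a href="//da.wikipedia.org/wiki/" title="" lang="da" hreflang="da"><b>Dansk</b></a></li>',
]

for line in samples:
    print(between_tags.findall(line))  # ['basketball'], ['esky'], ['Dansk']

# Hypothetical snippet with several text runs between tags.
snippet = '<p>Hello <b>bold</b> world</p>'
print(between_tags.findall(snippet))  # ['Hello ', 'bold', ' world']
```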

I hope I'm not wrong, but if I am, please tell me.
