Python regex not returning any matches

I'm trying to match a pattern in lines of an HTML file.

This is a snippet of the file:

<tr>
<td>CBB <span> &bull; </span> CB <span> &bull; </span> LTSB</td>
<td>2022-09-13</td>
<td>14393.5356</td>
<td><a href="https://support.microsoft.com/help/5017305" target="_blank" data-linktype="external">KB5017305</a></td>
</tr>
<tr>
<td>CBB <span> &bull; </span> CB <span> &bull; </span> LTSB</td>
<td>2022-08-09</td>
<td>14393.5291</td>
<td><a href="https://support.microsoft.com/help/5016622" target="_blank" data-linktype="external">KB5016622</a></td>
</tr>
<tr>
<td>CBB <span> &bull; </span> CB <span> &bull; </span> LTSB</td>
<td>2022-07-12</td>
<td>14393.5246</td>
<td><a href="https://support.microsoft.com/help/5015808" target="_blank" data-linktype="external">KB5015808</a></td>
</tr>
<tr>
<td>CBB <span> &bull; </span> CB <span> &bull; </span> LTSB</td>
<td>2022-06-14</td>
<td>14393.5192</td>
<td><a href="https://support.microsoft.com/help/5014702" target="_blank" data-linktype="external">KB5014702</a></td>
</tr>
<tr>

And this is the code that I'm running:

import re

# Read the file as a list of separate lines.
with open('file.html') as htmltext:
    htmldata = htmltext.readlines()

pattern = "([\r\n].*?)(?:=?\r|\n)(.*?(?:14393).*)"

# Search each line on its own.
for data in htmldata:
    matchedx = re.search(pattern, data)
    if matchedx:
        print(matchedx)

The pattern is meant to match a line containing a given string and also capture the line before it.
Testing the regex at https://regex101.com/r/7vI31a/1 returns matches, but running it in Python finds none.

A simpler pattern does return matches when run in Python:

pattern = "(14393.*)"

CodePudding user response:

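As jasonharper comments, you need to apply your regular expression to the whole file contents rather than to one line at a time. readlines() returns a list of separate lines, each ending in its own newline, so a pattern that spans a line break never sees two lines together. Your simpler pattern works because it only needs text from a single line.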
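To see the difference, here is a minimal sketch that runs the same pattern against each line and then against the joined text. The short `text` sample is a stand-in for the real file, and `splitlines(keepends=True)` is used here only to imitate what readlines() would return.

```python
import re

# Same pattern as in the question, written as a raw string.
pattern = r"([\r\n].*?)(?:=?\r|\n)(.*?(?:14393).*)"

# Hypothetical sample standing in for the contents of file.html.
text = "<tr>\n<td>2022-09-13</td>\n<td>14393.5356</td>\n</tr>\n"

# Imitate readlines(): every piece ends with its own newline.
lines = text.splitlines(keepends=True)

# Searching line by line: nothing follows the newline, so nothing matches.
per_line_hits = [line for line in lines if re.search(pattern, line)]
print(per_line_hits)

# Searching the whole text: the pattern can now span two lines.
whole = re.search(pattern, text)
print(whole.groups())
```

The first print shows an empty list, while the second shows the date line and the version line as a pair.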

This works for me:

import re
# data = open('file.html').read()
data = """<tr>
<td>CBB <span> &bull; </span> CB <span> &bull; </span> LTSB</td>
<td>2022-09-13</td>
<td>14393.5356</td>
<td><a href="https://support.microsoft.com/help/5017305" target="_blank" data-linktype="external">KB5017305</a></td>
</tr>
<tr>
<td>CBB <span> &bull; </span> CB <span> &bull; </span> LTSB</td>
<td>2022-08-09</td>
<td>14393.5291</td>
<td><a href="https://support.microsoft.com/help/5016622" target="_blank" data-linktype="external">KB5016622</a></td>
</tr>
<tr>
<td>CBB <span> &bull; </span> CB <span> &bull; </span> LTSB</td>
<td>2022-07-12</td>
<td>14393.5246</td>
<td><a href="https://support.microsoft.com/help/5015808" target="_blank" data-linktype="external">KB5015808</a></td>
</tr>
<tr>
<td>CBB <span> &bull; </span> CB <span> &bull; </span> LTSB</td>
<td>2022-06-14</td>
<td>14393.5192</td>
<td><a href="https://support.microsoft.com/help/5014702" target="_blank" data-linktype="external">KB5014702</a></td>
</tr>
<tr>"""

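# Raw string, so the regex engine (not Python) interprets \r and \n.
pattern = re.compile(r"([\r\n].*?)(?:=?\r|\n)(.*?(?:14393).*)")
# Each match is a tuple: the previous line (with its leading newline) and the matching line.
matches = pattern.findall(data)
for match in matches:
    print(match)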

Which prints:

('\n<td>2022-09-13</td>', '<td>14393.5356</td>')
('\n<td>2022-08-09</td>', '<td>14393.5291</td>')
('\n<td>2022-07-12</td>', '<td>14393.5246</td>')
('\n<td>2022-06-14</td>', '<td>14393.5192</td>')
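If you would rather keep reading the file line by line, one alternative is to skip the multi-line regex and simply remember the previous line as you go. This is only a sketch under that assumption: the helper name `lines_with_previous` and its `needle` parameter are made up for illustration, and it uses a plain substring test rather than a regular expression.

```python
def lines_with_previous(lines, needle="14393"):
    """Yield (previous_line, matching_line) for each line containing needle."""
    previous = None
    for line in lines:
        # Drop the trailing newline that readlines() leaves on each line.
        current = line.rstrip("\r\n")
        if needle in current and previous is not None:
            yield previous, current
        previous = current


# Hypothetical sample in the shape readlines() would return.
sample = [
    "<tr>\n",
    "<td>2022-09-13</td>\n",
    "<td>14393.5356</td>\n",
    "</tr>\n",
]

pairs = list(lines_with_previous(sample))
for pair in pairs:
    print(pair)
```

With a real file you could pass the open file object directly, for example `lines_with_previous(htmltext)`, since iterating over a file yields its lines one at a time.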