I'm trying to match a pattern in lines of an HTML file.
This is a snippet of the file:
<tr>
<td>CBB <span> • </span> CB <span> • </span> LTSB</td>
<td>2022-09-13</td>
<td>14393.5356</td>
<td><a href="https://support.microsoft.com/help/5017305" target="_blank" data-linktype="external">KB5017305</a></td>
</tr>
<tr>
<td>CBB <span> • </span> CB <span> • </span> LTSB</td>
<td>2022-08-09</td>
<td>14393.5291</td>
<td><a href="https://support.microsoft.com/help/5016622" target="_blank" data-linktype="external">KB5016622</a></td>
</tr>
<tr>
<td>CBB <span> • </span> CB <span> • </span> LTSB</td>
<td>2022-07-12</td>
<td>14393.5246</td>
<td><a href="https://support.microsoft.com/help/5015808" target="_blank" data-linktype="external">KB5015808</a></td>
</tr>
<tr>
<td>CBB <span> • </span> CB <span> • </span> LTSB</td>
<td>2022-06-14</td>
<td>14393.5192</td>
<td><a href="https://support.microsoft.com/help/5014702" target="_blank" data-linktype="external">KB5014702</a></td>
</tr>
<tr>
And this is the code that I'm running:
import re

with open('file.html') as htmltext:
    htmldata = htmltext.readlines()
pattern = "([\r\n].*?)(?:=?\r|\n)(.*?(?:14393).*)"
for data in htmldata:
    matchedx = re.search(pattern, data)
    if matchedx:
        print(matchedx)
The regex pattern is meant to match a line containing a string and also return the previous line.
Checking the regex at https://regex101.com/r/7vI31a/1 returns matches; however, when I run it in Python, no matches are found.
Using this simpler pattern does return matches when run in Python:
pattern = "(14393.*)"
CodePudding user response:
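As jasonharper comments, you need to apply your regular expression to all the data at once. readlines() splits the file into separate one-line strings, so a pattern that spans two lines can never match when each line is searched on its own.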
This works for me:
import re
# data = open('file.html').read()
data = """<tr>
<td>CBB <span> • </span> CB <span> • </span> LTSB</td>
<td>2022-09-13</td>
<td>14393.5356</td>
<td><a href="https://support.microsoft.com/help/5017305" target="_blank" data-linktype="external">KB5017305</a></td>
</tr>
<tr>
<td>CBB <span> • </span> CB <span> • </span> LTSB</td>
<td>2022-08-09</td>
<td>14393.5291</td>
<td><a href="https://support.microsoft.com/help/5016622" target="_blank" data-linktype="external">KB5016622</a></td>
</tr>
<tr>
<td>CBB <span> • </span> CB <span> • </span> LTSB</td>
<td>2022-07-12</td>
<td>14393.5246</td>
<td><a href="https://support.microsoft.com/help/5015808" target="_blank" data-linktype="external">KB5015808</a></td>
</tr>
<tr>
<td>CBB <span> • </span> CB <span> • </span> LTSB</td>
<td>2022-06-14</td>
<td>14393.5192</td>
<td><a href="https://support.microsoft.com/help/5014702" target="_blank" data-linktype="external">KB5014702</a></td>
</tr>
<tr>"""
pattern = re.compile("([\r\n].*?)(?:=?\r|\n)(.*?(?:14393).*)")
matches = re.findall(pattern, data)
for match in matches:
    print(match)
Which prints:
('\n<td>2022-09-13</td>', '<td>14393.5356</td>')
('\n<td>2022-08-09</td>', '<td>14393.5291</td>')
('\n<td>2022-07-12</td>', '<td>14393.5246</td>')
('\n<td>2022-06-14</td>', '<td>14393.5192</td>')
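If you would rather keep reading the file line by line, another option is to remember the previous line yourself instead of asking the regex to span two lines. This is only a minimal sketch: the helper name `lines_with_previous` and its `needle` parameter are made up for illustration, and the sample data is a shortened version of the snippet above.

```python
def lines_with_previous(lines, needle="14393"):
    """Yield (previous_line, matching_line) for every line containing needle."""
    previous = None
    for line in lines:
        line = line.rstrip("\r\n")  # drop the trailing newline kept by readlines()
        if previous is not None and needle in line:
            yield previous, line
        previous = line


# Shortened sample; with a real file you could pass the open file object instead.
sample = """<tr>
<td>2022-09-13</td>
<td>14393.5356</td>
</tr>
<tr>
<td>2022-08-09</td>
<td>14393.5291</td>
</tr>"""

pairs = list(lines_with_previous(sample.splitlines()))
for previous, current in pairs:
    print(previous, current)
```

Because the helper accepts any iterable of lines, it should also work on the list returned by readlines() or directly on the file object, which avoids loading a large file into one string.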