I am trying to find an efficient way to extract substrings from a file in Python. I have a file with extracted HTML code:
<td><a href="/archiv/zivotopisy/2022/6/Zivotopis-OJVLA-20220624132548.pdf" target="_blank">Jitka Horáková</a></td>
<td><a href="/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf" target="_blank">Bohumil Tobolka</a></td>
<td><a href="/archiv/zivotopisy/2022/5/Zivotopis-UNBLA.pdf" target="_blank">Stanislava Rousová, Ing.</a></td>
<td><a href="/archiv/zivotopisy/2022/4/Zivotopis-NYBCF-20220407134152.pdf" target="_blank">Ladislav Macháč</a></td>
<td><a href="/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf" target="_blank">Dana Macháčová</a></td>
but I keep failing at the extraction. My goal is to have another txt file containing only the clean links, e.g. "/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf", without the HTML syntax. That is, each extracted string starts with /archiv and ends with .pdf.
I tried exploring a for-each approach and regex, but had no luck since I am a beginner. I would be happy for any advice.
CodePudding user response:
Use the urllib.parse.urlparse function to parse the URL. Here's an example:
from urllib.parse import urlparse
url_str = 'https://example.com'
url_obj = urlparse(url_str)
if not (url_obj.scheme and url_obj.path):  # validity check
    print(f'The URL {url_str} is invalid!')
else:
    print(f'The URL {url_str} is valid!')
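Note that the links in the question are relative paths, so they carry no scheme. A minimal sketch of what urlparse returns for one of them, assuming you already have the bare link as a string (the names `link`, `parts` and `is_pdf_link` are only illustrative):

```python
from urllib.parse import urlparse

# One of the relative links from the question (no scheme, no host).
link = '/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf'
parts = urlparse(link)

print(repr(parts.scheme))  # empty string, because the link is relative
print(parts.path)          # the whole link ends up in .path

# A path-based check that matches the "starts with /archiv, ends with .pdf" goal.
is_pdf_link = parts.path.startswith('/archiv') and parts.path.endswith('.pdf')
print(is_pdf_link)
```

So a check that requires a scheme would report these links as invalid. urlparse also does not pull links out of HTML; it only splits a URL string you already have.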
CodePudding user response:
Using plain Python, we can do this easily without any libraries:
text = """
<td><a href="/archiv/zivotopisy/2022/6/Zivotopis-OJVLA-20220624132548.pdf" target="_blank">Jitka Horáková</a></td>
<td><a href="/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf" target="_blank">Bohumil Tobolka</a></td>
<td><a href="/archiv/zivotopisy/2022/5/Zivotopis-UNBLA.pdf" target="_blank">Stanislava Rousová, Ing.</a></td>
<td><a href="/archiv/zivotopisy/2022/4/Zivotopis-NYBCF-20220407134152.pdf" target="_blank">Ladislav Macháč</a></td>
<td><a href="/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf" target="_blank">Dana Macháčová</a></td>
"""
links = [line.split('<a href="')[1].split('"')[0] for line in text.split('\n') if '<a href="' in line]
print(links)
The output:
['/archiv/zivotopisy/2022/6/Zivotopis-OJVLA-20220624132548.pdf', '/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf', '/archiv/zivotopisy/2022/5/Zivotopis-UNBLA.pdf', '/archiv/zivotopisy/2022/4/Zivotopis-NYBCF-20220407134152.pdf', '/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf']
What this does is split the text on '\n' and then, for each line that contains a link, take the text between the quotes of the href attribute. The result is a list called 'links'.
To write the list to a file, with each link on its own line:
with open('test.txt', 'w') as f:  # the file is closed automatically
    for link in links:
        f.write(link + '\n')  # one link per line
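The split approach relies on every link sitting on its own line with exactly `<a href="` in front of it. If the HTML is less regular, a possible alternative is the standard library's html.parser. This is a different technique from the one above, shown only as a sketch; the class name `HrefCollector` and the variable names are made up for illustration:

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


sample = '''
<td><a href="/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf" target="_blank">Bohumil Tobolka</a></td>
<td><a href="/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf" target="_blank">Dana Macháčová</a></td>
'''

collector = HrefCollector()
collector.feed(sample)

# Keep only the links that start with /archiv and end with .pdf.
pdf_links = [l for l in collector.links
             if l.startswith('/archiv') and l.endswith('.pdf')]
print(pdf_links)
```

The resulting `pdf_links` list can be written to a file the same way as `links` above.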
CodePudding user response:
re.findall can help:
import re
t = '''
<td><a href="/archiv/zivotopisy/2022/6/Zivotopis-OJVLA-20220624132548.pdf" target="_blank">Jitka Horáková</a></td>
<td><a href="/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf" target="_blank">Bohumil Tobolka</a></td>
<td><a href="/archiv/zivotopisy/2022/5/Zivotopis-UNBLA.pdf" target="_blank">Stanislava Rousová, Ing.</a></td>
<td><a href="/archiv/zivotopisy/2022/4/Zivotopis-NYBCF-20220407134152.pdf" target="_blank">Ladislav Macháč</a></td>
<td><a href="/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf" target="_blank">Dana Macháčová</a></td>
'''
print(re.findall(r'/archiv.*?\.pdf', t))  # lazy match from /archiv up to the first .pdf
['/archiv/zivotopisy/2022/6/Zivotopis-OJVLA-20220624132548.pdf', '/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf', '/archiv/zivotopisy/2022/5/Zivotopis-UNBLA.pdf', '/archiv/zivotopisy/2022/4/Zivotopis-NYBCF-20220407134152.pdf', '/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf']
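Since the goal was to go from one file to another, here is a rough end-to-end sketch built on the same re.findall idea. The function name `extract_links` and the file names `input.html` and `links.txt` are assumptions for the example, and the sample file is created in a temporary directory only so the snippet runs on its own:

```python
import re
import tempfile
from pathlib import Path

# Starts at /archiv, stops at the first .pdf, and never crosses a double quote.
LINK_RE = re.compile(r'/archiv[^"]*?\.pdf')


def extract_links(in_path, out_path):
    """Read HTML from in_path and write one matching link per line to out_path."""
    html = Path(in_path).read_text(encoding='utf-8')
    links = LINK_RE.findall(html)
    Path(out_path).write_text(''.join(link + '\n' for link in links),
                              encoding='utf-8')
    return links


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'input.html'   # hypothetical input file
    dst = Path(tmp) / 'links.txt'    # hypothetical output file
    src.write_text(
        '<td><a href="/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf" '
        'target="_blank">Dana Macháčová</a></td>\n',
        encoding='utf-8',
    )
    found = extract_links(src, dst)
    written = dst.read_text(encoding='utf-8')
    print(written, end='')
```

With your real data you would call `extract_links` with the paths of your own input and output files instead of the temporary ones.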