Extract link from string

I am trying to find an efficient way to extract substrings from a file in Python. I have a file containing extracted HTML code:

<td><a href="/archiv/zivotopisy/2022/6/Zivotopis-OJVLA-20220624132548.pdf" target="_blank">Jitka Horáková</a></td>
                                    <td><a href="/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf" target="_blank">Bohumil Tobolka</a></td>
                                    <td><a href="/archiv/zivotopisy/2022/5/Zivotopis-UNBLA.pdf" target="_blank">Stanislava Rousová, Ing.</a></td>
                                    <td><a href="/archiv/zivotopisy/2022/4/Zivotopis-NYBCF-20220407134152.pdf" target="_blank">Ladislav Macháč</a></td>
                                    <td><a href="/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf" target="_blank">Dana Macháčová</a></td>

but I mostly fail at the extraction. My goal is another txt file containing only the clean links, e.g. "/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf", without the HTML syntax. That is, each extracted string starts with /archiv and ends with .pdf.

I tried a for-each loop and regex, but had no luck since I am a beginner. I would be happy for any advice.

CodePudding user response：

Use the urllib.parse.urlparse function to parse the URL. Here's an example:

from urllib.parse import urlparse

url_str = 'https://example.com'
url_obj = urlparse(url_str)

if not (url_obj.scheme and url_obj.netloc):  # validity check: needs a scheme and a host
  print(f'The URL {url_str} is invalid!')
else:
  print(f'The URL {url_str} is valid!')

CodePudding user response：

Using plain Python, we can do this easily without any libraries:

text = """
<td><a href="/archiv/zivotopisy/2022/6/Zivotopis-OJVLA-20220624132548.pdf" target="_blank">Jitka Horáková</a></td>
<td><a href="/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf" target="_blank">Bohumil Tobolka</a></td>
<td><a href="/archiv/zivotopisy/2022/5/Zivotopis-UNBLA.pdf" target="_blank">Stanislava Rousová, Ing.</a></td>
<td><a href="/archiv/zivotopisy/2022/4/Zivotopis-NYBCF-20220407134152.pdf" target="_blank">Ladislav Macháč</a></td>
<td><a href="/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf" target="_blank">Dana Macháčová</a></td>
"""

links = [line.split('<a href="')[1].split('"')[0] for line in text.split('\n') if '<a href="' in line]

print(links)

The output:

['/archiv/zivotopisy/2022/6/Zivotopis-OJVLA-20220624132548.pdf', '/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf', '/archiv/zivotopisy/2022/5/Zivotopis-UNBLA.pdf', '/archiv/zivotopisy/2022/4/Zivotopis-NYBCF-20220407134152.pdf', '/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf']

This splits the text on '\n' and, for each line that contains <a href=", returns the text between the quotes of that href attribute. The results are collected in a list called 'links'.
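As a rough illustration of that one-liner, here is the same logic unrolled for a single line of the sample HTML. The names `line`, `after_href`, and `link` are only illustrative and do not appear in the answer above.

```python
# One line of the sample HTML from the question.
line = '<td><a href="/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf" target="_blank">Dana Macháčová</a></td>'

# Step 1: keep everything after the opening of the href attribute.
after_href = line.split('<a href="')[1]

# Step 2: cut at the next double quote, which closes the attribute value.
link = after_href.split('"')[0]

print(link)  # /archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf
```

Note that this relies on the attribute being written exactly as `<a href="`; a line using single quotes or extra attributes before href would be skipped.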

To write the list to a file, with each link on its own line:

# The with block closes the file automatically, even if an error occurs.
with open('test.txt', 'w', encoding='utf-8') as f:
    for link in links:
        f.write(link + '\n')
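Since the question starts from a file rather than a string, the two steps could be combined roughly as follows. This is only a sketch: the helper names `extract_links` and `convert_file` are made up for illustration, and it assumes the input file is UTF-8 encoded.

```python
from pathlib import Path


def extract_links(html_text):
    """Return the href value of every line that contains '<a href="'."""
    marker = '<a href="'
    return [
        line.split(marker)[1].split('"')[0]
        for line in html_text.splitlines()
        if marker in line
    ]


def convert_file(src, dst):
    """Read HTML from src, write one link per line to dst, return the links."""
    html_text = Path(src).read_text(encoding="utf-8")  # encoding is an assumption
    links = extract_links(html_text)
    Path(dst).write_text("".join(link + "\n" for link in links), encoding="utf-8")
    return links
```

With hypothetical file names, a call would look like convert_file("input.html", "links.txt").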

CodePudding user response：

re.findall can help:

import re

t = '''

<td><a href="/archiv/zivotopisy/2022/6/Zivotopis-OJVLA-20220624132548.pdf" target="_blank">Jitka Horáková</a></td>
                                    <td><a href="/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf" target="_blank">Bohumil Tobolka</a></td>
                                    <td><a href="/archiv/zivotopisy/2022/5/Zivotopis-UNBLA.pdf" target="_blank">Stanislava Rousová, Ing.</a></td>
                                    <td><a href="/archiv/zivotopisy/2022/4/Zivotopis-NYBCF-20220407134152.pdf" target="_blank">Ladislav Macháč</a></td>
                                    <td><a href="/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf" target="_blank">Dana Macháčová</a></td>

'''

# .*? is non-greedy, so each match stops at the first ".pdf" after "/archiv"
print(re.findall(r'/archiv.*?\.pdf', t))

['/archiv/zivotopisy/2022/6/Zivotopis-OJVLA-20220624132548.pdf', '/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf', '/archiv/zivotopisy/2022/5/Zivotopis-UNBLA.pdf', '/archiv/zivotopisy/2022/4/Zivotopis-NYBCF-20220407134152.pdf', '/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf']
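The `?` after `.*` matters when several links share one line. The sketch below uses a made-up one-line `sample` to compare the greedy and non-greedy forms. The stricter third pattern is an assumption about what a link may contain (no double quotes, a literal ".pdf" ending), not something stated in the question.

```python
import re

# Hypothetical sample with two links on a single line.
sample = '<a href="/archiv/x/A.pdf">A</a> <a href="/archiv/y/B.pdf">B</a>'

# Greedy: .* runs to the LAST "pdf" on the line, swallowing both links.
greedy = re.findall(r'/archiv.*pdf', sample)

# Non-greedy: .*? stops at the FIRST "pdf" after each "/archiv".
lazy = re.findall(r'/archiv.*?pdf', sample)

# Stricter variant: no double quotes inside the match, literal ".pdf" at the end.
strict = re.findall(r'/archiv[^"]*?\.pdf', sample)

print(len(greedy))  # 1
print(lazy)         # ['/archiv/x/A.pdf', '/archiv/y/B.pdf']
print(strict)       # ['/archiv/x/A.pdf', '/archiv/y/B.pdf']
```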