How to extract the file name of a PDF link, including numbers, using regular expressions

I have the following url string:

"https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022/amtsblatt_05_20220209.pdf"

How can I use a regular expression to get the file name from a URL? I have tried:

text = "https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022/amtsblatt_05_20220209.pdf"

text = re.sub("/[^/]*$", '', text)
text

but I am receiving:

'https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022'

The desired output is:

"amtsblatt_05_20220209.pdf"

I am thankful for any advice.

CodePudding user response：

You can go with:

import re

text = "https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022/amtsblatt_05_20220209.pdf"

# The capture group keeps the leading "/" out of the result
pdf_name = re.findall(r"/([^/]*)$", text)[0]
print(pdf_name)

or simply with:

pdf_name = text.split('/')[-1]
print(pdf_name)

CodePudding user response：

If you specifically want to use a regex, then try:

re.findall(r"/(\w+\.pdf)", text)[-1]
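
One caveat with this pattern, sketched below: \w only matches letters, digits and underscores, so a file name containing a hyphen would not be found. The second URL and the more permissive pattern are made up for illustration and are not part of the question.

```python
import re

url = "https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022/amtsblatt_05_20220209.pdf"
# Hypothetical URL, only to illustrate a file name that contains a hyphen
hyphen_url = "https://example.com/docs/amtsblatt-05.pdf"

strict = r"/(\w+\.pdf)"         # pattern from this answer
permissive = r"/([^/]+\.pdf)$"  # assumption: allow anything except "/" before a trailing ".pdf"

strict_hits = re.findall(strict, url)
strict_hyphen_hits = re.findall(strict, hyphen_url)
permissive_hits = re.findall(permissive, hyphen_url)

print(strict_hits)         # matches the file name from the question
print(strict_hyphen_hits)  # empty list: "-" is not matched by \w
print(permissive_hits)     # the hyphenated name is found
```

If your file names can contain characters other than letters, digits and underscores, the permissive variant is probably the safer choice.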

CodePudding user response：

An alternative to regular expressions, which may make the intent clearer:

import urllib.parse
import pathlib

text = "https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022/amtsblatt_05_20220209.pdf"
filename = pathlib.Path(urllib.parse.urlparse(text).path).name

Or with the additional package urlpath:

import urlpath

filename = urlpath.URL(text).name
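
A possible reason to prefer URL parsing over plain string handling, shown as a rough sketch: if the link carries a query string or a fragment, splitting on "/" keeps that tail, while urlparse separates it from the path. The "?download=1#page=2" suffix is invented for this example and does not appear in the question.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical variant of the URL with a query string and fragment appended
url = (
    "https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/"
    "amtsblaetter-2022/amtsblatt_05_20220209.pdf?download=1#page=2"
)

# Plain splitting keeps everything after the last "/"
by_split = url.split("/")[-1]

# urlparse isolates the path component first; PurePosixPath takes its last part
by_parse = PurePosixPath(urlparse(url).path).name

print(by_split)  # file name plus query string and fragment
print(by_parse)  # file name only
```

PurePosixPath is used here instead of pathlib.Path only because URL paths always use forward slashes, whatever the operating system.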

As for why your approach did not work:

re.sub("/[^/]*$", '', text)

This does find your desired string, but it then substitutes it with nothing, so it removes what you have found. You probably wanted to either find the string:

>>> re.search("/[^/]*$", text).group()
'/amtsblatt_05_20220209.pdf'
# Without the leading /
>>> re.search("/([^/]*)$", text).group(1)
'amtsblatt_05_20220209.pdf'

Or you wanted to discard everything that is not the filename:

>>> re.sub(r"^.*/(?=[^/]+$)", "", text)
'amtsblatt_05_20220209.pdf'