I have the following url string:
"https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022/amtsblatt_05_20220209.pdf"
How can I use a regular expression to get the filename from a URL? I have tried:
text = "https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022/amtsblatt_05_20220209.pdf"
text = re.sub("/[^/]*$", '', text)
text
but I am receiving:
'https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022'
The desired output is:
"amtsblatt_05_20220209.pdf"
I am thankful for any advice.
CodePudding user response:
You can go with:
import re
text = "https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022/amtsblatt_05_20220209.pdf"
pdf_name = re.findall("/([^/]*)$", text)[0]  # the group leaves out the leading "/"
print(pdf_name)
or simply with:
pdf_name = text.split('/')[-1]
print(pdf_name)
CodePudding user response:
If you specifically want to use a regex, then try
re.findall(r"/(\w+\.pdf)", text)[-1]
CodePudding user response:
An alternative to regular expressions, which may make the intent clearer:
import urllib.parse
import pathlib
text = "https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022/amtsblatt_05_20220209.pdf"
filename = pathlib.Path(urllib.parse.urlparse(text).path).name
Or with the additional package urlpath:
import urlpath
filename = urlpath.URL(text).name
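One practical advantage of the urllib.parse approach is that it ignores a query string or fragment, which a plain split on "/" or the regex variants would keep as part of the result. A minimal sketch using only the standard library; the helper name filename_from_url and the "?download=1#page=2" suffix are made up for illustration and are not part of the original URL:

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring query and fragment."""
    # urlparse separates the path from "?query" and "#fragment".
    path = urlparse(url).path
    # PurePosixPath always treats "/" as the separator, on any OS.
    # unquote decodes percent-escapes such as %20 in the name.
    return unquote(PurePosixPath(path).name)


text = "https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022/amtsblatt_05_20220209.pdf"
print(filename_from_url(text))
# Hypothetical variant of the same URL with a query string and fragment.
print(filename_from_url(text + "?download=1#page=2"))
```

Both calls should print the bare filename. PurePosixPath is used here instead of Path so the result does not depend on the operating system's path rules.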
As for why your approach did not work:
re.sub("/[^/]*$", '', text)
This does match the part you want, but re.sub then replaces the match with an empty string, so it removes exactly what you were looking for. You probably wanted either to find the string
>>> re.search("/[^/]*$", text).group()
'/amtsblatt_05_20220209.pdf'
# Without the leading /
>>> re.search("/([^/]*)$", text).group(1)
'amtsblatt_05_20220209.pdf'
Or you wanted to discard everything that is not the filename
>>> re.sub(r"^.*/(?=[^/]+$)", "", text)
'amtsblatt_05_20220209.pdf'
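To see the difference between deleting the match and keeping it, the variants can be run side by side. This is only a sketch; the variable names are chosen for illustration, and the simplified pattern "^.*/" relies on ".*" being greedy, so it consumes everything up to the last slash:

```python
import re

text = "https://www.stadt-koeln.de/mediaasset/content/pdf13/amtsblatt/amtsblaetter-2022/amtsblatt_05_20220209.pdf"

# re.sub deletes whatever the pattern matches, so the directory part remains.
directory = re.sub(r"/[^/]*$", "", text)

# re.search returns the match; group(1) leaves out the leading slash.
by_search = re.search(r"/([^/]*)$", text).group(1)

# Deleting everything up to and including the last slash keeps the filename.
by_sub = re.sub(r"^.*/", "", text)

# The same result without a regex: split once, starting from the right.
by_split = text.rsplit("/", 1)[-1]

print(directory)
print(by_search, by_sub, by_split)
```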