How to extract only filenames from a URL with RegEx, even with ANSI escaped characters?

I need to extract ONLY the file names from any URL. I looked at all the previous answers on Stack Overflow regarding URLs and filenames, but none of them considers the case of a file name with escaped characters.

For example, I have a URL like this:

https://content.com/pbpython.py/notebooks/thirsty-allies.mov?file=The%20Big%20Kahuna.webm.tar.gz&f=Crosstab%20Explained.ipynb&a=b&m=plok%202001.tar.gz

I tried many regexes, and finally I found one that does not split the file names when it encounters an escaped character:

"(?:\w*:\/\/)?((?:[\w-_]*\.?)+:?\d*(?:\/?[\w-_.]+\/?)*)[\?]?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?[\&]?([\w-_]*)?=?([\/\w-_\?\)%\.\(\*\[\]\^<>\\]*)?"g

You can test it here: https://regex101.com/r/LRWlif/7

The results are a mess:

match,group,is_participating,start,end,content
1,0,yes,0,148,https://content.com/pbpython.py/notebooks/thirsty-allies.mov?file=The%20Big%20Kahuna.webm.tar.gz&f=Crosstab%20Explained.ipynb&a=b&m=plok%202001.tar.gz
1,1,yes,8,60,content.com/pbpython.py/notebooks/thirsty-allies.mov
1,2,yes,61,65,file
1,3,yes,66,94,The%20Big%20Kahuna.webm.tar.gz
1,4,yes,95,96,f
1,5,yes,97,123,Crosstab%20Explained.ipynb
1,6,yes,124,125,a
1,7,yes,126,127,b
1,8,yes,128,129,m
1,9,yes,130,148,plok%202001.tar.gz
2,0,yes,148,148,
2,1,yes,148,148,
2,2,yes,148,148,
2,3,yes,148,148,
2,4,yes,148,148,
2,5,yes,148,148,
2,6,yes,148,148,
2,7,yes,148,148,
2,8,yes,148,148,
2,9,yes,148,148,

The only good thing is that the filenames are all matched without being split, except for "thirsty-allies.mov", which is matched together with other parts of the URL.

Another issue is that not every escaped character can be part of a filename. %2F, for example, is the encoded "/" that separates folders in a path, and it should not be considered part of the match.

For example:

https://www.contoso.com/sites/marketing/documents/Shared%20Documents/Forms/AllItemA.aspx?RootFolder=%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx&FolderCTID=0x012000F2A09653197F4F4F919923797C42ADEC&View=%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D

With the same RegEx we get this result:

match,group,is_participating,start,end,content
1,0,yes,0,288,https://www.contoso.com/sites/marketing/documents/Shared%20Documents/Forms/AllItemA.aspx?RootFolder=%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx&FolderCTID=0x012000F2A09653197F4F4F919923797C42ADEC&View=%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D
1,1,yes,8,56,www.contoso.com/sites/marketing/documents/Shared
1,2,yes,56,56,
1,3,yes,56,99,%20Documents/Forms/AllItemA.aspx?RootFolder
1,4,yes,99,99,
1,5,yes,100,188,%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx
1,6,yes,189,199,FolderCTID
1,7,yes,200,240,0x012000F2A09653197F4F4F919923797C42ADEC
1,8,yes,241,245,View
1,9,yes,246,288,%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D
2,0,yes,288,288,
2,1,yes,288,288,
2,2,yes,288,288,
2,3,yes,288,288,
2,4,yes,288,288,
2,5,yes,288,288,
2,6,yes,288,288,
2,7,yes,288,288,
2,8,yes,288,288,
2,9,yes,288,288,

As you can see, the filename to match is:

PFProduct%20Promotion%202001.docx

but the RegEx matched:

%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx

How can I get just the filenames and nothing else?

CodePudding user response：

No language is tagged, but if you know that the input is always a URL, you might use:

(?<=[=\/]|%2F)(?:(?!%2F)[^?&\s\/])+\.\w+(?=[?&]|$)

Explanation

(?<= Positive lookbehind, assert that what is to the left of the current position is
- [=\/] Match either = or /
- | Or
- %2F Match literally
) Close the lookbehind
(?: Non-capture group
- (?!%2F)[^?&\s\/] Match 1 char other than those listed in the character class, if %2F is not directly to the right of the current position
)+ Close the non-capture group and repeat 1+ times
\.\w+ Match a dot and 1 or more word characters
(?=[?&]|$) Positive lookahead, assert either ? or & or the end of the string directly to the right of the current position

Regex demo

Other variations

Or use a capture group if your regex engine does not support a lookbehind whose alternatives differ in width:

(?:[=\/]|%2F)((?:(?!%2F)[^?&\s\/])+\.\w+)(?=[?&]|$)

Regex demo
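
Since no language is tagged, here is a minimal sketch of the capture-group variant in Python, chosen only as an illustration. Python's built-in re module rejects a lookbehind whose alternatives differ in width (= or / versus %2F), which is why this variant is used. The names FILENAME_RE and extract_filenames are hypothetical, and the unquote step that turns %20 back into a space is an optional addition, not part of the pattern itself.

```python
import re
from urllib.parse import unquote

# Capture-group variant: the separator (=, / or %2F) is consumed,
# and the filename itself ends up in group 1.
FILENAME_RE = re.compile(r"(?:[=/]|%2F)((?:(?!%2F)[^?&\s/])+\.\w+)(?=[?&]|$)")


def extract_filenames(url):
    """Return the filenames found in url, with percent-escapes decoded."""
    # findall returns the contents of group 1 for every match.
    return [unquote(name) for name in FILENAME_RE.findall(url)]


url = ("https://content.com/pbpython.py/notebooks/thirsty-allies.mov"
       "?file=The%20Big%20Kahuna.webm.tar.gz&f=Crosstab%20Explained.ipynb"
       "&a=b&m=plok%202001.tar.gz")

print(extract_filenames(url))
# ['thirsty-allies.mov', 'The Big Kahuna.webm.tar.gz',
#  'Crosstab Explained.ipynb', 'plok 2001.tar.gz']
```

Note that the pattern only recognizes an uppercase %2F; if the input may contain %2f, compiling with re.IGNORECASE is one way to cover it.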

In languages where an infinite quantifier in the lookbehind is supported:

(?<=https?:\/\/\S*(?:[=\/]|%2F))(?:(?!%2F)[^?&\s\/])+\.\w+(?=[?&]|$)

Regex demo
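
If a URL parser is available, the same filenames can also be collected without a regex. The following is only a sketch of that different technique, assuming Python and its standard urllib.parse module; the function name filenames_from_url and the "contains a dot" test for what counts as a filename are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlsplit, parse_qsl, unquote


def filenames_from_url(url):
    """Return decoded basenames from the path and query values of url."""
    parts = urlsplit(url)
    # Candidates: the decoded URL path plus every decoded query value.
    candidates = [unquote(parts.path)]
    candidates += [value for _, value in parse_qsl(parts.query)]

    names = []
    for candidate in candidates:
        # After decoding, %2F has become "/", so basename drops the folders.
        base = posixpath.basename(candidate)
        if "." in base:  # crude heuristic: treat "name.ext" as a filename
            names.append(base)
    return names


url = ("https://www.contoso.com/sites/marketing/documents/Shared%20Documents/Forms/AllItemA.aspx"
       "?RootFolder=%2Fsites%2Fmarketing%2Fdocuments%2FShared%20Documents%2FPFProduct%20Promotion%202001.docx"
       "&FolderCTID=0x012000F2A09653197F4F4F919923797C42ADEC"
       "&View=%7BCD527605-9A7A-448D-9A35-67A33EF9F766%7D")

print(filenames_from_url(url))
# ['AllItemA.aspx', 'PFProduct Promotion 2001.docx']
```

Unlike the regex, this decodes the values first, so the returned names contain real spaces rather than %20.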