Hello everyone. I'm working on a list of URLs from which I need to grep only the file names that end with .asp or .aspx, without any duplicates. My idea is to remove everything before the last / and everything after .asp
I tried this regex, which removes everything before the last /
([^\/]+$)
e.g.
abc/abc/abc/xyz.asp
>> xyz.asp
But if there is a / after .asp, it starts selecting after that /
abc/abc/abc/xyz.asp?ijk=lmn/opq
>> opq
which I do not want.
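To make the failure concrete, here is a minimal sketch that runs the same pattern on both strings. Python is only an assumption for illustration (the question does not say which tool runs the regex), and the name last_segment is made up for this example.

```python
import re

# Pattern from the question: everything after the last slash.
# (The name last_segment is hypothetical, used only in this sketch.)
last_segment = re.compile(r"[^/]+$")

plain = "abc/abc/abc/xyz.asp"
with_query = "abc/abc/abc/xyz.asp?ijk=lmn/opq"

print(last_segment.search(plain).group())       # xyz.asp
print(last_segment.search(with_query).group())  # opq
```

The pattern is anchored to the end of the string, so any / inside the query string moves the match away from the file name.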
I want to grep only the strings that contain .asp or .aspx, and remove every character before the last / and after the extension.
In simple words, I want to grep only filename.asp or filename.aspx
Sample Input
https://www.redacted.com/abc/xyz.aspx?something=something
Sample output:
xyz.aspx
Sample input:
https://www.redacted.com/abc/xyz/file.aspx?z=x&LOC=http://www.redacted.com/asp/anotherfile-asp/_/CRID--7/thirdfile.asp?ui=hash
sample output:
file.aspx, anotherfile-asp, thirdfile.asp
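The requirement above (every matching name, no duplicates) could be sketched roughly as follows. This is only one possible reading of the samples: the pattern ASP_NAME and the function asp_names are hypothetical names, and the sketch assumes that names never contain /, ?, & or =, and that -asp should count because anotherfile-asp appears in the sample output.

```python
import re

# Assumed pattern: a run of characters that are not URL delimiters, ending in
# ".asp", ".aspx", "-asp" or "-aspx", and not followed by another word
# character or hyphen (so "my.aspect" is not reported).
ASP_NAME = re.compile(r"[^/?&=]*[.-]aspx?(?![\w-])")


def asp_names(urls):
    """Return the matching names from all URLs, first occurrence only."""
    names = []
    for url in urls:
        names.extend(ASP_NAME.findall(url))
    # dict.fromkeys drops duplicates while keeping the original order.
    return list(dict.fromkeys(names))


s1 = "https://www.redacted.com/abc/xyz.aspx?something=something"
s2 = ("https://www.redacted.com/abc/xyz/file.aspx?z=x&LOC=http://www.redacted.com"
      "/asp/anotherfile-asp/_/CRID--7/thirdfile.asp?ui=hash")

print(asp_names([s1, s2, s1]))
# ['xyz.aspx', 'file.aspx', 'anotherfile-asp', 'thirdfile.asp']
```

Passing s1 twice shows the deduplication: xyz.aspx is reported once.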
CodePudding user response:
With your shown samples, in GNU awk you could try the following. It sets RS to a regex and then uses RT (the text that matched RS) together with the split function.
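awk -v RS='[^.]*[-\\.]aspx?' '
RT{
  # RT holds the text that matched RS; split it into path parts.
  num=split(RT,arr,"[//]")
  for(i=1;i<=num;i++){
    # Keep only the parts that contain -asp or .asp
    if(arr[i]~/[-.]asp/){
      print arr[i]
    }
  }
}
' Input_file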
If your file contains both of the lines shown in your question, then the sample output will be as follows:
xyz.aspx
file.aspx
anotherfile-asp
thirdfile.asp
Explanation: RS (the record separator) is set to the regex [^.]*[-\\.]aspx? for the whole Input_file. In the main program, the text that matched RS (available in RT) is split on / and any part that contains -asp or .asp is printed, as shown in the sample output above.
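For readers without GNU awk, the same idea can be approximated in Python. This is a rough sketch, not a translation of awk's exact matching rules: awk picks the leftmost-longest match while Python's re is greedy with backtracking, which gives the same result on the sample URLs. The names CHUNK and names_from_chunks are hypothetical.

```python
import re

# Plays the role of RS/RT: a dot-free run of characters followed by
# "-asp", ".asp", "-aspx" or ".aspx".
CHUNK = re.compile(r"[^.]*[-.]aspx?")


def names_from_chunks(text):
    """Split every matched chunk on '/' and keep the parts containing -asp or .asp."""
    found = []
    for chunk in CHUNK.finditer(text):
        for part in chunk.group().split("/"):
            if re.search(r"[-.]asp", part):
                found.append(part)
    return found


s1 = "https://www.redacted.com/abc/xyz.aspx?something=something"
s2 = ("https://www.redacted.com/abc/xyz/file.aspx?z=x&LOC=http://www.redacted.com"
      "/asp/anotherfile-asp/_/CRID--7/thirdfile.asp?ui=hash")

print(names_from_chunks(s1))  # ['xyz.aspx']
print(names_from_chunks(s2))  # ['file.aspx', 'anotherfile-asp', 'thirdfile.asp']
```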
CodePudding user response:
This is Python, but the regex should work elsewhere.
import re
s1 = "https://www.redacted.com/abc/xyz.aspx?something=something"
s2 = "https://www.redacted.com/abc/xyz/file.aspx?z=x&LOC=http://www.redacted.com/asp/anotherfile-asp/_/CRID--7/thirdfile.asp?ui=hash"
# We want the set of things that is not a slash, until we get to .asp or
# .aspx, followed either by ? or end of string.
name = r"[^/]*\.aspx?((?=\?)|$)"
for s in s1, s2:
    # re.search returns only the first match in each string.
    print(re.search(name, s).group())
Output:
xyz.aspx
file.aspx
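Note that re.search stops at the first match, which is why thirdfile.asp does not appear in the output above. A hedged sketch that collects every match with re.finditer, reusing the same pattern (the name all_matches is made up for this example):

```python
import re

s2 = ("https://www.redacted.com/abc/xyz/file.aspx?z=x&LOC=http://www.redacted.com"
      "/asp/anotherfile-asp/_/CRID--7/thirdfile.asp?ui=hash")

# Same pattern as above: non-slash characters up to .asp or .aspx,
# followed by ? or the end of the string.
name = r"[^/]*\.aspx?((?=\?)|$)"


def all_matches(pattern, text):
    """Return the full text of every match, not just the first one."""
    return [m.group() for m in re.finditer(pattern, text)]


print(all_matches(name, s2))  # ['file.aspx', 'thirdfile.asp']
```

The pattern requires a literal dot before asp, so anotherfile-asp from the expected output is still not reported; matching it would need [-.] in place of \. in the pattern.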
CodePudding user response:
Another option is to use awk and first split on the URL-encoded parts that should not be part of the result.
Then, from all the parts, match only the strings that do not contain / and that end in asp with an optional x, preceded by either - or .
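awk '
{
  # Split the line on runs of URL-encoded sequences such as %3A%2F%2F
  n = split($0, a, /(%[A-F0-9]+)+/)
  for (i = 1; i <= n; i++) {
    # Print the first slash-free run ending in .asp(x) or -asp(x) in each part
    if (match(a[i], /[^/]+[.-]aspx?/)) {
      print substr(a[i], RSTART, RLENGTH)
    }
  }
}
' file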
Output
file.aspx
anotherfile-asp
thirdfile.asp
xyz.aspx
If grep -P is supported, you might also use a Perl-compatible regular expression that skips the URL encoded parts:
grep -oP "(?:%[A-F0-9]+)+(*SKIP)(*F)|(?:(?!%[A-F0-9])[^/])*[-.]aspx?" file
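The split-then-match idea can also be sketched in Python. The input line below is an assumption: it is a percent-encoded variant of the second sample URL, made up here to show what the split does. ENCODED, ASP and asp_parts are hypothetical names.

```python
import re

# Runs of URL-encoded sequences, e.g. %3A%2F%2F (uppercase hex only, as in the awk code).
ENCODED = re.compile(r"(?:%[A-F0-9]+)+")
# A slash-free run ending in .asp, .aspx, -asp or -aspx.
ASP = re.compile(r"[^/]+[.-]aspx?")


def asp_parts(line):
    """Split on URL-encoded runs, then keep the first match in each part."""
    found = []
    for part in ENCODED.split(line):
        m = ASP.search(part)
        if m:
            found.append(m.group())
    return found


# Hypothetical input: the nested URL is assumed to be percent-encoded.
line = ("https://www.redacted.com/abc/xyz/file.aspx?z=x&LOC=http%3A%2F%2Fwww.redacted.com"
        "%2Fasp%2Fanotherfile-asp%2F_%2FCRID--7%2Fthirdfile.asp%3Fui=hash")

print(asp_parts(line))  # ['file.aspx', 'anotherfile-asp', 'thirdfile.asp']
```

Because only the first match in each part is kept, this approach relies on the encoded sequences to separate the names; a line where the nested URL is not encoded would yield just the first name.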