Grep filenames with a given extension from a list of URLs using regex

Hello everyone. I'm working on a list of URLs from which I need to grep only the file names that end with .asp or .aspx, with no duplicates. The approach I came across is to remove everything before the last / and everything after .asp.

I tried this regex which removes everything before the last /

([^\/]+$)

e.g.

abc/abc/abc/xyz.asp >> xyz.asp

But if there is a / after .asp it starts selecting after that /

abc/abc/abc/xyz.asp?ijk=lmn/opq >> opq, which I do not want
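
The behaviour described above can be reproduced with a minimal Python sketch. The variable names are only illustrative; the pattern is the one from the question. Because the pattern is anchored to the end of the string, any / inside the query string becomes the new "last slash".

```python
import re

# Pattern from the question: the run of non-slash characters at the end of the string.
last_segment = re.compile(r"[^/]+$")

plain = "abc/abc/abc/xyz.asp"
with_query = "abc/abc/abc/xyz.asp?ijk=lmn/opq"

print(last_segment.search(plain).group())       # xyz.asp
print(last_segment.search(with_query).group())  # opq
```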

I want to grep only strings that contain .asp or .aspx, and remove every character before the / that precedes the file name as well as everything after the extension.

In simple words, I want to grep only filename.asp or filename.aspx.

Sample Input https://www.redacted.com/abc/xyz.aspx?something=something

Sample output:

xyz.aspx

Sample input: https://www.redacted.com/abc/xyz/file.aspx?z=x&LOC=http://www.redacted.com/asp/anotherfile-asp/_/CRID--7/thirdfile.asp?ui=hash

sample output:

file.aspx, anotherfile-asp, thirdfile.asp

CodePudding user response：

With your shown samples, in GNU awk you could try the following, which sets RS (the record separator) to a regex and then works on RT, the GNU awk variable holding the text that matched RS.

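awk -v RS='[^.]*[-\\.]aspx?' '
RT{
  # RT is the text matched by RS; split it into path parts on slashes
  num=split(RT,arr,"/")
  for(i=1;i<=num;i++){
    # keep only the parts that contain -asp or .asp
    if(arr[i]~/[-.]asp/){
      print arr[i]
    }
  }
}
' Input_file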

If your file contains both lines shown in your question, the sample output will be as follows:

xyz.aspx
file.aspx
anotherfile-asp
thirdfile.asp

Explanation: RS (the record separator) is set to the regex [^.]*[-\\.]aspx? for the whole Input_file, so each separator match (available in RT) is a dot-free stretch of the URL that ends in -asp, .asp, -aspx or .aspx. The main block then splits RT on / and prints every part that contains -asp or .asp, as shown in the sample output above.
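
For readers without GNU awk, the same two-step idea (match the chunk, then split it on /) can be sketched in Python. This is only an approximation of the awk answer, not a replacement for it: the function name asp_names is hypothetical, and Python's backtracking regex engine is assumed to pick the same chunks as awk does for the sample lines.

```python
import re

# Same pattern the awk answer uses as RS: a dot-free run ending in -asp/.asp/-aspx/.aspx.
chunk_re = re.compile(r"[^.]*[-.]aspx?")

def asp_names(line):
    """Return the path parts of `line` that contain -asp or .asp."""
    names = []
    for chunk in chunk_re.findall(line):
        # Each chunk plays the role of RT in the awk program.
        for part in chunk.split("/"):
            if re.search(r"[-.]asp", part):
                names.append(part)
    return names

line1 = "https://www.redacted.com/abc/xyz.aspx?something=something"
line2 = ("https://www.redacted.com/abc/xyz/file.aspx?z=x&LOC=http://www.redacted.com"
         "/asp/anotherfile-asp/_/CRID--7/thirdfile.asp?ui=hash")

for line in (line1, line2):
    for name in asp_names(line):
        print(name)
```

On the two sample lines this prints the same four names as the awk program.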

CodePudding user response：

This is Python, but the regex should work elsewhere.

import re

s1 = "https://www.redacted.com/abc/xyz.aspx?something=something"
s2 = "https://www.redacted.com/abc/xyz/file.aspx?z=x&LOC=http://www.redacted.com/asp/anotherfile-asp/_/CRID--7/thirdfile.asp?ui=hash"

# We want a run of non-slash characters up to .asp or .aspx,
# followed either by ? or by the end of the string.

name = r"[^/]*\.aspx?((?=\?)|$)"

for s in s1,s2:
    print( re.search( name, s ).group() )

Output (re.search returns only the first match in each string):

xyz.aspx
file.aspx
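
If every match per URL is needed and duplicates should be dropped, as the question asks, one possible extension is to iterate with re.finditer and collect names in insertion order. This is a sketch under assumptions: the helper unique_names and the third URL (a made-up duplicate) are not from the original post. Note that this pattern requires a literal dot, so anotherfile-asp is not picked up.

```python
import re

# Same pattern as in the answer above.
name_re = re.compile(r"[^/]*\.aspx?((?=\?)|$)")

urls = [
    "https://www.redacted.com/abc/xyz.aspx?something=something",
    "https://www.redacted.com/abc/xyz/file.aspx?z=x&LOC=http://www.redacted.com"
    "/asp/anotherfile-asp/_/CRID--7/thirdfile.asp?ui=hash",
    "https://www.redacted.com/abc/xyz.aspx?other=1",  # hypothetical duplicate
]

def unique_names(lines):
    """Collect every match from every line, keeping first-seen order without duplicates."""
    seen = {}
    for line in lines:
        for m in name_re.finditer(line):
            seen.setdefault(m.group(), None)
    return list(seen)

print(unique_names(urls))  # ['xyz.aspx', 'file.aspx', 'thirdfile.asp']
```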

CodePudding user response：

Another option is to use awk and first split each line on the URL-encoded parts, which should not be part of the result.

Then, from the resulting parts, match only the strings that do not contain / and that end in asp with an optional x, preceded by either - or .

awk '
{
  n = split($0, a, /(%[A-F0-9]+)+/)
  for (i=1; i <= n; i++) {
    if (match(a[i], /[^/]+[.-]aspx?/)){
      print substr(a[i], RSTART, RLENGTH)
    }
  }
}
' file

Output

file.aspx
anotherfile-asp
thirdfile.asp
xyz.aspx
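
The split-then-match idea can also be sketched in Python. The input below is an assumption: it is the second sample URL with the embedded URL percent-encoded, which is the form this approach expects, and the function name names_from_encoded is hypothetical.

```python
import re

# Runs of percent-encoded characters act as separators, as in the awk split().
percent_run = re.compile(r"(?:%[A-F0-9]+)+")
# A slash-free name ending in -asp/.asp/-aspx/.aspx.
name_re = re.compile(r"[^/]+[.-]aspx?")

def names_from_encoded(line):
    """Split on percent-encoded runs, then take the first name match in each part."""
    names = []
    for part in percent_run.split(line):
        m = name_re.search(part)
        if m:
            names.append(m.group())
    return names

# Hypothetical percent-encoded form of the second sample input.
encoded = ("https://www.redacted.com/abc/xyz/file.aspx?z=x&LOC=http%3A%2F%2Fwww.redacted.com"
           "%2Fasp%2Fanotherfile-asp%2F_%2FCRID--7%2Fthirdfile.asp%3Fui%3Dhash")

print(names_from_encoded(encoded))  # ['file.aspx', 'anotherfile-asp', 'thirdfile.asp']
```

As in the awk version, only the first match in each part is reported, so two names inside the same unencoded stretch would yield just the first one.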

If grep -P is supported, you might also use a Perl-compatible regular expression that skips the URL encoded parts:

grep -oP "(?:%[A-F0-9]+)+(*SKIP)(*F)|(?:(?!%[A-F0-9])[^/])*[-.]aspx?" file
