Extract URLs from HTML and JS source, with multiple on a line

I'd like to list all the domains our source code refers to, accepting the limitation that this will only find domains that are statically referenced and begin with https?://. For example, I've tried the following:

find -s [^.]* -print0 | xargs -0 sed -En 's/.*https?:\/\/([a-z0-9\-\.\_]+).*/\1/p' | sort | uniq

The bug is that when there's more than one domain on a single line, only one is returned. Can this be remedied with simple shell tools, i.e. without fully parsing the HTML?

CodePudding user response:

The regular expression .* is greedy, so putting it at both the beginning and end of your pattern makes the substitution consume the entire line: the leading .* swallows everything up to the last https?:// on the line, so every URL except the last one is discarded.
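
You can see this with a throwaway line containing two URLs (example.com and example.org are just placeholder domains); only the last URL survives the substitution:

echo 'see https://example.com and https://example.org today' | sed -En 's/.*https?:\/\/([a-z0-9\-\.\_]+).*/\1/p'
example.org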

Standard grep is unable to print capture groups like ([a-z0-9\-._]+), but if you have perl, replace this:

sed -En 's/.*https?:\/\/([a-z0-9\-\.\_]+).*/\1/p'

with this:

perl -nle 'print $1 while m{https?://([a-z0-9\-._]+)}g'
  • https?://([a-z0-9\-._]+) is our new regular expression; it matches only what we're looking for and leaves each line intact.
  • while m{...}g iterates through every match of the regex on the line.
  • print $1 prints the first capture group of the regex, ([a-z0-9\-._]+), as shown below.
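
Running the same two-URL test line through the perl version prints both domains:

echo 'see https://example.com and https://example.org today' | perl -nle 'print $1 while m{https?://([a-z0-9\-._]+)}g'
example.com
example.org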

Your final command will be:

find -s [^.]* -print0 | xargs -0 perl -nle 'print $1 while m{https?://([a-z0-9\-._]+)}g' | sort | uniq
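
Note that -s (sorted traversal) is a BSD/macOS extension to find, and [^.]* is a shell glob that expands to the non-hidden entries of the current directory. With GNU find, a rough equivalent that prunes hidden paths instead would be:

find . -path '*/.*' -prune -o -type f -print0 | xargs -0 perl -nle 'print $1 while m{https?://([a-z0-9\-._]+)}g' | sort -u

(sort -u is a shorthand for sort | uniq.)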