Get a list of URLs from large document

Time:10-01

I am trying to get a list of URLs containing "https://www.crocodiletrading.co.uk/" from an HTM file. I also need anything that comes after the main URL, for example /blog/name-of-blog, etc.

I am using Notepad and regex to try to accomplish this, but I am struggling. I don't really understand regex.

I've tried something like this: .*?(https\:\/\/www\.[a-zA-Z0-9\.\/\-]+)

Can anyone let me know how I can accomplish this?

I'm getting a list of the URLs that have been flagged as broken so I can then use this to set up 301 redirects.

Here is the HTML FILE if anyone wants to take a look.

Thanks in advance.

CodePudding user response:

This function prints the href of every anchor tag (<a href="link to some page"></a>) in the document:

const getAllLinks = () => {
    // Select every <a> element on the page and log its resolved href
    const links = document.querySelectorAll("a");
    links.forEach(link => {
        console.log(link.href);
    });
};
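If you are working with the raw HTM file outside a browser (e.g. in Node, where there is no `document`), a regex much like the one attempted in the question can extract the same URLs, paths included. This is only a sketch: the `html` string below is placeholder markup standing in for the real file's contents.

```javascript
// Sketch: extract crocodiletrading.co.uk URLs (including trailing paths)
// from raw HTML using a regex, runnable in Node without a DOM.
const html = `
  <a href="https://www.crocodiletrading.co.uk/">Home</a>
  <a href="https://www.crocodiletrading.co.uk/blog/name-of-blog">Blog</a>
  <a href="https://example.com/other">Other</a>
`; // placeholder markup; substitute the actual file's contents

// Anchor on the domain, then allow the same character class the question used
// ([a-zA-Z0-9./-]) to capture whatever path follows it.
const pattern = /https:\/\/www\.crocodiletrading\.co\.uk[a-zA-Z0-9.\/-]*/g;
const urls = html.match(pattern) || [];
console.log(urls);
```

The key fix over the attempt in the question is the quantifier after the character class: without `*` (or `+`), the regex matches at most one character after the domain, so the /blog/... paths are lost.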

CodePudding user response:

Here is what I ended up doing instead, using good old jQuery to grab the URLs that contained crocodiletrading.co.uk

jQuery( document ).ready( function() {
    var arr = [];
    var i = 0;

    // Select only anchors whose href contains the target domain
    jQuery('a[href*="crocodiletrading.co.uk"]').each(function() {
        arr[i++] = jQuery(this).attr('href');
    });

    var list = '<ul class="myList"><li class="ui-menu-item" role="menuitem"><a class="ui-all" tabindex="-1">' + arr.join('</a></li><li>') + '</li></ul>';
    console.log(list);
});
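One caveat with the join above: only the first list item ends up wrapped in an `<a>` tag, since the join separator closes the anchor but never reopens it, and duplicate hrefs are kept. A sketch of a variant that dedupes with a `Set` and wraps every item consistently (the sample `hrefs` array is made up for illustration):

```javascript
// Sketch: dedupe collected hrefs and build a well-formed <ul>,
// wrapping each URL in its own <a> element.
const hrefs = [
  "https://www.crocodiletrading.co.uk/",
  "https://www.crocodiletrading.co.uk/blog/name-of-blog",
  "https://www.crocodiletrading.co.uk/", // duplicate, dropped by the Set
];

const unique = [...new Set(hrefs)];
const list =
  '<ul class="myList">' +
  unique.map(u => '<li><a tabindex="-1">' + u + '</a></li>').join('') +
  '</ul>';
console.log(list);
```

Deduping matters here because the same broken URL often appears in several anchors, and you only want one 301 rule per URL.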