This is the input sample:
<!DOCTYPE html>
<meta property="og:image" content="http://www.mydomainname.com/pathout_themes/pathology_outlines/images/logo_for_social.jpg" />
<meta property="og:image" content="www.mydomainname.com/pathout_themes/pathology_outlines/images/logo_for_social.jpg" />
<meta property="og:image" content="mydomainname.com/pathout_themes/pathology_outlines/images/logo_for_social.jpg" />
<meta property="og:image" content="/pathout_themes/pathology_outlines/images/logo_for_social.jpg" />
<meta property="og:image" content="https://mydomainname.com/pathout_themes/pathology_outlines/images/logo_for_social.jpg" />
<meta property="og:image" content="http://mydomainname.com/pathout_themes/pathology_outlines/images/logo_for_social.jpg" />
<link rel="preconnect" href="https://cdn.ncbi.nlm.nih.gov">
<link rel="preconnect" href="https://www.ncbi.nlm.nih.gov">
<link rel="preconnect" href="https://www.google-analytics.com">
<link rel="stylesheet" href="https://cdn.ncbi.nlm.nih.gov/pubmed/c6188713-b612-4503-95e5-39fd91b4d5be/CACHE/css/output.35e8b192ea09.css" type="text/css">
<link rel="stylesheet" href="https://cdn.ncbi.nlm.nih.gov/pubmed/c6188713-b612-4503-95e5-39fd91b4d5be/CACHE/css/output.452c70ce66f7.css" type="text/css">
<meta property="og:image" content="https://www.mydomainname.com/pathout_themes/pathology_outlines/images/logo_for_social.jpg" />
<meta property="og:title" content="MyDomainName.com"/>
<meta property="og:url" content="https://www.mydomainname.com/"/>
<meta property="og:description" content="MyDomainName.com, free, updated outline surgical domain clinical gjkjjhkl lkjkhkj jobs, conferences, fellowships, books" />
<link rel="apple-touch-icon" href="https://www.mydomainname.com/apple-touch-icon.png">
<link rel="icon" sizes="180x180" href="https://www.mydomainname.com/apple-touch-icon.png">
<link rel="shortcut icon" href="https://www.mydomainname.com/pathout_themes/pathology_outlines/favicon.ico" />
<!--<link href="https://www.mydomainname.com/pathout_themes/pathology_outlines/css/reset.css" rel="stylesheet" type="text/css" />-->
<link rel="stylesheet" href="https://www.mydomainname.com/pathout_core/plugins/font-awesome-4.6.3/css/font-awesome.min.css" type="text/css" /><!-- font awesome -->
<!--<link href='http://fonts.googleapis.com/css?family=Lato:400,700,700italic,400italic,900,900italic' rel='stylesheet' type='text/css'>-->
<link href="https://www.mydomainname.com/pathout_core/plugins/lightbox/css/lightbox.css" rel="stylesheet" />
<link href="https://www.mydomainname.com/pathout_themes/pathology_outlines/css/main.css?ver=1.3.4" rel="stylesheet" />
<!--<script id="Cookiebot" src="https://consent.cookiebot.com/uc.js" data-cbid="8539fe38-38ce-448a-beb8-edb691018d2d" data-blockingmode="auto" type="text/javascript"></script>-->
<img src="https://www.mydomainname.com/pathout_themes/pathology_outlines/images/ntbg3.png" width="1" height="1" alt="Image 01" />
<img src="https://www.mydomainname.com/pathout_themes/pathology_outlines/images/loader-sprite.png" width="1" height="1" alt="Image 02" />
<script async defer src="https://scripts.simpleanalyticscdn.com/latest.js"></script>
<noscript><img src="https://queue.simpleanalyticscdn.com/noscript.gif" alt="" referrerpolicy="no-referrer-when-downgrade" /></noscript>
<!--<script async defer src="https://scripts.simpleanalyticscdn.com/auto-events.js"></script>-->
<script async src="https://scripts.simpleanalyticscdn.com/auto-events.js" /></body>
</html>
It is a page html source code modified by myself in order to have more links.
I want to match in Python using regex module all the URLs that start mydomainname.com but also those that don't have domain name specified like /pathout_themes/pathology_outlines/images/logo_for_social.jpg
I started from this:
(https?:\/\/)|(www\.)?[-a-zA-Z0-9@:%._\ ~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\ .~#?&//=]*)
that maches all urls (no matter the domain name) but it doesn't match those URLs without domain name like the one given example above. This one is doing basically the same thing:
(?:(http|ftp|https):\/\/)?[\w-] (\.[\w-] ) ([\w.,@?^=%&;:\/~ #-]*[\w@?^=%&;\/~ #-])?
Then I was trying this:
(https://www.mydomainname.com/)|(w-)(/)?
but it is not matching full URLs, it matches only the domain name. And also it doesn't match urls without domain name specifed like that one above.
I need regex to match all URLs with and without domain name specified too.
I have already read the following topics:
Need a regex to match URL (without http, https, ftp)
help rewriting this url. simple but not working
Unfortunately none of them is what I need.
CodePudding user response:
I think you're on the right track. I basically used your original regex to match the patterns including the domain name. For those with a relative URL, I chose to require quotes around the pathname and require that the pathname start with /
. You'll have to look at your data to decide if those restrictions make sense for you or not. For example, you can remove the "quote" groups, but you will match any content like /jk
in your input.
import re
domainname_re = '(?P<match>(http|https|ftp)://www.mydomainname.com[a-zA-Z0-9@:%\._\ \/\&~##
=]*)'
nondomainname_re = '(?P<quote>[\'"])(?P<match2>\/[a-zA-Z0-9@:%\._\ ~#=\/\&]*)(?P=quote)'
all_re = '(' domainnname_re '|' nondomainname_re ')'
# to test on your test input:
for match in re.finditer(all_re, testinput):
matchdict = match.groupdict()
print(matchdict['match'] or matchdict['match2'])
To be clear, the final pattern I used is:
'((?P<match>(http|https|ftp)://www.mydomainname.com[a-zA-Z0-9@:%\._\ \/\&~#=]*)|(?P<quote>['"])(?P<match2>\/[a-zA-Z0-9@:%\._\ ~#=\/\&]*)(?P=quote))'
Hopefully that helps you find your next step!