I’m trying to clean up our SEO by catching non-canonical URLs that are being indexed by Google.
Here is a sample of one of our non-canonical URLs:
https://www.umpqua.edu/184-about-ucc/facts-visitor-info?start=1
I can catch it with the regex below in the .htaccess file, but it also disables other URLs that I want to keep working. It catches any URL containing /NUMBER-, where the number is two or three digits long.
/([0-9]{2,3})-
So I'm trying to make it more specific. I have tried the following without success. My hope is to catch URLs containing edu/NUMBER-
(edu)/([0-9]{2,3})-
I have also tried
(edu/)([0-9]{2,3})-
Here is my full .htaccess entry:
RewriteCond %{REQUEST_URI} ^(edu)/([0-9]{2,3})-$
RewriteRule .* index.php [G]
CodePudding user response:
adding "edu" is just me trying to make the regex more selective. When I was using this expression
/([0-9]{2,3})-
it worked well, except that it also matched this URL: /component/weblinks/weblink/239-external-links/…
but it should not have.
The significant thing about edu is that it comes immediately before the start of the URL-path. (It is not part of the URL-path itself, though; it is the end of the Host header.) In that case, just anchor the regex to the start of the URL-path. For example:
RewriteRule ^\d{2,3}- - [G]
This needs to go near the top of the root .htaccess file.
\d is just shorthand for [0-9]. Note that there are 3 arguments in the above directive, separated by spaces:
^\d{2,3}- ... The pattern that matches against the URL-path
- ... The substitution string (in this case a single hyphen)
[G] ... The flags. In this case G for "gone" (short for R=410).
The above will serve a "410 Gone" for any URL-path that starts with 2 or 3 digits followed by a hyphen. The single hyphen in the substitution string explicitly indicates "no substitution". Using index.php here is superfluous, since it is ignored.
Note that there is no slash prefix on the URL-path matched by the RewriteRule pattern when used in .htaccess.
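For contrast, here is a sketch of how the same rule would look if it were ever moved out of .htaccess and into the main server config or a <VirtualHost> block. That move is purely an assumption for illustration; in that context the matched URL-path keeps its leading slash, so the pattern has to include it:

```apache
# Server or <VirtualHost> context only (hypothetical placement, not .htaccess):
# here the URL-path starts with a slash, so the pattern must match it.
RewriteRule ^/\d{2,3}- - [G]
```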
You do not need a separate condition (RewriteCond directive); the comparison can be performed more easily and efficiently in the RewriteRule directive itself.
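If a condition were wanted anyway, a rough equivalent might look like the sketch below. It is shown for comparison only. Note that REQUEST_URI holds the full URL-path including the leading slash, never the hostname, and that the regex has no trailing $, since the URL continues after the hyphen:

```apache
# Comparison only: the same check written as a condition.
# REQUEST_URI is e.g. /184-about-ucc/facts-visitor-info (leading slash, no hostname).
RewriteCond %{REQUEST_URI} ^/\d{2,3}-
RewriteRule ^ - [G]
```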
So the above will block /184-about-ucc/facts-visitor-info?start=1 but not /component/weblinks/weblink/239-external-links/..., since the 3 digits in the second URL do not occur at the start of the URL-path.
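Putting this together, the top of the root .htaccess might look like the following sketch. It assumes mod_rewrite is available and that the site's own rewrite rules come afterwards; if the file already contains RewriteEngine On, that line does not need repeating:

```apache
RewriteEngine On

# Serve "410 Gone" for any URL-path starting with 2 or 3 digits and a hyphen,
# e.g. /184-about-ucc/facts-visitor-info?start=1
RewriteRule ^\d{2,3}- - [G]

# ... the site's existing rewrite rules (assumed) follow here ...
```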