I am trying to block certain user agents from accessing the search pages, mostly bots and crawlers, as they end up increasing the CPU usage.
Using the htaccess rewrite engine, of course. I currently have this (I have been trying a lot of different combinations of rules that I found on SO and other places):
# Block user agents
ErrorDocument 503 "Site temporarily disabled for crawling"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*(bots).*$ [NC]
# RewriteCond %{QUERY_STRING} ^s=(.*)$
# RewriteCond /shop(?:\.php)?\?s=([^\s&]+)
RewriteCond %{QUERY_STRING} !(^|&)s=*
RewriteCond %{REQUEST_URI} !^/robots.txt$
# RewriteRule ^shop*$ /? [R=503,L]
RewriteRule ^shop$ ./$1 [R=503,L]
Sorry about the many commented-out lines - as I mentioned, I have been trying a lot of different things, but it appears that htaccess rewrite rules are not my cup of tea.
What I want to do is: if the user agent contains "bot", return a 503 error. The conditions are:
- User agent contains "bots" - this part is working fine, I tested it
- If there is an s query string, with anything in it.
- It's not a robots.txt URL (at this point I think I should remove it, not even needed)
- Finally, if the above matches, redirect /shop/?s= or /shop?s= to root and serve the 503 error document.
CodePudding user response:
With your shown samples/attempts, please try the following htaccess rules.
Please make sure to clear your browser cache before testing your URLs.
# Block user agents
ErrorDocument 503 "Site temporarily disabled for crawling"
RewriteEngine On
##1st condition here(User agent contains "bots")....
RewriteCond %{HTTP_USER_AGENT} ^.*(bots).*$ [NC]
RewriteRule ^ - [R=503,L]
##2nd condition here(it's not a robots.txt request)...
RewriteCond %{THE_REQUEST} !\s.*robots\.txt\s [NC]
RewriteRule ^ - [R=503,L]
##3rd condition here(query string contains s in it)...
RewriteCond %{THE_REQUEST} \s.*\?(.*s.*)\s [NC]
RewriteRule ^ - [R=503,L]
##4th condition here(match /shop/?s= or /shop?s= and get 503 in those requests)...
RewriteCond %{THE_REQUEST} \s/shop/?\?s=.*\s [NC]
RewriteRule ^ - [R=503,L]
CodePudding user response:
Since you clearly defined the criteria by which to decide, it is straightforward to implement them. I understand your question to mean that all of those criteria have to be fulfilled ...
RewriteEngine On
RewriteCond %{ENV:REDIRECT_STATUS} !=503
RewriteCond %{HTTP_USER_AGENT} bots [NC]
RewriteCond %{QUERY_STRING} (?:^|&)s=[^&]+(?:&|$)
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule ^/?shop/?$ - [R=503,L]
Not sure why you test for "bots" instead of "bot", though (your question contradicts itself in that respect).
CodePudding user response:
- Finally, if the above matches, redirect /shop/?s= or /shop?s= to root and serve the 503 error document.
You can't "redirect to root" and "serve 503 error document". A redirect is a 3xx response. You could internally rewrite the request to root and send a 503 HTTP response status by defining ErrorDocument 503 /
. However, that rather defeats the point of a 503 and does not help "CPU usage". Just serving the static string for the 503 response, as you have already defined would seem to be the best option.
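For illustration, a minimal sketch of the two options just mentioned - the static string the question already defines, and the (discouraged) root error document. Use one or the other, not both:
# Option 1: serve a static string as the 503 response body (as in the question)
ErrorDocument 503 "Site temporarily disabled for crawling"
# Option 2 (discouraged above): serve the root URL as the 503 error document,
# which still makes the server generate the root page for every blocked request
ErrorDocument 503 /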
RewriteRule ^shop$ ./$1 [R=503,L]
When you set a code other than in the 3xx range, the substitution string (ie. ./$1) is completely ignored. You should simply include a hyphen (-) as the substitution string to explicitly indicate "no substitution". The L flag is also superfluous here. When specifying a non-3xx return code, the L flag is implied.
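As a minimal sketch of just that correction (the conditions discussed below are omitted here), the rule on its own would become:
RewriteRule ^shop$ - [R=503]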
It's not a robots.txt URL (at this point I think I should remove it, not even needed)
Agreed, this check is entirely superfluous. You are already checking that the request is /shop, so it cannot possibly be /robots.txt as well. (Any request for /robots.txt does not include a query string either.)
RewriteCond %{QUERY_STRING} !(^|&)s=*
For some reason you've negated (! prefix) the condition (perhaps in an attempt to make it work?) - but you need this to be a positive match for the s URL parameter. Note that the regex (^|&)s=* is incorrect. The trailing * matches the preceding pattern (the =) 0 or more times, so this regex matches just s, or s=, or s== - it does not require the parameter to have any value at all.
To match the s URL parameter with "anything in it" (including nothing) you simply need to remove the trailing *, eg. (^|&)s=. To match the s URL param with something as the value, then match a single character except &, eg. (^|&)s=[^&].
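A short sketch of the two condition variants just described (pick one or the other, not both):
# "s" parameter present, value may be empty
RewriteCond %{QUERY_STRING} (^|&)s=
# "s" parameter present with a non-empty value
RewriteCond %{QUERY_STRING} (^|&)s=[^&]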
- User agent contains "bots" - this part is working fine, I tested it
"bots" or "bot", as mentioned in the preceding sentence? I can't imagine "bots" matching many bots, but "bot" will match a lot!
RewriteCond %{HTTP_USER_AGENT} ^.*(bots).*$ [NC]
This regex is rather complex for what it does. It also unnecessarily captures the word "bots" (which isn't being used later). The regex ^.*(bots).*$ is the same as simply bots (without the capturing group).
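And if the intention really is to match any user agent that contains "bot" (as the question's prose suggests), the simplified condition would presumably be:
RewriteCond %{HTTP_USER_AGENT} bot [NC]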
Taking the above points into consideration, we have:
ErrorDocument 503 "Site temporarily disabled for crawling"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} bots [NC]
RewriteCond %{QUERY_STRING} (?:^|&)s=
RewriteRule ^shop/?$ - [R=503]
The above rule does the following:
- Matches the URL-path shop or shop/
- And the user-agent string contains the word "bots"
- And the URL parameter s= is present (anywhere)
- If the above all match then a 503 (static string) is served.
However, I would query whether a 503 is really the correct response. Generally, you don't want bots to crawl internal search results at all; ever. It can be a bottomless pit and wastes crawl budget. Should this perhaps be a 403 instead?
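If a 403 is indeed the better fit, only the flag on the final rule would need to change, for example:
RewriteRule ^shop/?$ - [F]
(The F flag sends a 403 Forbidden response; as with other non-3xx codes, L is implied.)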
And, are you blocking these URLs in robots.txt already? (And this isn't sufficient to stop the "bad" bots?) If not, I would consider adding the following:
User-agent: *
Disallow: /shop?s=
Disallow: /shop/?s=
Disallow: /shop?*&s= # If the "s" param can occur anywhere
Disallow: /shop/?*&s= # (As above)