- I want to allow image crawling on my site from a couple of different bots and exclude all others.
- I want to allow images in at least one folder to not be blocked for any request.
- I don't want to block image requests from visitors on my own site.
- I don't want to include my domain name in the .htaccess file for portability.
The reason I ask this here rather than simply testing the following code myself is that I work on my own and have no colleagues to ask or external resources to test from. I think what I've got is correct, but I find .htaccess rules extremely confusing, and I don't know what I don't know at this point.
RewriteCond %{HTTP_REFERER} !^$ [OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?bing\..*$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?facebook\..*$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?google\..*$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?instagram\..*$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?linkedin\..*$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?reddit\..*$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?twitter\..*$ [NC,OR]
RewriteCond %{REQUEST_URI} !^/cross-origin-resources/ [NC,OR]
RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1/.* [NC]
RewriteRule \.(bmp|gif|jpe?g|png|webp)$ - [F,L,NC]
I've tested it on htaccess tester and it looks good, but it does complain about the second-to-last line when tested with the following URL: http://www.example.co.uk/poignant/foo.webp
CodePudding user response:
You have the logic in reverse. As written, these conditions (`RewriteCond` directives) will always be successful and the request will always be blocked.

You have a series of negated conditions that are OR'd. These would only fail (ie. not block the request) if all the conditions match, which is impossible. (eg. The `Referer` header cannot be `bing` and `facebook` at the same time.)

You need to remove the `OR` flag on all your `RewriteCond` directives so that they are implicitly AND'd.
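With the `OR` flags removed, the block would look something like this (a sketch based on your own rules, untested):

```apache
# Block only when the Referer is present, is not an allowed site,
# is not this site itself, and the image is not in the shared folder
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?bing\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?facebook\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?google\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?instagram\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?linkedin\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?reddit\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?twitter\. [NC]
RewriteCond %{REQUEST_URI} !^/cross-origin-resources/ [NC]
RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1/ [NC]
RewriteRule \.(bmp|gif|jpe?g|png|webp)$ - [F,NC]
```

Now the rule only triggers when every condition succeeds, ie. when the request genuinely comes from a non-allowed external referrer.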
- I want to allow image crawling on my site from a couple of different bots and exclude all others.
Once you've corrected the `OR`/AND as stated above, this rule will likely allow all bots to crawl your site images, because bots generally do not send a `Referer` header. These directives allow certain websites to display your images on their domain (ie. hotlinking). This is probably the intention; however, it's not what you are stating in point #1.

(To block bots from crawling your site you would need to check the `User-Agent` request header, which would probably be better done in a separate rule.)
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?bing\..*$
Minor point, but the `$` at the end of the regex is superfluous. There's no need to match the entire `Referer` when you are only interested in the hostname. These sites (or browsers) probably have a Referrer-Policy set that prevents the URL-path being sent in the `Referer` header anyway, but it is still unnecessary.
RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1/.* [NC]
In comments, you were asking what this line does. This satisfies points #3 and #4 in your list. It ensures that the requested `Host` header (`HTTP_HOST`) matches the hostname in the `Referer`, ie. that the request is coming from the same site.

(Again, the trailing `.*` on the regex is unnecessary and should be removed.)
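With the trailing `.*` dropped, the condition reads:

```apache
RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1/ [NC]
```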
This is achieved using an internal backreference `\1` in the regex against the `HTTP_REFERER` that matches the `HTTP_HOST` in the TestString (first argument). The `@@` string is just an arbitrary string that does not occur in the `HTTP_HOST` or `HTTP_REFERER` server variables.
This is clearer if you expand the TestString to see what is being matched. If you make an internal request to https://example.com/myimage.jpg from your homepage (ie. https://example.com/) then the TestString in the `RewriteCond` directive is:

example.com@@https://example.com/

This is then matched against the regex `^([^@]*)@@https?://\1/`.
- `([^@]*)` - the first capturing group captures `example.com` (the value of `HTTP_HOST`).
- `@@https?://` - simply matches `@@https://` in the TestString.
- `\1` - this is an internal backreference, so it must match the value captured by the first capturing group (above). In this example it must match `example.com`. And it does, so there is a successful match.
- The `!` prefix on the CondPattern (not strictly part of the regex) negates the whole expression, so the condition is successful when the regex does not match.
So, in the above example, the regex matches and the condition fails (so the rule is not triggered and the request is not blocked).

However, if a request is made to https://example.com/myimage.jpg from an external site, eg. https://external-site.example/, then the TestString in the `RewriteCond` directive is:

example.com@@https://external-site.example/

Following the steps above, the regex fails to match. The negated condition is therefore successful and the rule is triggered, so the request is blocked. (Unless one of the other conditions failed.)
RewriteCond %{HTTP_REFERER} !^$
This allows an empty (or absent) `Referer` header. You "probably" do need this. It allows bots to crawl your images, it permits direct requests to images, and it also allows users who have chosen to suppress the `Referer` header to view the images on your site.

HOWEVER, it's also possible these days for a site to set a Referrer-Policy that completely suppresses the `Referer` header being sent (by the browser) and so bypass your hotlink protection.
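For illustration, any site can do this with a single response header; with mod_headers that would be:

```apache
# A site sending this header causes browsers to omit the Referer
# entirely, so requests from it pass your "!^$" condition
Header always set Referrer-Policy "no-referrer"
```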
RewriteRule \.(bmp|gif|jpe?g|png|webp)$ - [F,L,NC]
Minor point, but the `L` flag is not required when the `F` flag is used (it is implied).

Are you really serving `.bmp` images?!