- I want to allow image crawling on my site from a couple of different bots and exclude all others.
- I want to allow images in at least one folder to not be blocked for any request.
- I don't want to block image requests from visitors on my own site.
- I don't want to include my domain name in the .htaccess file for portability.
The reason I ask this here rather than simply testing the following code myself is that I work on my own and have no colleagues to ask or external resources to test from. I think what I've got is correct, but I find .htaccess rules extremely confusing, and I don't know what I don't know at this point.
RewriteCond %{HTTP_REFERER} !^$ [OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?bing\..*$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?facebook\..*$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?google\..*$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?instagram\..*$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?linkedin\..*$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?reddit\..*$ [NC,OR]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?twitter\..*$ [NC,OR]
RewriteCond %{REQUEST_URI} !^/cross-origin-resources/ [NC,OR]
RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1/.* [NC]
RewriteRule \.(bmp|gif|jpe?g|png|webp)$ - [F,L,NC]
I've tested it on htaccess tester and it looks good, but it does complain about the second-to-last line when tested with the following URL: http://www.example.co.uk/poignant/foo.webp
CodePudding user response:
You have the logic in reverse. As written, these conditions (`RewriteCond` directives) will always be successful and the request will always be blocked.

You have a series of negated conditions that are OR'd. These would only fail (ie. not block the request) if all the conditions match, which is impossible. (eg. The `Referer` header cannot be `bing` and `facebook` at the same time.)

You need to remove the `OR` flag on all your `RewriteCond` directives so that they are implicitly AND'd.
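With the `OR` flags removed, the block would look something like this (a sketch based on your own rules, untested):

```apache
# Block only when the Referer is present, is not an allowed site,
# is not this site itself, and the image is not in the shared folder
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?bing\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?facebook\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?google\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?instagram\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?linkedin\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?reddit\. [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?twitter\. [NC]
RewriteCond %{REQUEST_URI} !^/cross-origin-resources/ [NC]
RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1/ [NC]
RewriteRule \.(bmp|gif|jpe?g|png|webp)$ - [F,NC]
```

Now the rule only triggers when every condition succeeds, ie. when the request genuinely comes from a non-allowed external referrer.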
- I want to allow image crawling on my site from a couple of different bots and exclude all others.
Once you've corrected the `OR`/AND as stated above, this rule will likely allow all bots to crawl your site images, because bots generally do not send a `Referer` header. These directives allow certain websites to display your images on their domain (ie. hotlinking). This is probably the intention; however, it's not what you are stating in point #1.

(To block bots from crawling your site you would need to check the `User-Agent` request header, which would probably be better done in a separate rule.)
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?bing\..*$
Minor point, but the `$` at the end of the regex is superfluous. There's no need to match the entire `Referer` when you are only interested in the hostname. These sites (or browsers) probably have a Referrer-Policy set that prevents the URL-path being sent in the `Referer` header anyway, but it is still unnecessary.
RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1/.* [NC]
In comments, you were asking what this line does. This satisfies points #3 and #4 in your list. It ensures that the requested `Host` header (`HTTP_HOST`) matches the hostname in the `Referer`, ie. that the request is coming from the same site.

(Again, the trailing `.*` on the regex is unnecessary and should be removed.)
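With the trailing `.*` dropped, the condition reads:

```apache
RewriteCond %{HTTP_HOST}@@%{HTTP_REFERER} !^([^@]*)@@https?://\1/ [NC]
```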
This is achieved using an internal backreference `\1` in the regex against the `HTTP_REFERER` that matches the `HTTP_HOST` in the TestString (first argument). The `@@` string is just an arbitrary string that does not occur in the `HTTP_HOST` or `HTTP_REFERER` server variables.
This is clearer if you expand the TestString to see what is being matched. If you make an internal request to https://example.com/myimage.jpg from your homepage (ie. https://example.com/) then the TestString in the `RewriteCond` directive is:

example.com@@https://example.com/

This is then matched against the regex `^([^@]*)@@https?://\1/`.
- `([^@]*)` - the first capturing group captures `example.com` (the value of `HTTP_HOST`).
- `@@https?://` - simply matches `@@https://` in the TestString.
- `\1` - this is an internal backreference, so it must match the value captured by the first capturing group (above). In this example it must match `example.com`. And it does, so there is a successful match.
- The `!` prefix on the CondPattern (not strictly part of the regex) negates the whole expression, so the condition is successful when the regex does not match.
So, in the above example, the regex matches and the condition fails (so the rule is not triggered and the request is not blocked).

However, if a request is made to https://example.com/myimage.jpg from an external site, eg. https://external-site.example/, then the TestString in the `RewriteCond` directive is:

example.com@@https://external-site.example/

Following the steps above, the regex fails to match. The negated condition is therefore successful and the rule is triggered, so the request is blocked. (Unless one of the other conditions failed.)
RewriteCond %{HTTP_REFERER} !^$
This allows an empty (or absent) `Referer` header. You "probably" do need this. It allows bots to crawl your images, it permits direct requests to images, and it also allows users who have chosen to suppress the `Referer` header to view the images on your site.

HOWEVER, it's also possible these days for a site to set a Referrer-Policy that completely suppresses the `Referer` header being sent (by the browser) and so bypass your hotlink protection.
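For illustration, any site can do this with a single response header; with mod_headers that would be:

```apache
# A site sending this header causes browsers to omit the Referer
# entirely, so requests from it pass your "!^$" condition
Header always set Referrer-Policy "no-referrer"
```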
RewriteRule \.(bmp|gif|jpe?g|png|webp)$ - [F,L,NC]
Minor point, but the `L` flag is not required when the `F` flag is used (it is implied).

Are you really serving `.bmp` images?!