robots.txt and htaccess (while CMS is in sub-folder)

Time:05-12

My CMS is placed in a subfolder, so I forward everything there via .htaccess. That is good for the CMS, and the following snippet works without any problems, but bad for files like robots.txt, which have to be stored in the web root (e.g. https://domain.xyz/robots.txt). If I request that URL, the browser and the crawlers are (of course) forwarded to https://domain.xyz/TEST:

<IfModule mod_rewrite.c>
    RewriteEngine On

    RewriteCond %{HTTPS} !=on
    RewriteRule ^ https://domain.xyz%{REQUEST_URI} [L,R=301]

    RewriteCond %{HTTP_HOST} !^domain\.xyz$ [NC]
    RewriteRule ^ https://domain.xyz/TEST [L,R=301]

    RewriteCond %{REQUEST_URI} !^/TEST
    RewriteRule ^ https://domain.xyz/TEST [L,R=301]
</IfModule>

So I have to exclude those files, and I would add

RewriteCond %{THE_REQUEST} !/(robots\.txt|sitemap\.xml)\s [NC]

for the files robots.txt and sitemap.xml before the RewriteRule, but it doesn't work. What's wrong? Could somebody please help me with that? Thank you.

CodePudding user response:

Arguably, this is not "forwarding", this is "redirecting", as in an external redirect. Forwarding would more commonly be used to describe an internal rewrite (where the URL does not change).

"but bad for files like robots.txt, which have to be stored in the web root"

Not necessarily. They don't need to be stored in (and accessed from) the web root. Google and other search engines do follow redirects when requesting robots.txt, XML sitemaps and similar files. From the Google Docs for robots.txt - "Handling of errors and HTTP status codes":

3xx (redirection)
Google follows at least five redirect hops as defined by RFC 1945 and then stops and treats it as a 404 for the robots.txt.

However, you can still include an exception if you wish, but you have an error in your regex...

RewriteCond %{THE_REQUEST} !/(robots\.txt|sitemap\.xml)\s [NC]

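The \s at the end of the CondPattern is a whitespace shorthand, not an end-of-string anchor. It only has a chance of matching because THE_REQUEST contains the whole first line of the request (eg. GET /robots.txt HTTP/1.1), and it makes the pattern fragile: without a start-of-string anchor it also matches /foo/robots.txt, and it stops matching as soon as a query string is present. Note also that a RewriteCond applies only to the single RewriteRule that immediately follows it, so the exception has to be repeated before every rule that redirects to /TEST. It is simpler to test REQUEST_URI, which contains just the URL-path, and anchor the pattern at both ends.

For example, it should be:

RewriteCond %{REQUEST_URI} !^/(robots\.txt|sitemap\.xml)$ [NC]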
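As a rough sketch of how such an exception could be wired into the complete rule set: the version below assumes the check is made against REQUEST_URI (which holds only the URL-path) and repeats it before both rules that redirect to /TEST, since a condition only affects the next RewriteRule. Leaving the HTTPS rule without an exception is an assumption on the basis that it preserves the requested path, and the trailing slash on /TEST/ follows the mod_dir note further down.

```apache
<IfModule mod_rewrite.c>
    RewriteEngine On

    # 1. Force HTTPS; the requested path is preserved, so no exception is needed
    RewriteCond %{HTTPS} !=on
    RewriteRule ^ https://domain.xyz%{REQUEST_URI} [L,R=301]

    # 2. Canonicalise the hostname, except for robots.txt and sitemap.xml
    RewriteCond %{HTTP_HOST} !^domain\.xyz$ [NC]
    RewriteCond %{REQUEST_URI} !^/(robots\.txt|sitemap\.xml)$ [NC]
    RewriteRule ^ https://domain.xyz/TEST/ [L,R=301]

    # 3. Send everything outside /TEST to the CMS, with the same exception
    RewriteCond %{REQUEST_URI} !^/TEST
    RewriteCond %{REQUEST_URI} !^/(robots\.txt|sitemap\.xml)$ [NC]
    RewriteRule ^ https://domain.xyz/TEST/ [L,R=301]
</IfModule>
```

With this arrangement, a request for robots.txt on a non-canonical hostname is served as-is rather than redirected; drop the exception from rule 2 if that is not wanted.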

Or, include a positive-match rule before your existing rules that prevents any later rule (ie. the redirects) from being processed when one of these files is requested:

# Prevent further processing if "robots.txt" or "sitemap.xml" is requested
RewriteRule ^(robots\.txt|sitemap\.xml)$ - [NC,L]

Separately, regarding your existing redirect:

RewriteRule ^ https://domain.xyz/TEST [L,R=301]

Since TEST is a physical directory, you should append a trailing slash to the redirected URL, ie. /TEST/; otherwise Apache (mod_dir) will append the trailing slash with a second redirect.
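Putting the positive-match rule and the trailing slash together, the complete file might look something like the sketch below. The placement of the exception after the HTTPS redirect is an assumption, chosen so that the two files are still upgraded to HTTPS (that redirect keeps the requested path); move it to the very top if the files should be exempt from every redirect.

```apache
<IfModule mod_rewrite.c>
    RewriteEngine On

    # Force HTTPS first; this redirect keeps the requested path
    RewriteCond %{HTTPS} !=on
    RewriteRule ^ https://domain.xyz%{REQUEST_URI} [L,R=301]

    # Stop here for robots.txt and sitemap.xml so no later redirect applies
    RewriteRule ^(robots\.txt|sitemap\.xml)$ - [NC,L]

    # Canonicalise the hostname
    RewriteCond %{HTTP_HOST} !^domain\.xyz$ [NC]
    RewriteRule ^ https://domain.xyz/TEST/ [L,R=301]

    # Redirect everything outside /TEST to the CMS directory
    RewriteCond %{REQUEST_URI} !^/TEST
    RewriteRule ^ https://domain.xyz/TEST/ [L,R=301]
</IfModule>
```

The single exception rule replaces the per-rule conditions, which keeps the file shorter if more root-level files need to be exempted later.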

You will need to clear your browser cache before testing since the 301 (permanent) redirects will have been cached by the browser.
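To avoid running into the cache again on later changes, one common approach is to test with temporary (302) redirects, which browsers generally do not cache, and only switch to 301 once everything behaves as intended. A minimal sketch for the final rule, assuming the REQUEST_URI exception and the /TEST/ trailing slash are used:

```apache
# Temporary (302) redirect while testing; switch back to R=301 once it works
RewriteCond %{REQUEST_URI} !^/TEST
RewriteCond %{REQUEST_URI} !^/(robots\.txt|sitemap\.xml)$ [NC]
RewriteRule ^ https://domain.xyz/TEST/ [L,R=302]
```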
