I would like to prevent search engines from indexing pages with certain GET parameters.
Example:
https://www.example.com/mypage.php should be indexed
https://www.example.com/mypage.php?myparam=1 should not be indexed
I have many pages (more than 10k) with GET parameters that are indexed in addition to the main page, despite a noindex directive.
I have this in robots.txt:
Disallow: /*?*myparam=
And this in the HTML:
<meta name="robots" content="noindex" />
All pages also have a canonical referring to the main page (without GET params).
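For reference, the canonical on the parameterised pages looks something like this (using the example URL above):
<link rel="canonical" href="https://www.example.com/mypage.php" />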
I would like to try the following, but as it may have catastrophic consequences if I do it wrong, I'd like to check whether my approach is OK:
<FilesMatch "\.php">
<If "%{QUERY_STRING} =~ /myparam/">
Header set X-Robots-Tag "noindex, noarchive"
</If>
</Files>
Is this approach ok? Or do you see a better one?
Answer:
I have this in robots.txt:
Disallow: /*?*myparam=
That's the problem.
robots.txt prevents search engine bots from crawling your site. It does not necessarily prevent these pages from getting indexed if they are being linked to.
If you prevent crawling then search engine bots are never going to see the meta robots tag in the HTML or the X-Robots-Tag HTTP response header, because the page is never requested. (Although you would normally be notified of this in the search results with a search description along the lines of "A description for this result is not available because of this site's robots.txt - learn more".)
So, you should remove that entry from the robots.txt file.
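As a sketch (assuming there is nothing else in the file that needs to stay), robots.txt would end up with that rule removed, for example:
User-agent: *
# The former "Disallow: /*?*myparam=" rule has been removed, so these URLs
# can be crawled and the noindex can actually be seen.
Disallow: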
However, there's also the matter of how you are determining that the pages are indexed and whether this is really a concern. For example, if you are using a site: search then this often returns URLs that are not ordinarily returned in organic searches. It is unusual for a URL that is blocked in robots.txt to be returned in the organic search results, since the "content" of that page is not indexed - just the URL. Often, a site: search is the only way to dig out these URL-only "indexed" URLs.
All pages also have a canonical referring to the main page (without GET params).
This should be enough by itself and is the preferred option if the non-parameter version of the URL is genuinely the canonical version (i.e. not a completely different page).
The "canonical" tag (if honoured) will effectively pass link-juice to the canonical URL.
However, the canonical tag is only "advisory". If Google determines that the canonical URL is not really canonical (e.g. if the content is sufficiently different) then it is ignored.
You can also resolve URL parameter canonicalisation in Google Search Console (GSC).
<FilesMatch "\.php">
<If "%{QUERY_STRING} =~ /myparam/">
Header set X-Robots-Tag "noindex, noarchive"
</If>
</Files>
UPDATE: The closing "tag" should be </FilesMatch>, not </Files>. Or, use the non-regex <Files "*.php"> directive instead (preferable).
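Putting that together, a minimal sketch of the corrected block (the same directives you already have, just with the non-regex <Files> wildcard and a matching closing tag):
<Files "*.php">
    <If "%{QUERY_STRING} =~ /myparam/">
        Header set X-Robots-Tag "noindex, noarchive"
    </If>
</Files>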
Otherwise, this is "OK", except that it does potentially catch too much. It would set the X-Robots-Tag header on any request that maps to a file whose name contains .php - not just as a file extension (even if the requested URL itself is not for a .php file) - and where the query string contains the string myparam anywhere (which is a bit general, since it would also match abcmyparamxyz=1, if that is a possibility).
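If you wanted to keep the regex-based approach, you could at least anchor both patterns; for example, something along these lines (an illustrative sketch, not the only way to do it):
<FilesMatch "\.php$">
    <If "%{QUERY_STRING} =~ /(^|&)myparam=/">
        Header set X-Robots-Tag "noindex, noarchive"
    </If>
</FilesMatch>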
You could be more specific and avoid the <FilesMatch> directive altogether. For example:
<If "%{REQUEST_URI} == '/mypage.php' && %{QUERY_STRING} =~ /(^|&)myparam=/">
Header set X-Robots-Tag "noindex, noarchive"
</If>
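Whichever variant you go with, it is worth checking the response headers afterwards, for example with curl against one of your real parameterised URLs (the URL below is just the example one from the question):
curl -sI 'https://www.example.com/mypage.php?myparam=1' | grep -i x-robots-tag
The X-Robots-Tag header should appear for the parameterised URL and be absent for the plain /mypage.php URL.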