How do I prevent search engine crawlers from indexing a domain on AWS?


We have a staging environment running on a .dev domain alongside our production environment running on another domain.

Is there any way to prevent our .dev domain from being indexed?

I don't want our staging website to be found on Google when searching for the product.

The domain is hosted on AWS, using Route 53 for DNS and CloudFront as the CDN.

Applications are hosted on ECS with a load balancer in front.

CodePudding user response:

If you can add files to whatever serves the domain (e.g. an S3 bucket, EC2 instance, or ECS container), place a robots.txt file at the root of the site.

Set the contents as:

User-agent: *
Disallow: /

Make sure the file (object) is publicly readable so that crawlers from Google, Bing, etc. can fetch and process it.

This tells well-behaved crawlers not to crawl any of your files, which in most cases keeps your domain out of search results.
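If the staging site is served from an S3 bucket, a minimal sketch of uploading such a robots.txt with boto3 might look like the following (the bucket name is a placeholder, and the public-read ACL only works if the bucket's Block Public Access settings allow ACLs; otherwise grant read access through a bucket policy instead):

import boto3

s3 = boto3.client("s3")

# Placeholder bucket name - replace with the bucket backing the .dev domain.
s3.put_object(
    Bucket="my-staging-bucket",
    Key="robots.txt",
    Body=b"User-agent: *\nDisallow: /\n",
    ContentType="text/plain",
    # Requires the bucket's Block Public Access settings to permit ACLs;
    # use a bucket policy for public read access if they don't.
    ACL="public-read",
)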


Please note that if your production website links to the staging website, Google's crawler can still index the staging site, since it will follow links from your production pages to your staging ones.

In this case, robots.txt won't always prevent the website from being indexed, and you'll need the X-Robots-Tag: noindex HTTP response header on the responses your CloudFront distribution serves.

If you don't have a web server in the request path that can add the header itself, you need a slightly more involved solution, such as an AWS Lambda@Edge function that adds it to every response.

That will prevent indexing by Google regardless of whether the page is linked to or not.
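As a sketch (not a definitive implementation), a Python Lambda@Edge function attached to the distribution's origin response event could add the header like this:

def lambda_handler(event, context):
    # Runs on CloudFront's "origin response" event for the staging distribution.
    response = event["Records"][0]["cf"]["response"]
    headers = response["headers"]

    # Tell crawlers not to index or follow anything served by this distribution.
    headers["x-robots-tag"] = [
        {"key": "X-Robots-Tag", "value": "noindex, nofollow"}
    ]

    return response

The function has to be created in us-east-1 and associated with the cache behaviour's origin response trigger; CloudFront will then attach the header to every object it serves for the staging domain.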
