Contradictory rules in robots.txt

Time:04-18

I'm attempting to scrape a website, and these two rules in its robots.txt seem to contradict each other:

User-agent: *
Disallow: *
Allow: /

Does Allow: / mean that I can scrape the entire website, or just the root? If it means I can scrape the entire site, then it directly contradicts the previous rule.

CodePudding user response:

If you are following the original robots.txt standard:

  • The * in the disallow line would be treated as a literal rather than a wildcard. That line would disallow URL paths that start with an asterisk. All URL paths start with a /, so that rule disallows nothing.
  • The Allow rule isn't in the original specification, so that line would be ignored.

Verdict: you can crawl the site.
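
To make the reasoning above concrete, here is a minimal sketch of literal prefix matching as described for the original standard. The function name allowed_original and the (directive, value) rule format are made up for illustration; this is not a full robots.txt parser and it assumes the rules have already been selected for the matching User-agent group.

```python
def allowed_original(path, rules):
    """Return True if `path` may be crawled under literal prefix matching.

    `rules` is a list of (directive, value) pairs for one User-agent group.
    This is an illustrative helper, not part of any library.
    """
    for directive, value in rules:
        if directive.lower() != "disallow":
            # Allow is not part of the original standard, so it is ignored.
            continue
        # An empty Disallow value blocks nothing; otherwise the value is a
        # literal prefix, so "*" only matches paths that start with "*".
        if value and path.startswith(value):
            return False
    return True


rules = [("Disallow", "*"), ("Allow", "/")]
print(allowed_original("/", rules))               # True
print(allowed_original("/any/page.html", rules))  # True
```

Because every URL path starts with /, the literal * prefix never matches, which is why nothing ends up disallowed.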


Google and a few other crawlers support wildcards and Allow rules. If you are following Google's extensions to robots.txt, here is how Google would interpret this robots.txt:

  • Both Allow: / and Disallow: * match any specific path on the site.
  • In the case of such a conflict, the more specific (i.e. longer) rule wins. / and * are each one character, so neither is considered more specific than the other.
  • In a case of a tie for specificity, the least restrictive rule wins. Allow is considered less restrictive than Disallow.

Verdict: You can crawl the site.
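
The three steps above can be sketched roughly as follows. This is a simplified illustration of "longest match wins, Allow wins ties", not Google's actual parser; the names pattern_to_regex and allowed_google_style are hypothetical, and the sketch assumes the rules have already been selected for the matching User-agent group.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    "*" matches any run of characters; a trailing "$" anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def allowed_google_style(path, rules):
    """Pick the longest matching rule; on a length tie, Allow wins."""
    best_len = -1
    best_allow = True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        kind = directive.lower()
        if kind not in ("allow", "disallow") or not pattern:
            continue
        if pattern_to_regex(pattern).match(path) is None:
            continue
        is_allow = kind == "allow"
        longer = len(pattern) > best_len
        tie_for_allow = len(pattern) == best_len and is_allow
        if longer or tie_for_allow:
            best_len = len(pattern)
            best_allow = is_allow
    return best_allow


rules = [("Disallow", "*"), ("Allow", "/")]
print(allowed_google_style("/any/page.html", rules))  # True
```

Here both patterns match and both have length 1, so the tie goes to Allow. With a longer rule such as Disallow: /private, that rule would win for paths under /private.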
