I'm attempting to scrape a website, and these two rules in its robots.txt seem contradictory:
```
User-agent: *
Disallow: *
Allow: /
```
Does `Allow: /` mean that I can scrape the entire website, or just the root? If it means I can scrape the entire site, then it directly contradicts the previous rule.
CodePudding user response:
If you are following the original robots.txt standard:
- The `*` in the disallow line would be treated as a literal rather than a wildcard. That line would disallow URL paths that start with an asterisk. All URL paths start with a `/`, so that rule disallows nothing.
- The `Allow` rule isn't in the original specification, so that line would be ignored.
Verdict: you can crawl the site.
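You can check this behavior with Python's standard-library `urllib.robotparser`, which follows the original standard closely enough for this case: it treats `*` in a path as a literal character, not a wildcard (the user agent name `MyBot` below is just a placeholder):

```python
from urllib import robotparser

# Feed the robots.txt in question directly to the stdlib parser.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: *",
    "Allow: /",
])

# The Disallow line never matches (no URL path starts with '*'),
# so every path on the site is fetchable.
print(rp.can_fetch("MyBot", "https://example.com/"))           # → True
print(rp.can_fetch("MyBot", "https://example.com/some/page"))  # → True
```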
Google and a few other crawlers support wildcards and `Allow` rules. If you are following Google's extensions, here is how Google would interpret this robots.txt:
- Both `Allow: /` and `Disallow: *` match every path on the site.
- In the case of such a conflict, the more specific (i.e. longer) rule wins. `/` and `*` are each one character, so neither is considered more specific than the other.
- In a tie for specificity, the least restrictive rule wins, and `Allow` is considered less restrictive than `Disallow`.
Verdict: You can crawl the site.
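The tie-breaking logic above can be sketched in a few lines of Python. This is a simplified illustration of longest-match resolution, not Google's actual implementation; the `$` end-anchor extension is omitted for brevity:

```python
import re

def google_allows(path, rules):
    """Resolve robots.txt rules Google-style: the longest matching
    pattern wins, and on a length tie, Allow beats Disallow.

    rules: list of (directive, pattern) tuples,
           e.g. [("Disallow", "*"), ("Allow", "/")].
    """
    matches = []
    for directive, pattern in rules:
        # '*' in a pattern matches any sequence of characters.
        regex = "^" + ".*".join(re.escape(part) for part in pattern.split("*"))
        if re.match(regex, path):
            # Record (specificity, is_allow); max() then picks the longest
            # pattern, and True > False breaks length ties in favor of Allow.
            matches.append((len(pattern), directive == "Allow"))
    if not matches:
        return True  # no rule matches: crawling is allowed
    return max(matches)[1]

# The rules from the question: a one-character tie, resolved to Allow.
print(google_allows("/any/page", [("Disallow", "*"), ("Allow", "/")]))  # → True
```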