I'm attempting to scrape a website, and these two rules in its robots.txt seem contradictory:
```
User-agent: *
Disallow: *
Allow: /
```
Does `Allow: /` mean that I can scrape the entire website, or just the root? If it means I can scrape the entire site, then it directly contradicts the previous rule.
CodePudding user response:
If you are following the original robots.txt standard:
- The `*` in the disallow line would be treated as a literal rather than a wildcard. That line would disallow URL paths that start with an asterisk. All URL paths start with a `/`, so that rule disallows nothing.
- The `Allow` rule isn't in the original specification, so that line would be ignored.
Verdict: you can crawl the site.
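You can check this behavior with Python's standard-library `urllib.robotparser`, which follows the original standard closely enough for this case: it treats `*` in a path as a literal character, not a wildcard (the user agent name `MyBot` below is just a placeholder):

```python
from urllib import robotparser

# Feed the robots.txt in question directly to the stdlib parser.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: *",
    "Allow: /",
])

# The Disallow line never matches (no URL path starts with '*'),
# so every path on the site is fetchable.
print(rp.can_fetch("MyBot", "https://example.com/"))           # → True
print(rp.can_fetch("MyBot", "https://example.com/some/page"))  # → True
```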
Google and a few other crawlers support wildcards and `Allow` rules. If you are following Google's extensions, here is how Google would interpret this robots.txt:
- Both `Allow: /` and `Disallow: *` match every path on the site.
- In the case of such a conflict, the more specific (i.e. longer) rule wins. `/` and `*` are each one character, so neither is considered more specific than the other.
- In a tie for specificity, the least restrictive rule wins, and `Allow` is considered less restrictive than `Disallow`.
Verdict: You can crawl the site.
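The tie-breaking logic above can be sketched in a few lines of Python. This is a simplified illustration of longest-match resolution, not Google's actual implementation; the `$` end-anchor extension is omitted for brevity:

```python
import re

def google_allows(path, rules):
    """Resolve robots.txt rules Google-style: the longest matching
    pattern wins, and on a length tie, Allow beats Disallow.

    rules: list of (directive, pattern) tuples,
           e.g. [("Disallow", "*"), ("Allow", "/")].
    """
    matches = []
    for directive, pattern in rules:
        # '*' in a pattern matches any sequence of characters.
        regex = "^" + ".*".join(re.escape(part) for part in pattern.split("*"))
        if re.match(regex, path):
            # Record (specificity, is_allow); max() then picks the longest
            # pattern, and True > False breaks length ties in favor of Allow.
            matches.append((len(pattern), directive == "Allow"))
    if not matches:
        return True  # no rule matches: crawling is allowed
    return max(matches)[1]

# The rules from the question: a one-character tie, resolved to Allow.
print(google_allows("/any/page", [("Disallow", "*"), ("Allow", "/")]))  # → True
```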