My site got hacked recently and has over 3 million pages now when it only has 30 pages (see screenshot).
How do I implement the correct 410 header in .htaccess?
I think the best tactic is to 410 all pages that contain a number OR .htm OR .html, as none of the real pages have these in the URL. For example:
https://example.com/cixc-20050gsakuramar/-b00006.htm
https://example.com/sfumato.php?nzlw-21833vetidm4
https://example.com/bzmt-5694ceti.html
https://example.com/pfks-14602sjp/ucqksti.htm
https://example.com/admv-15974mitem/318
Would this code work?
Redirect 410 /*0*
Redirect 410 /*1*
Redirect 410 /*2*
Redirect 410 /*3*
Redirect 410 /*4*
Redirect 410 /*5*
Redirect 410 /*6*
Redirect 410 /*7*
Redirect 410 /*8*
Redirect 410 /*9*
Redirect 410 /*.html*
Redirect 410 /*.htm*
I've also pieced together a rewrite rule which might also work?
RewriteRule ^([0-9]+)$ - [G,L]
I am also thinking of adding Disallow rules to robots.txt, like this:
Disallow: /*0*
Disallow: /*1*
Disallow: /*2*
Disallow: /*3*
Disallow: /*4*
Disallow: /*5*
Disallow: /*6*
Disallow: /*7*
Disallow: /*8*
Disallow: /*9*
Disallow: /*.htm
Disallow: /*.html
CodePudding user response:
The Redirect directive of mod_alias doesn't support wildcards, so rules such as Redirect 410 /*0* would not do what you expect. You could turn them into RedirectMatch directives, which support regular expressions. I'd combine all the numbers into one rule and the .htm/.html suffixes into another:
RedirectMatch Gone ".*[0-9].*"
RedirectMatch Gone ".*\.html?$"
From your Google Search Console screenshot, it looks like some of the URLs have query strings in them (the part after a ?). mod_alias doesn't consult the query string at all when matching the URL. If the .html appears in the query string and not in the URL path, RedirectMatch won't be able to match it.
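To see which of the question's example URLs a path-only match would miss, here is a small Python sketch. It is not Apache itself: it splits each URL into path and query string and applies the two RedirectMatch patterns to the path alone. The helper name `matched_by_path` is made up for illustration, and Python's `re` module is only assumed to behave like Apache's regex engine for patterns this simple.

```python
import re
from urllib.parse import urlsplit

# The two patterns from the RedirectMatch directives above.
PATH_PATTERNS = [re.compile(r".*[0-9].*"), re.compile(r".*\.html?$")]

# Example URLs taken from the question.
EXAMPLE_URLS = [
    "https://example.com/cixc-20050gsakuramar/-b00006.htm",
    "https://example.com/sfumato.php?nzlw-21833vetidm4",
    "https://example.com/bzmt-5694ceti.html",
    "https://example.com/pfks-14602sjp/ucqksti.htm",
    "https://example.com/admv-15974mitem/318",
]


def matched_by_path(url):
    """Return True if either pattern matches the URL path (query string ignored)."""
    path = urlsplit(url).path
    return any(p.search(path) for p in PATH_PATTERNS)


if __name__ == "__main__":
    for url in EXAMPLE_URLS:
        print(matched_by_path(url), url)
```

Only the sfumato.php URL is left unmatched here, because its digits appear in the query string rather than in the path.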
I'd recommend going with mod_rewrite rules, which can match the query string. Another reason to prefer mod_rewrite would be if you have other rewrite rules in your .htaccess. Additional rewrite rules would be less likely to conflict than mod_alias rules.
I've added a condition to skip wp-content URLs because, in the comments, you say you actually have some CSS files with numbers in them.
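RewriteEngine on
# 410 any path containing a digit, except known legitimate assets and real files.
# RewriteCond lines apply only to the RewriteRule that immediately follows them.
RewriteCond %{REQUEST_URI} !^/?wp-content/
RewriteCond %{REQUEST_URI} !pagespeed
RewriteCond %{REQUEST_URI} !fontawesome
RewriteCond %{REQUEST_URI} !webfont
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule [0-9] - [G,L]
# 410 any path ending in .htm or .html.
RewriteRule \.html?$ - [G,L]
# 410 any query string containing a digit, except version parameters (v= or ver=).
RewriteCond %{QUERY_STRING} !v(er)?=
RewriteCond %{QUERY_STRING} [0-9]
RewriteRule . - [G,L]
# 410 any query string ending in .htm or .html.
RewriteCond %{QUERY_STRING} \.html?$
RewriteRule . - [G,L]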
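Before deploying, it can help to sanity-check the logic against known URLs. The following Python sketch is a rough model of the four rules above, not a substitute for testing on the real server. It assumes the .htaccess sits in the document root (so RewriteRule sees the path without its leading slash) and that each group of RewriteCond lines applies only to the rule directly after it. The function name `is_gone` and the `file_exists` parameter are made up for illustration; `file_exists` stands in for the `!-f` check, which needs the server's filesystem.

```python
import re
from urllib.parse import urlsplit


def is_gone(url, file_exists=False):
    """Approximate whether the rewrite rules above would answer 410 for url."""
    parts = urlsplit(url)
    uri = parts.path           # like REQUEST_URI, with the leading slash
    rel = uri.lstrip("/")      # what RewriteRule matches in a root .htaccess
    query = parts.query

    # Rule 1: digit in the path, unless one of the exclusions applies.
    excluded = (
        re.search(r"^/?wp-content/", uri)
        or re.search(r"pagespeed|fontawesome|webfont", uri)
        or file_exists
    )
    if re.search(r"[0-9]", rel) and not excluded:
        return True
    # Rule 2: path ends in .htm or .html (no conditions attached).
    if re.search(r"\.html?$", rel):
        return True
    # Rule 3: digit in the query string, unless it looks like v= or ver=.
    if rel and re.search(r"[0-9]", query) and not re.search(r"v(er)?=", query):
        return True
    # Rule 4: query string ends in .htm or .html.
    if rel and re.search(r"\.html?$", query):
        return True
    return False


if __name__ == "__main__":
    # First URL is from the question; the other two are hypothetical real pages.
    for url in [
        "https://example.com/sfumato.php?nzlw-21833vetidm4",
        "https://example.com/about/",
        "https://example.com/wp-content/themes/x/style2.css",
    ]:
        print(is_gone(url), url)
```

Running the 30 real page URLs through a check like this, alongside a sample of the spam URLs, is a quick way to spot a legitimate page that would accidentally be marked as gone.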
I wouldn't recommend using a Disallow in robots.txt because Google sometimes indexes disallowed URLs anyway even if it can't crawl them. Blocking crawling would also stop Googlebot from ever seeing the 410 responses.