How do I implement a 410 header in .htaccess to remove hacked URLs?

Time:07-28

My site was hacked recently and now appears to have over 3 million pages, when it really has only 30 (see screenshot).

How do I implement the correct 410 header in .htaccess?

I think the best tactic is to return 410 for every URL that contains a number, .htm, or .html, since none of the real pages have these in their URLs. For example:

  • https://example.com/cixc-20050gsakuramar/-b00006.htm
  • https://example.com/sfumato.php?nzlw-21833vetidm4
  • https://example.com/bzmt-5694ceti.html
  • https://example.com/pfks-14602sjp/ucqksti.htm
  • https://example.com/admv-15974mitem/318

Would this code work?

Redirect 410 /*0*
Redirect 410 /*1*
Redirect 410 /*2*
Redirect 410 /*3*
Redirect 410 /*4*
Redirect 410 /*5*
Redirect 410 /*6*
Redirect 410 /*7*
Redirect 410 /*8*
Redirect 410 /*9*
Redirect 410 /*.html*
Redirect 410 /*.htm*

I've also pieced together a rewrite rule which might also work?

RewriteRule ^([0-9]+)$ - [G,L]

I am also thinking of adding Disallow rules to robots.txt, like this:

Disallow: /*0*
Disallow: /*1*
Disallow: /*2*
Disallow: /*3*
Disallow: /*4*
Disallow: /*5*
Disallow: /*6*
Disallow: /*7*
Disallow: /*8*
Disallow: /*9*
Disallow: /*.htm
Disallow: /*.html

Screenshot

CodePudding user response:

The Redirect directive of mod_alias doesn't support wildcards; it only matches literal URL-path prefixes. So rules such as Redirect 410 /*0* would not do what you expect. You could turn them into RedirectMatch directives, which support regular expressions. I'd combine all the digits into one rule and the .htm/.html suffixes into another:

RedirectMatch Gone ".*[0-9].*" 
RedirectMatch Gone ".*\.html?$" 

From your Google Search Console screenshot, it looks like some of the URLs have query strings (the part after the ?). mod_alias doesn't consult the query string at all when matching the URL. If the digits or the .html appear only in the query string and not in the URL path, RedirectMatch won't be able to match them.
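To see which of the example URLs from the question a path-only pattern would catch, here is a rough Python sketch. It only approximates Apache's behavior by applying the same two regular expressions to the URL path; the helper name path_only_match is made up for illustration and is not part of any Apache tooling.

```python
import re
from urllib.parse import urlsplit

# The two regular expressions from the RedirectMatch lines above.
PATTERNS = [re.compile(r".*[0-9].*"), re.compile(r".*\.html?$")]

# Example URLs copied from the question.
URLS = [
    "https://example.com/cixc-20050gsakuramar/-b00006.htm",
    "https://example.com/sfumato.php?nzlw-21833vetidm4",
    "https://example.com/bzmt-5694ceti.html",
    "https://example.com/pfks-14602sjp/ucqksti.htm",
    "https://example.com/admv-15974mitem/318",
]


def path_only_match(url):
    """Hypothetical helper: test only the URL path, ignoring the query string."""
    path = urlsplit(url).path
    return any(p.search(path) for p in PATTERNS)


results = {url: path_only_match(url) for url in URLS}
for url, hit in results.items():
    print("410 " if hit else "miss", url)
```

In this sketch the sfumato.php URL is the only miss, because its digits sit entirely in the query string.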

I'd recommend going with mod_rewrite rules, which can match the query string. Another reason to prefer mod_rewrite would be if you already have other rewrite rules in your .htaccess. Additional rewrite rules are less likely to conflict with them than mod_alias rules are.

I've added a condition to skip wp-content URLs because, as you mention in the comments, you have some CSS files with numbers in their names.

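RewriteEngine on

# 410 any path containing a digit, except the excluded asset paths and files that exist on disk
RewriteCond %{REQUEST_URI} !^/?wp-content/
RewriteCond %{REQUEST_URI} !pagespeed
RewriteCond %{REQUEST_URI} !fontawesome
RewriteCond %{REQUEST_URI} !webfont
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule [0-9] - [G,L]

# 410 any path ending in .htm or .html
RewriteRule \.html?$ - [G,L]

# 410 any query string containing a digit, unless it contains v= or ver=
RewriteCond %{QUERY_STRING} !v(er)?=
RewriteCond %{QUERY_STRING} [0-9]
RewriteRule . - [G,L]

# 410 any query string ending in .htm or .html
RewriteCond %{QUERY_STRING} \.html?$
RewriteRule . - [G,L]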
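
Before deploying, it may help to dry-run the rewrite rules above against a handful of URLs. The Python sketch below is only an approximation of the rule set, not a mod_rewrite emulator. It assumes the .htaccess file sits in the document root, and it replaces the %{REQUEST_FILENAME} -f check with a file_exists argument. The helper name is_gone and the last three sample URLs are hypothetical.

```python
import re
from urllib.parse import urlsplit


def is_gone(url, file_exists=False):
    """Hypothetical helper: True if the rule set above would answer 410 Gone.

    file_exists stands in for the %{REQUEST_FILENAME} -f check, which
    needs the real filesystem.
    """
    parts = urlsplit(url)
    uri, query = parts.path, parts.query
    # In .htaccess, RewriteRule sees the path without its leading slash.
    rel = uri[1:] if uri.startswith("/") else uri

    # Rule 1: digit in the path, unless one of the exclusions applies.
    excluded = (
        re.search(r"^/?wp-content/", uri)
        or re.search(r"pagespeed|fontawesome|webfont", uri)
        or file_exists
    )
    if re.search(r"[0-9]", rel) and not excluded:
        return True
    # Rule 2: path ends in .htm or .html.
    if re.search(r"\.html?$", rel):
        return True
    # Rules 3 and 4 use "RewriteRule .", which needs a non-empty path.
    if rel:
        if not re.search(r"v(er)?=", query) and re.search(r"[0-9]", query):
            return True
        if re.search(r"\.html?$", query):
            return True
    return False


samples = [
    "https://example.com/sfumato.php?nzlw-21833vetidm4",      # from the question
    "https://example.com/admv-15974mitem/318",                # from the question
    "https://example.com/about",                              # hypothetical real page
    "https://example.com/wp-content/themes/demo/style2.css",  # hypothetical asset
    "https://example.com/style.css?ver=5.8",                  # hypothetical versioned asset
]
for url in samples:
    print("410 " if is_gone(url) else "pass", url)
```

Real behavior also depends on any other rules in the file and on the server configuration, so treat this as a sanity check on the patterns only.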

I wouldn't recommend using Disallow in robots.txt. Google sometimes indexes disallowed URLs even though it can't crawl them, and a crawler that is blocked from a URL never requests it, so it never sees the 410 status.
