Removing sitemap.xml & robots.txt from the SERPs

It can happen that your sitemap.xml or robots.txt file finds its way into Google's index. Run the query filetype:xml site:yourdomain.com to see which XML files from your domain are listed. Here is an example of some files indexed for the domain

It's probably not what you want, as it's basically just clutter in the SERPs. To fix this and remove these files from the SERPs, you can add a few lines to your .htaccess file so that Apache sends the proper X-Robots-Tag response header (this requires mod_headers to be enabled).

For your .htaccess file:

<FilesMatch "sitemap\.xml$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>

<FilesMatch "robots\.txt$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
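Once the rules are in place, it's worth confirming the header is actually being sent. Here is a minimal self-contained Python sketch: the local test server stands in for your site and mirrors the .htaccess rules above (in practice you would simply run `curl -I https://yourdomain.com/sitemap.xml` and look for the X-Robots-Tag line).

```python
import http.server
import threading
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    def do_HEAD(self):
        self.send_response(200)
        # Mirror the .htaccess rules: noindex for sitemap.xml and robots.txt
        if self.path in ("/sitemap.xml", "/robots.txt"):
            self.send_header("X-Robots-Tag", "noindex")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

# Bind to an ephemeral port so the sketch runs anywhere
server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

def robots_header(path):
    """Return the X-Robots-Tag header for a path, or None if absent."""
    req = urllib.request.Request(f"http://127.0.0.1:{port}{path}", method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("X-Robots-Tag")

print(robots_header("/sitemap.xml"))  # noindex
```

A regular page should come back without the header, so only the two files you targeted are kept out of the index.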

Block Word documents from being indexed

This method can also be used to remove all Word documents, or similar file types, from the index.

<FilesMatch "\.doc$">
  Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>
For checking MIME headers, cache control, and file types, there is a handy tool; its code is open source, so you can run your own copy on your server.

Update May 2014

Issues like this still seem to occur with Google. Here is an example where searching for a company name returned a sitelink pointing to the robots.txt. Sitelinks are a useful tool and should help customers get where they need to go quickly. This company could simply stop this page appearing in the sitelinks via the Google Webmaster Tools control panel.



Thanks to Carlo Zottmann, JohnMu, & Paul Cawley for giving me some pointers on this. 🙂

