Whitelisting in your robots.txt file is simple to do and easily turns away hundreds of polite, well-behaved crawlers that actually read and obey the robots.txt directives.

Blacklist vs. Whitelist: One Works, One Doesn’t

Most people use robots.txt as a blacklist, which means you have to list every bot you don’t want on the site. Blacklisting is futile because there are literally thousands of bots, and dozens, if not hundreds, of new ones you don’t know about come online daily. The only logical way to control your content distribution is to whitelist, unless you actually like spending the day chasing down every new bot name that appears in your log files. A blacklist sketch follows the list below to show the difference.

Let’s make it simple:

  • WHITELISTING: Only allowed bots crawl the site
  • BLACKLISTING: Everything not listed crawls the site
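
To see the difference in practice, here is what a blacklist-style robots.txt has to look like. The bot names below are made up for illustration; a real blacklist needs an entry for every unwanted bot, and new ones appear daily:

# blacklist approach: you must name every bad bot yourself
User-agent: BadBot
Disallow: /

User-agent: EvilScraper
Disallow: /

# ...and so on, for every new bot that appears, forever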

Sample Robots.txt File:

Construct a robots.txt file as follows:

# allowed bots
User-agent: bingbot
User-agent: Googlebot
Disallow:
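# an empty Disallow means these bots may crawl everything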

# tell everyone else to go away
User-agent: *
Disallow: /

Make sure the allowed list includes every bot you need to crawl your site, because anything not on it will go away. It won’t go away mad and never return; it will simply stay away until it decides to check your robots.txt file again. Some bots that are disallowed go bat crap crazy, asking for robots.txt 20+ times in a single day, but many do honor the directives not to crawl the site.

Robots.txt Tells Spiders To Go Away Nicely

While robots.txt is just a polite way of asking bots not to crawl your site, it has no teeth, and bots can, and sometimes do, completely ignore it. Some people ask why bother with robots.txt at all since it can’t be enforced (not true, just a myth spread by less technical webmasters), but it’s the proper way to handle well-behaved bots. Be aware, though, that some bad bots read robots.txt to see which bots you allow and then assume those bot names to attempt to gain access. So if you see a bot misbehaving, you cannot judge it on the user-agent name alone; you need to check the IP address to make sure it really is who it claims to be. A full treatment is beyond this brief article, but the sketch below shows the basic idea.
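
As a taste of that IP check, here is a minimal sketch in Python of the reverse/forward DNS verification that Google documents for confirming Googlebot: reverse-resolve the requesting IP, check the hostname’s domain, then forward-resolve the hostname to confirm it maps back to the same IP. The sample IP is for illustration only.

import socket

def is_real_googlebot(ip):
    """Return True if ip passes the reverse/forward DNS check for Googlebot."""
    try:
        # reverse lookup, e.g. '66.249.66.1' -> 'crawl-66-249-66-1.googlebot.com'
        hostname = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # forward-confirm: the hostname must resolve back to the original IP,
        # otherwise a faked PTR record would fool the check
        forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except socket.gaierror:
        return False
    return ip in forward_ips

# hypothetical usage with an IP pulled from your access log
print(is_real_googlebot("66.249.66.1"))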

How To Enforce Robots.txt

To enforce the robots.txt file there are some surprisingly simple methods you can use, such as the .htaccess file, which we’ll discuss in a future post. Another option is installing a PHP robots.txt class library normally used by web crawlers and using it in reverse to validate access by the requesting user agent. Both advanced robots.txt enforcement topics will be discussed in detail in future posts, so stay tuned!
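
In the meantime, here is a minimal sketch of the .htaccess approach, assuming Apache with mod_rewrite enabled. It reuses the whitelist from the sample robots.txt above: anything that self-identifies as a bot, crawler, or spider but is not on the allowed list gets refused outright.

RewriteEngine On
# anything that self-identifies as a bot, crawler, or spider...
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider) [NC]
# ...that is not one of the whitelisted bots...
RewriteCond %{HTTP_USER_AGENT} !(bingbot|Googlebot) [NC]
# ...is refused with a 403 Forbidden
RewriteRule .* - [F]

Note that this only catches bots honest enough to identify themselves; combine it with the IP check above for the ones that are not.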

Content Control Part 2: Coming Soon