A PHP Rogue Robot Trap

Some user agents wilfully disregard the access rules laid down in the robots.txt file. Typically these are agents that are up to no good: email address collectors, ruthless website downloaders, and so on.

This PHP script, in conjunction with the robots.txt file, is designed to identify such agents and ban them from the site using the .htaccess file.

The basic mechanism is to include a line in the robots.txt file which disallows access to a special directory, say robottrap:

User-agent: *
Disallow: /robottrap/

For some rogue robots, the existence of a forbidden directory is reason enough to visit it. For others, you may need to tempt them with a link on your index page which is invisible to humans, such as:

<a href="/robottrap/robottrap.php"></a>

The robottrap.php script has three functions:

  1. It records the date and time of the hit, the name of the user agent, the originating IP address, and the domain name in a file. This file may be listed using the supplied transaction robotreport.php.
  2. It sends the same details by email to the given email address.
  3. It updates the .htaccess file to ban that IP address from the website (an example of the kind of rule appended is shown below).
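For illustration, the ban in step 3 could take the form of Deny directives appended to the .htaccess file. This is a sketch assuming Apache 2.2-style access control; the IP addresses shown are placeholders, and the exact directives written by the supplied script may differ:

Order allow,deny
Allow from all
Deny from 203.0.113.45
Deny from 198.51.100.7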

There are two variables which need to be set to tailor the script to local requirements (a sketch showing how they fit together follows this list):

  1. The name and location of the file where details of the rogue robots are to be stored. Note that this file must have read/write privileges for all users.
  2. The email address to which the details of the rogue agents should be sent.
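As a minimal sketch, robottrap.php might pull the two variables and the three functions together as follows. The variable names, file path, and email address here are illustrative assumptions, not the settings of the supplied script:

<?php
// Sketch only - names and paths are illustrative assumptions.
$robot_file = '/home/site/data/robots.dat';  // variable 1: log file, world read/write
$notify_to  = 'webmaster@example.com';       // variable 2: where alerts are sent

$ip    = $_SERVER['REMOTE_ADDR'];
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'unknown';
$host  = gethostbyaddr($ip);                 // resolve the originating domain name
$entry = date('Y-m-d H:i:s') . "\t$agent\t$ip\t$host\n";

// 1. Record the date/time, user agent, IP address and domain name in the file.
file_put_contents($robot_file, $entry, FILE_APPEND | LOCK_EX);

// 2. Send the same details by email.
mail($notify_to, 'Rogue robot trapped', $entry);

// 3. Ban the IP address by appending a rule to the site's .htaccess file.
file_put_contents(dirname(__FILE__) . '/../.htaccess',
    "Deny from $ip\n", FILE_APPEND | LOCK_EX);
?>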

Full instructions on how to set up the rogue robot trap are included in the header comments of the transaction.

Finally, there is a web transaction, robots.php, that displays a full list of the robots that have fallen into your trap (a sketch of such a report follows). Note that some so-called "web accelerators", which simply pre-fetch every link on a page at a cost to your bandwidth, will also be caught by the trap.
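A report of this kind could be as simple as reading the log file back and printing it as an HTML table. This sketch reuses the hypothetical $robot_file path from above rather than the supplied script's actual settings:

<?php
// Sketch of a report page; $robot_file is the same hypothetical path as above.
$robot_file = '/home/site/data/robots.dat';

echo "<table>\n<tr><th>Date/time</th><th>User agent</th><th>IP address</th><th>Domain</th></tr>\n";
foreach (file($robot_file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    // Each log entry is one tab-separated line: date, agent, IP, host.
    $cells = array_map('htmlspecialchars', explode("\t", $line));
    echo '<tr><td>' . implode('</td><td>', $cells) . "</td></tr>\n";
}
echo "</table>\n";
?>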

Arie Slob has kindly pointed out that there should be an interval between uploading the new robots.txt file and setting the trap, to ensure that legitimate robots are not still using an old cached version of the file. Twenty-four hours should suffice.

Download Zip File (7,763 bytes).