
Indexing Search Engine Spider and the Robots.txt

When a search engine spider (crawler) comes to your site, it first looks for a special file called robots.txt. That file tells the spider which Web pages of your site should be indexed and which should be ignored.

The robots.txt file is a simple text file (no HTML) that must be placed in your root directory, for example:

http://www.yourwebsite.com/robots.txt
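Because the file always lives at the root, its address can be worked out from any page on the same site. Here is a minimal Python sketch of that idea. The function name robots_txt_url is made up for this illustration, and the page address is just a sample path on the placeholder domain used above.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path with /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.yourwebsite.com/stuff/wacky.html"))
# prints http://www.yourwebsite.com/robots.txt
```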

User-Agent

The User-agent area names the robot or robots that the rules below it apply to, or uses a wildcard to address all robots.

Example 1:

User-agent: *

In this example, the wildcard * means all robots.

Example 2:

User-agent: googlebot

In this example, the rules that follow apply only to Google’s robot.

Disallow

The next area of the robots.txt file is the Disallow area. In this
area you can exclude a robot or robots from indexing your folders,
images, HTML pages, scripts or other files.

Example 1:

Disallow: /cgi-bin

In this example, every path that begins with /cgi-bin is excluded, which covers the cgi-bin folder and everything inside it.

Example 2:

Disallow: /query.html

In this example, only the query.html file is excluded.
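If you want to check how such Disallow lines behave, Python's standard urllib.robotparser module can read the rules and answer questions about individual paths. The sketch below is only an illustration: the robot name "examplebot", the helper name allowed and the test paths are all invented, and real search engines may interpret edge cases differently from this module.

```python
from urllib.robotparser import RobotFileParser

# The two Disallow examples from above, applied to all robots.
disallow_rules = RobotFileParser()
disallow_rules.parse([
    "User-agent: *",
    "Disallow: /cgi-bin",
    "Disallow: /query.html",
])


def allowed(path: str, agent: str = "examplebot") -> bool:
    """Return True if the (hypothetical) robot may fetch the path."""
    return disallow_rules.can_fetch(agent, path)


print(allowed("/cgi-bin/search.pl"))  # False: inside the cgi-bin folder
print(allowed("/query.html"))         # False: the excluded file
print(allowed("/index.html"))         # True: no rule matches
```

Note that this parser matches by prefix, so a path such as /cgi-bin.html would be excluded by the first rule as well. Writing the rule with a trailing slash (Disallow: /cgi-bin/) limits it to the folder.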

Exclude all bots in Robots.txt file

If you do not want any part of your site to be indexed by any search engine, put this in your robots.txt file:

User-agent: *
Disallow: /

If you want to exclude all robots from a certain directory on your website, your robots.txt file would look like this:

User-agent: *
Disallow: /images/
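To picture what a well-behaved crawler does with this rule, here is a rough Python sketch that filters a list of candidate addresses before fetching anything. The robot name "examplebot" and the page addresses are invented for the example, and urllib.robotparser stands in for whatever logic a real search engine uses.

```python
from urllib.robotparser import RobotFileParser

# The directory rule from above.
image_rules = RobotFileParser()
image_rules.parse([
    "User-agent: *",
    "Disallow: /images/",
])

# Hypothetical pages a crawler has discovered on the site.
candidates = [
    "http://www.yourwebsite.com/index.html",
    "http://www.yourwebsite.com/images/logo.gif",
    "http://www.yourwebsite.com/images/photos/team.jpg",
    "http://www.yourwebsite.com/about.html",
]

# Keep only the addresses the rules allow "examplebot" to fetch.
fetchable = [url for url in candidates if image_rules.can_fetch("examplebot", url)]

for url in fetchable:
    print(url)
# prints the index.html and about.html addresses only
```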

If you want to exclude all robots from indexing a certain file in a certain directory, the robots.txt file would look like this:

User-agent: *
Disallow: /stuff/wacky.html

If you would like to keep a specific search engine robot from
indexing a specific file, the robots.txt file would look like this:

User-agent: googlebot
Disallow: /stuff/wacky.html
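A quick way to convince yourself that this rule affects only the named robot is to test it against two different robot names. The sketch below again uses Python's urllib.robotparser as a stand-in; "otherbot" is an invented name for any robot that is not mentioned in the file.

```python
from urllib.robotparser import RobotFileParser

# The single-robot rule from above.
google_only = RobotFileParser()
google_only.parse([
    "User-agent: googlebot",
    "Disallow: /stuff/wacky.html",
])

# The named robot is kept away from the file...
print(google_only.can_fetch("googlebot", "/stuff/wacky.html"))  # False
# ...but may still fetch everything else.
print(google_only.can_fetch("googlebot", "/stuff/other.html"))  # True
# A robot that is not named in the file is not restricted at all.
print(google_only.can_fetch("otherbot", "/stuff/wacky.html"))   # True
```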

If you want to see more complex examples, view the robots.txt files of big Web sites by adding /robots.txt to the end of their addresses:

http://www.cnn.com/
http://www.nytimes.com/
http://www.google.com/
http://www.ebay.com/
