
Robots.txt explained

Robots.txt explained in a nutshell:

Robots.txt is a useful tool for limiting robot access to your site, for example to keep unfinished content and sensitive information out of search results. At this time the Robots exclusion standard working group does not endorse an Allow directive that would encourage robots to index portions of a site; getting pages indexed is still driven by linking to them internally and externally on your web site.

You may have noticed a reference to a file named 'robots.txt' when reviewing your website usage statistics. This is a simple configuration file that robots (search engine spiders) read to learn which files and directories they should not index.

The robots.txt file must be placed in the root of your website (http://www.yourdomain.com/robots.txt), or robots will not find it. Its basic use is to tell robots which files and directories they should not index.

There are two components to the syntax of a robots.txt file: the User-agent line and the Disallow line.

The syntax is as follows:

User-agent
A list of the robots visiting your site can be found in your website's stats logs. A listing of known User-agents is available at http://www.robotstxt.org/wc/active.html

A specific robot may be referenced in the User-agent syntax.

User-agent: googlebot

A wildcard may also be used to match all robots.

User-agent: *

Disallow
This section of the syntax states which files and directories should not be indexed.

If you have a page named 'addresses.html' in the root of your site and you do not want robots to index it, add the following line after the User-agent line:

Disallow: /addresses.html

You may also restrict robots at the directory level. If the directory '/billing' on your website should not be indexed, place the following line after the User-agent line.

Disallow: /billing/

A single forward slash may also be used to cover every file on the site.

Disallow: /

How do I make comments in my robots.txt file?
Comments may be placed in the robots.txt file by placing '#' before any text. An example is:

# specifies googlebot as the User-agent.
User-agent: googlebot

It is good practice to keep your comments on a separate line from the directive, and to avoid white space at the beginning of directive lines.
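For instance, a short record written with the comment on its own line (the '/billing/' directory is reused from the earlier examples purely for illustration) would look like this:

# Keep all robots out of the billing directory
User-agent: *
Disallow: /billing/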


Putting robots.txt to use:
Now that we know how robots.txt works, let's put it into practice.

If you want to allow ALL robots to visit ALL the pages on your website, use the "*" wildcard as the User-agent and leave the Disallow value empty in your robots.txt file.

User-agent: *
Disallow:

To disallow all indexing of your site, enter the following text in your robots.txt file.

User-agent: *
Disallow: /

To disallow robots from indexing pages within a directory named 'billing':

User-agent: *
Disallow: /billing/

To disallow robots from indexing pages within directories named 'billing' and 'address':

User-agent: *
Disallow: /billing/
Disallow: /address/

To disallow googlebot from your entire site:

User-agent: googlebot
Disallow: /

To disallow googlebot from accessing your contact page (contact-us.html):

User-agent: googlebot
Disallow: /contact-us.html

To disallow googlebot from accessing your contact page located in a subdirectory (about-us/contact-us.html):

User-agent: googlebot
Disallow: /about-us/contact-us.html
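You can also combine several of these patterns in one file by writing one record per User-agent, separated by a blank line. A robot follows the record that names it and ignores the others. A sketch reusing the paths from the examples above:

# Block googlebot from the contact page only
User-agent: googlebot
Disallow: /contact-us.html

# Block every other robot from the billing directory
User-agent: *
Disallow: /billing/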

Limiting crawlers

If you have a lot of links on your site (for example, if you run forums), chances are that Yahoo's bot will crawl them faster than you would like and use up a lot of bandwidth and other resources. To prevent that, add these lines to your robots.txt file:

User-agent: Slurp
Crawl-delay: 60

This tells Yahoo's Slurp crawler to fetch no more than one page every 60 seconds.
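Note that Crawl-delay is a non-standard extension; robots that do not recognize it simply ignore the line. A combined sketch that slows Slurp down while leaving all other robots unrestricted (the empty Disallow lines mean nothing is blocked):

# Ask Yahoo's Slurp to wait 60 seconds between requests
User-agent: Slurp
Disallow:
Crawl-delay: 60

# All other robots may crawl everything at their own pace
User-agent: *
Disallow: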

