A Brief Overview of the Robots Exclusion Standard and Guidelines to Create a Robots.txt File

Robots.txt File

Recently, one of my friends emailed me a query: he didn't understand the concept of the robots exclusion standard (also called the robots exclusion protocol), and he wanted information about how to create a robots.txt file. This is just one example, but there are many SEO professionals who don't really know about the robots exclusion standard or the robots.txt file.

So, considering all the queries about the robots exclusion standard and the robots.txt file, in this article Seogdk sorts out the essentials of the robots exclusion protocol and gives guidelines for creating a robots.txt file.

First of all, don't be confused by the terms robots exclusion standard, robots exclusion protocol, and robots.txt file, because all three refer to the same thing: a set of guidelines for keeping crawlers in line. The robots.txt file is simply the file used to tell robots and crawlers what not to crawl on your website, and it is the actual component you will work with. It is a plain text document placed in the root of your domain, and it essentially contains directions telling any crawler that visits your website what it is and is not allowed to index.

Every search engine has its own crawler with a specific name; if you want to see a crawler's name, just check your web server log and you will probably find it there. Below is a list of different search engines and their crawler names, followed by an example of how a name from the list is used:

- Google – Googlebot
- Bing – Bingbot
- Yahoo Search – Yahoo! Slurp
- MSN – Msnbot
- Baidu – Baiduspider
- Yandex – YandexBot
- Alexa – ia_archiver
- Ask – Teoma
- Searchsight – SearchSight
- AltaVista – Scooter
- Guruji – GurujiBot
- Goo – Ichiro
- LookSmart – FurlBot
- FyberSearch – FyberSpider
- SiteSell – SBIder
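
These names are what you put on the User-agent line when you want to address one crawler in particular. As a minimal sketch (the directory name here is purely illustrative), a record aimed only at Google's crawler would look like this:

User-agent: Googlebot
Disallow: /example-directory/

Every other crawler simply skips this record because the User-agent name does not match it.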

Guidelines to Create Robots.txt File

1. To communicate with a crawler, you need a particular syntax that it can understand. The basic form of the syntax is shown below:

User-agent: *
Disallow: /

Both of the above lines are mandatory when you create a robots.txt file.

2. The first line, User-agent:, tells a crawler which user agent you are addressing. The asterisk (*) denotes that all crawlers are covered, but you can specify a single crawler or even multiple crawlers, as sketched below.
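
For instance, a minimal sketch that gives the same rule to two named crawlers (the directory name is purely illustrative) stacks several User-agent lines above a shared Disallow line:

User-agent: Googlebot
User-agent: Bingbot
Disallow: /example-directory/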

3. The second line, Disallow:, tells the crawler what it is not allowed to access. The slash (/) denotes the root of the site, which covers all directories. So the previous code example is essentially saying that all crawlers are to ignore all directories.
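
By contrast, leaving the Disallow value empty blocks nothing; a minimal sketch of a record that lets every crawler access everything looks like this:

User-agent: *
Disallow: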

4. When you create a robots.txt file, always remember to include a colon (:) after the User-agent indicator and after the Disallow indicator. The colon signals that the value the crawler should pay attention to follows.


5. If you want all crawlers to ignore a specific directory, you simply mention that directory name, as below:

User-agent: *
Disallow: /private/

You can also take it one step further and tell all crawlers to ignore multiple directories or individual files, as below:

User-agent: *
Disallow: /private/
Disallow: /public/
Disallow: /program/links.html

This tells all crawlers to ignore the private directory, the public directory, and the links.html file inside the program directory.

6. One thing to always keep in mind about crawlers is that they read the robots.txt file from top to bottom, and as soon as they find a rule that applies to them, they stop reading and begin crawling your website. So be careful about the order of your rules when you are commanding multiple crawlers with your robots.txt file.

7. The format below is a completely wrong way to write a robots.txt file:

User-agent: *
Disallow: /private/

User-agent: CrawlerName
Disallow: /private/
Disallow: /program/links.html

This text first tells all crawlers to ignore the private directory, so every crawler reading the file will match that rule. But you have also told a particular crawler, denoted by CrawlerName, to ignore both the private directory and the links.html file in the program directory. The problem is that the specified crawler will never get that second message, because it has already matched the earlier record that applies to all crawlers and stopped reading.

8. When you want to command multiple crawlers, you need to begin by naming the specific crawlers you want to control. Only after they have been named should you leave your instructions for all crawlers. Written correctly, the previous code should look like this:

User-agent: CrawlerName
Disallow: /private/
Disallow: /program/links.html

User-agent: *
Disallow: /private/

9. You can view the robots.txt file for any website that has one by appending /robots.txt to the base URL of the website. For example, yourwebsitename.com/robots.txt will display the text file guiding robots for that website.

10. Don't rely on a blank robots.txt file to keep crawlers away: crawlers treat an empty file (or a missing one) as permission to crawl everything. If you really want to keep compliant crawlers away from your entire website, you have to say so explicitly, as shown below.
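
To make the distinction concrete, the record below (the same syntax as in guideline 1, annotated with a # comment, which crawlers ignore) is what actually blocks compliant crawlers from an entire site; an empty file or an empty Disallow value does the opposite and allows everything:

# Block all compliant crawlers from the entire site
User-agent: *
Disallow: /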

Conclusion

With the above information, you can easily create a robots.txt file, and if you have certain pages or links that you want crawlers to ignore, you can achieve this without causing them to ignore the whole website. Additionally, you can find a complete list of robots along with the text of the robots exclusion standard document on the Web Robots Pages. So friends, convey your feedback about this article through your comments and emails; till then, enjoy your life!


Gangadhar Kulkarni

Gangadhar Kulkarni is an internet marketing expert and consultant with extensive experience in digital marketing. He is also the founder of Seogdk and Director at DigiTechMantra Solutions, a one-stop shop for all that your website needs, providing cost-effective and efficient content writing and digital marketing services.
