Brief about Robots Exclusion Standard and Guidelines to Create Robots.txt File

Robots.txt File
Recently one of my friends emailed me a query: he didn't understand the concept of the robots exclusion standard (also known as the robots exclusion protocol) and wanted information about how to create a robots.txt file. This is just one example, but there are many SEO people who really don't know about the robots exclusion standard or the robots.txt file.

So, considering all the queries about the robots exclusion standard and the robots.txt file, in this article Seogdk sorts out the information about the robots exclusion protocol and gives guidelines for creating a robots.txt file.

First of all, don't be confused by the terms robots exclusion standard, robots exclusion protocol, and robots.txt file, because all three refer to the same thing. Basically, these are the guidelines that keep crawlers in line. The robots.txt file is simply defined as the file that tells robots and crawlers what not to crawl on your website. The robots.txt file is the actual component you will work with. It is a text-based document that should be placed in the root of your domain, and it essentially contains directions to any crawler that comes to your website about what it is and is not allowed to index.

Every search engine has its own crawler with a specific name, and if you want to see a crawler's name, just check your web server log; you will probably find it there. Below is a list of different search engines with their crawler names (a short example of using one of these names follows the list):

- Google – Googlebot
- Bing – Bingbot
- Yahoo Search – Yahoo! Slurp
- MSN – Msnbot
- Baidu – Baiduspider
- Yandex – Yandexbot
- Alexa – ia_archiver
- Ask – Teoma
- Searchsight – SearchSight
- AltaVista – Scooter
- Guruji – GurujiBot
- Goo – ichiro
- LookSmart – FurlBot
- FyberSearch – FyberSpider
- SiteSell – SBIder
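
These names go in the User-agent line when you want to address one crawler directly. Below is a minimal sketch that tells only Googlebot to skip a directory; the /testing/ directory name is purely a hypothetical placeholder, not something from your own site:

# Hypothetical example: /testing/ is a placeholder directory name
User-agent: Googlebot
Disallow: /testing/

Any crawler other than Googlebot ignores this record, so the rest of your site remains open to them unless you add further rules.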

Guidelines to Create Robots.txt File
Robots Exclusion Protocol
1. To communicate with a crawler you need a particular syntax that it can understand. The basic form of that syntax is shown below:

User-agent: *
Disallow: /

Both of the above lines are mandatory when you create a robots.txt file.

2. The first line, User-agent:, tells a crawler which user agent you are commanding. The asterisk (*) denotes that all crawlers are covered, but you can specify a single crawler or even several crawlers, as shown in the sketch below.
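
The standard also lets you stack multiple User-agent lines above a shared set of Disallow lines when the same rules apply to several crawlers. Here is a minimal sketch, assuming the crawler names from the list above; the /reports/ directory is only a hypothetical placeholder:

# Hypothetical example: /reports/ is a placeholder directory name
User-agent: Googlebot
User-agent: Bingbot
Disallow: /reports/

Both named crawlers are expected to skip that directory, while every other crawler is unaffected by this record.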

3. The second line, Disallow:, tells the crawler what it is not allowed to access. The slash (/) denotes "all directories," so the previous code example essentially says that "all crawlers are to ignore all directories."

4. When you create a robots.txt file, always remember to include the colon (:) after the User-agent indicator and after the Disallow indicator. The colon denotes that important information follows, to which the crawler should pay attention.

5. If you want all crawlers to ignore a specific directory, you simply mention that particular directory name, as below:

User-agent: *
Disallow: /private/

You can also take this one step further and tell all crawlers to ignore multiple directories, as below:

User-agent: *
Disallow: /private/
Disallow: /public/
Disallow: /program/links.html

This text tells the crawler to ignore the private directory, the public directory, and the links.html page inside the program directory, so none of them will be accessed by the crawler.

6. One thing to always keep in mind about crawlers is that they read the robots.txt file from top to bottom, and as soon as they find a record that applies to them, they stop reading and begin crawling your website. So be careful about the order of your records when you are commanding multiple crawlers with your robots.txt file.

7. The text below is the wrong way to write a robots.txt file:

User-agent: *
Disallow: /private/

User-agent: CrawlerName
Disallow: /private/
Disallow: /program/links.html

First, this text tells all crawlers to ignore the 'private' directory, so every crawler reading the file will automatically ignore the 'private' files. But you have also told a particular crawler, denoted by 'CrawlerName', to disallow both the 'private' directory and the 'links.html' page in the 'program' directory. The problem is that the specified crawler will never get that message, because it has already read the record that applies to all crawlers and stopped at the instruction to ignore the 'private' directory.

8. When you want to command multiple crawlers, you need to begin by naming the crawlers you want to control. Only after they have been named should you leave your instructions for all crawlers. Written correctly, the previous code should look like this:

User-agent: CrawlerName
Disallow: /private/
Disallow: /program/links.html

User-agent: *
Disallow: /private/

9. You can view the robots.txt file of any website that has one by adding /robots.txt to the base URL of the website. For example, www.websitename.com/robots.txt will display a page that shows you the text file guiding robots for that website.

10. If you use a blank robots.txt file, crawlers assume that an empty file means there are no restrictions, and they will crawl your entire website. If you want to keep your whole site out of search engine results, you must say so explicitly with User-agent: * and Disallow: /, as in the first example above; a blank file will not do it. The sketch below contrasts the two cases.
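
To make the difference concrete, here is a minimal sketch of two separate robots.txt files showing the opposite cases; an empty Disallow value means nothing is disallowed, while a slash blocks the whole site:

# File 1: allow everything (same effect as an empty or missing robots.txt)
User-agent: *
Disallow:

# File 2: block everything
User-agent: *
Disallow: /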

Conclusion

From the above information you can easily create a robots.txt file, and if you have certain pages or links that you want crawlers to ignore, you can achieve this without making the crawler ignore the whole website. Additionally, you can find a complete list of robots along with the text of the robots exclusion standard document on the Web Robots Pages. So, friends, convey your feedback about this article through your comments and emails; till then, enjoy your life.....!!!

-----------------------------------------------------------------------------------------------------------------------------
Author Bio
-----------------------------------------------------------------------------------------------------------------------------

Gangadhar Kulkarni is an Internet Marketing Professional with extensive experience in SEO and SMO. He is also the founder of seogdk, where he shares information about SEO, SMO, SEM, blogging, and web technologies through articles. For more information, catch him on Facebook | Twitter | LinkedIn | G+ | Pinterest | TSU




-----------------------------------------------------------------------------------------------------------------------------
