A simple question comes up again and again: what is robots.txt? As the ".txt" extension suggests, it is a plain text file. The real question is why we should use this file, and how it can be a secret weapon. First, it is a file used to instruct the search engine spiders, or robots, exactly what to crawl for indexing in a search engine database and what to leave out. Second, it is a secret weapon because with this robots.txt file you can keep your private and sensitive files from being exposed to the world audience.
Now comes the question of how this robots.txt file works and what the most important protections it offers are. Let us look at things in a little more detail:
Many a time some work on a site is still unfinished, yet the site has to go live. With a specific directive we can stop crawlers from crawling the site in the meantime.
The command is
User-agent: *
Disallow: /
Here the (*) denotes all user agents, or crawlers, and the (/) refers to all directories. So it is clear that all robots are disallowed from crawling all directories.
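Conversely, an empty Disallow value blocks nothing at all, so as a point of comparison, a minimal file that lets every robot crawl everything looks like:
User-agent: *
Disallow: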
Sometimes we find that a few directories are too important and confidential to us, and we certainly will not let those be visible to others. Suppose the admin panel has to be disallowed; then the command is going to be:
User-agent: *
Disallow: /admin/
Here the (/admin/) tells the robots that they should avoid crawling the admin pages of the site. This way any of our specific files can remain hidden from the eyes of common visitors.
Similarly, we may wish to deny crawling to some specific search engine spiders. At the same time, we may instruct a specific one to crawl a file. For the first case we have to write:
User-agent: Googlebot
Disallow: /
Here only Googlebot is denied permission to crawl any directory, and all other spiders will continue to crawl the directories. In the second case:
User-agent: Googlebot
Allow: /about_us.html
User-agent: *
Disallow: /about_us.html
In this case all the other crawlers will not crawl the (about_us.html) page, but Googlebot will index it in its database.

With specific Disallow instructions in the robots.txt file, we may also cut down on bandwidth and save money. We know bandwidth is limited, and if we exceed it the site goes down or becomes invisible; to get extra bandwidth we need to buy it from the service providers. With a little effort we can save some of that bandwidth. If our site is large and complex and contains too many images, JavaScript files and long cascading style sheets, the crawlers take time to crawl them, and every time something new is added, the crawlers go on reading them again and again. It is better to block the crawlers from crawling those images and save the bandwidth. For example:
User-agent: *
Disallow: /images/
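Script and style files can be blocked the same way. As a sketch, assuming the scripts live in a /js/ directory and the style sheets in a /css/ directory (the actual folder names vary from site to site):
User-agent: *
Disallow: /js/
Disallow: /css/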
It is found that sometimes we need to have print versions of the web pages, and naturally those pages are duplicates of the main content. Here a little carelessness may lead to our site being banned. If we do not disallow the print versions of the web pages in the robots.txt file, the crawlers will index them in the search engine database, and the search engines are going to treat them as an attempt at spamming with duplicate content. So avoid a ban by disallowing crawling of the print versions.
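For example, assuming the print-friendly copies are kept in a directory named /print/ (a hypothetical path; it will differ from site to site), the rule would be:
User-agent: *
Disallow: /print/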
However, in the discussion above I have tried to hint at just a few main uses of the robots.txt file. There are other directives that can really come in handy to protect our documents or private files. But here we have to be a little careful about where we place the robots.txt file, because the robots look for this instruction file, namely robots.txt, only in the main root. Placing the file in a directory or subdirectory never works; doing it that way simply lets the spiders crawl all the files. For example:
If, out of a little ignorance, you keep the robots.txt file like this: http://www.infowaylive.com/web-development-portfolio/robots.txt, the robots crawl everything, whatever confidential data is there. So you need to place the robots.txt file under the root: http://www.infowaylive.com/robots.txt, and the robots will find their instructions.
One important thing to include in the robots.txt file is the sitemap of the website. It helps the robots crawl the entire site easily, so the site gets properly indexed in the search engines.
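The Sitemap directive takes the full URL of the sitemap file. Assuming the sitemap sits at the site root under the common name sitemap.xml (the name and location may differ on your site), the line looks like:
Sitemap: http://www.infowaylive.com/sitemap.xml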
So we can say the robots.txt file is very important for a website and should be in place at the time of going live; without this file the site is handicapped in the eyes of the search engines. But last, and certainly not least: try not to leave your confidential files without login security or a programmed security check, because some spamming robots do not follow the instructions in the robots.txt file, and as a result your confidentiality is lost. Still, as I said before, this file is a weapon that serves well to make a website properly visible in the good search engines. Careful use of it can surely bring good results, such as a good position in the search engine result pages; the spiders are the medium through which sites communicate with the search engines, so we must lead the robots with proper guidance.
So now it is clear that robots.txt plays a major role in making a website successful after its development. But it is equally true that developers must know the proper use of the file. And to get the best result from the robots.txt file, you have to choose experienced and knowledgeable developers. Infoway, the premium PHP developer in India, is always with you to help in this regard. So do not worry, and feel free to contact us if you are having any problem with the crawlability of your website.