The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots.
Applicability. The guidelines set forth in this document are followed by all automated crawlers at Google. When an agent accesses URLs on behalf of a user (for example, for translation, manually subscribed feeds, or malware analysis), these guidelines do not apply.
The robots.txt file is one of the main ways of telling a search engine where it can and can't go on your website. All major search engines support the basic functionality it offers, and some of them also respond to extra rules that can be useful.
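As an illustration, a minimal robots.txt might look like the following (the /private/ path and the sitemap URL are hypothetical placeholders, not rules from any real site):

```
User-agent: *
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
```

The `User-agent: *` group applies to any crawler that does not have a more specific group of its own; everything not explicitly disallowed is considered crawlable.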
Back to robots.txt. If you have a robots.txt file, it needs to live in your site's root directory. If you're not used to poking around in source code, it can be a little difficult to locate the editable version of your robots.txt file.
For example, an excerpt from Baidu's own robots.txt file reads:

User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
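Python's standard library can evaluate rules like these via `urllib.robotparser`. Below is a minimal sketch that parses the Baiduspider/Googlebot rules above from a string rather than fetching them over the network; the example.com URLs are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Rules corresponding to the robots.txt excerpt above.
rules = """\
User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?

User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())  # parse() accepts an iterable of lines

# Ask whether a given user agent may fetch a given URL.
print(rp.can_fetch("Googlebot", "https://example.com/baidu"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/about"))  # True
```

In a real crawler you would typically call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` instead of `parse()`, so the live file is fetched and parsed in one step.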
The robots.txt file. The robots.txt file is a simple text file used to inform Googlebot which areas of a domain the search engine's crawler may and may not crawl.
Webmaster tools available for Yahoo Search. You can manage how your website appears in Yahoo Search by using meta tags and robots.txt. Yahoo Search results come from the Yahoo web crawler (Slurp) and Bing's web crawler.
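Whereas robots.txt controls crawling site-wide, a robots meta tag controls indexing at the level of a single page. A hedged, generic example (not specific to Yahoo) of such a tag in a page's head:

```
<!-- Asks compliant crawlers not to index this page or follow its links -->
<meta name="robots" content="noindex, nofollow">
```

A common pitfall is combining the two: if robots.txt blocks a page from being crawled, the crawler never sees the page's meta tag, so a `noindex` directive there cannot take effect.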