Search Engine Spiders Lost Without Guidance - Post This Sign!


Released on: 23 September 2006
Author: Cpxclick.com Inc
Audience: Internet related



To see the proper format for a fairly standard robots.txt file, look directly below. That file should be at the root of the domain, because that is where the crawlers expect it to be, not in some secondary directory.

Below is the proper format for a robots.txt file ----->

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: msnbot
Crawl-delay: 10

User-agent: Teoma
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: psbot
Disallow: /

--------> End of robots.txt file

This tiny text file is saved as a plain text document and ALWAYS with the name "robots.txt" in the root of your domain.

A quick review of the information in the robots.txt file above follows. The "User-agent: msnbot" is from MSN, Slurp is from Yahoo and Teoma is from AskJeeves. The others listed are "bad" bots that crawl very fast and to nobody's benefit but their own, so we ask them to stay out entirely. The * asterisk is a wildcard that means "all" crawlers, spiders and bots should stay out of the files or directories listed in that group.

The bots given the instruction "Disallow: /" should stay out entirely, and those given "Crawl-delay: 10" are the ones that crawled our site too quickly, causing it to bog down and overuse server resources. Google crawls more slowly than the others and doesn't require that instruction, so it is not specifically listed in the above robots.txt file. The Crawl-delay instruction is only needed on very large sites with hundreds or thousands of pages. The wildcard asterisk * applies to every crawler, bot and spider that does not have its own User-agent group, including Googlebot. A crawler that is named in its own group, such as msnbot above, normally follows only that group, so repeat the Disallow lines there if they should apply to it too.
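As a rough way to check how a parser reads these rules, the sketch below feeds a shortened copy of the file to Python's standard urllib.robotparser module. The host www.example.com and the page paths are placeholders, not anything from our site, and real crawlers may interpret the file somewhat differently from this library.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the robots.txt shown in this article.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: msnbot
Crawl-delay: 10

User-agent: aipbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

SITE = "http://www.example.com"  # placeholder host, an assumption for this sketch

# Googlebot has no group of its own, so it falls under the * group.
googlebot_images = parser.can_fetch("Googlebot", SITE + "/images/logo.png")
googlebot_home = parser.can_fetch("Googlebot", SITE + "/index.html")

# aipbot is told to stay out of everything.
aipbot_home = parser.can_fetch("aipbot", SITE + "/index.html")

# msnbot has its own group, which carries only a crawl delay,
# so this parser does not apply the * group's Disallow lines to it.
msnbot_images = parser.can_fetch("msnbot", SITE + "/images/logo.png")
msnbot_delay = parser.crawl_delay("msnbot")
googlebot_delay = parser.crawl_delay("Googlebot")

print(googlebot_images, googlebot_home, aipbot_home)
print(msnbot_images, msnbot_delay, googlebot_delay)
```

Running a quick check like this before uploading a changed file can catch a rule that blocks more, or less, than intended.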

Those we gave the "Crawl-delay: 10" instruction were requesting as many as 7 pages every second, so we asked them to slow down. The number is in seconds, and you can change it to suit your server capacity, based on their crawling rate. Ten seconds between page requests is far more leisurely and stops them from asking for more pages than your server can dish up.
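To put those two rates side by side, here is a small back-of-the-envelope calculation. It assumes, purely for illustration, that a bot keeps up its rate around the clock; the function names are ours and not part of any tool.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def pages_per_day_at_rate(pages_per_second):
    """Pages requested in a day by a bot that never slows down."""
    return pages_per_second * SECONDS_PER_DAY

def pages_per_day_with_delay(crawl_delay_seconds):
    """Upper bound on daily pages for a bot that honors the crawl delay."""
    return SECONDS_PER_DAY // crawl_delay_seconds

unthrottled = pages_per_day_at_rate(7)       # 7 pages every second
throttled = pages_per_day_with_delay(10)     # one page every 10 seconds

print(unthrottled, throttled, unthrottled // throttled)
```

Under that assumption, the delay cuts the worst-case load from 604,800 requests a day to 8,640, a seventy-fold reduction.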

(You can discover how fast robots and spiders are crawling by looking at your raw server logs, which show pages requested by precise times to within a hundredth of a second. They are available from your web host, or ask your web or IT person. If you have server access, the logs can be found in the root directory, and you can usually download compressed server log files by calendar day right off your server. You'll need a utility that can expand compressed files to open and read those plain text raw server log files.)
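A minimal sketch of that log check follows. It assumes the common "combined" access log format, which records times only to the whole second, so your own logs may need a different pattern. The function name and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed "combined" log format: [timestamp] "request" status bytes "referer" "user agent"
LOG_LINE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def peak_requests_per_second(lines, bot_names):
    """Return the busiest one-second request count seen for each named bot."""
    per_second = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        agent = match.group("agent").lower()
        for bot in bot_names:
            if bot.lower() in agent:
                per_second[(bot, match.group("ts"))] += 1
    peaks = {}
    for (bot, _timestamp), count in per_second.items():
        peaks[bot] = max(peaks.get(bot, 0), count)
    return peaks

# Invented sample lines, not taken from a real log.
SAMPLE = [
    '1.2.3.4 - - [23/Sep/2006:10:15:32 -0400] "GET /a.html HTTP/1.1" 200 512 "-" "msnbot/1.0"',
    '1.2.3.4 - - [23/Sep/2006:10:15:32 -0400] "GET /b.html HTTP/1.1" 200 512 "-" "msnbot/1.0"',
    '1.2.3.4 - - [23/Sep/2006:10:15:32 -0400] "GET /c.html HTTP/1.1" 200 512 "-" "msnbot/1.0"',
    '1.2.3.4 - - [23/Sep/2006:10:15:33 -0400] "GET /d.html HTTP/1.1" 200 512 "-" "msnbot/1.0"',
    '5.6.7.8 - - [23/Sep/2006:10:15:40 -0400] "GET /a.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
]

peaks = peak_requests_per_second(SAMPLE, ["msnbot", "Googlebot"])
print(peaks)
```

For a compressed daily log, the lines can be supplied by opening the file with gzip.open(path, "rt") from the standard library instead of expanding it first.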

To see the contents of any robots.txt file, just type /robots.txt after any domain name. If the site has that file up, you will see it displayed as a text file in your web browser. Use the link below to see that file for Amazon.com:

http://www.Amazon.com/robots.txt

You can see the contents of any website robots.txt file that way.
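The same lookup can be scripted. The sketch below builds the robots.txt address from any page address on a site and, optionally, downloads it using Python's standard library. The helper names and the sample page path are our own inventions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen

def robots_url(page_url):
    """Return the robots.txt address at the root of the page's domain."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

def fetch_robots(page_url, timeout=10):
    """Download robots.txt for the site; raises an error if the request fails."""
    with urlopen(robots_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

# The page path here is made up; only the domain matters.
example = robots_url("http://www.amazon.com/some/page.html?item=1")
print(example)
```

Calling fetch_robots with any page address would then return the file's text, provided the site serves one and allows the request.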

The robots.txt shown above is what we currently use at Publish101 Web Content Distributor, which launched in May of 2005. We did an extensive case study and published a series of articles on crawler behavior and the indexing delays known as the Google Sandbox. That Google Sandbox Case Study is highly instructive on many levels for webmasters everywhere about the importance of this often ignored little text file.

One thing we didn't expect to glean from our research into indexing delays (known as the Google Sandbox) was the importance of robots.txt files to quick and efficient crawling by the spiders from the major search engines. Nor did we expect the number of heavy crawls from bots that do no earthly good to the site owner, yet crawl most sites extensively, straining servers to the breaking point with requests coming as fast as 7 pages per second.

We discovered in our launch of the new site that Google and Yahoo will crawl the site whether or not you use a robots.txt file, but MSN seems to REQUIRE it before they will begin crawling at all. All of the search engine robots seem to request the file on a regular basis to verify that it hasn't changed.

Then, when you DO change it, they will stop crawling for brief periods and repeatedly ask for that robots.txt file during that time without crawling any additional pages. (Perhaps they had a list of pages to visit that included the directory or files you have instructed them to stay out of, and must now adjust their crawling schedule to eliminate those files from their list.)

Most webmasters instruct the bots to stay out of "image" directories and the "cgi-bin" directory, as well as any directories containing private or proprietary files intended only for users of an intranet or password protected sections of your site. Clearly, you should direct the bots to stay out of any private areas that you don't want indexed by the search engines.

The importance of robots.txt is rarely discussed by average webmasters. I've even had webmasters at some of my client businesses ask me what it is and how to implement it when I tell them how important it is to both site security and efficient crawling by the search engines. This should be standard knowledge for webmasters at substantial companies, but it illustrates how little attention is paid to the use of robots.txt.

The search engine spiders really do want your guidance, and this tiny text file is the best way to give crawlers and bots a clear signpost: one that warns off trespassers and protects private property, and that warmly welcomes invited guests, such as the big three search engines, while asking them nicely to stay out of private areas.


Source: Express-Press-Release.com