60 Day Sandbox for Google & AskJeeves; MSN Indexes Quickest, Yahoo Next

Search engine listing delays that have come to be called the Google Sandbox effect are actually true in practice at each of the four top-tier search engines in one form or another. MSN, it seems, has the shortest indexing delay at 30 days. This article is the second in a series following the spiders through a brand new web site, beginning on May 11, 2005, when the site was first made live under a newly purchased domain name.

First Case Study Article

Previously we looked at the first 35 days and detailed the crawling behavior of Googlebot, Teoma, MSNbot and Slurp as they traversed the pages of this new site. We discovered that each robot spider displays distinctly different behavior in crawling frequency and similarly differing indexing patterns.

For reference, there are about 15 to 20 new pages added to the site daily, which are each linked from the home page for a day. Site structure is non-traditional, with no categories and a linking structure tied to author pages listing their articles, as well as a "related articles" index varied by linking to relevant pages containing similar content.

So let's review where we are with each spider crawling and look at pages crawled and compare pages indexed by engine.

The AskJeeves spider, Teoma, has crawled most of the pages on the site, yet indexes no pages 60 days later at this writing. This is clearly a site aging delay that's modeled on Google's Sandbox behavior. The Teoma spider from has crawled more pages on this site than any other engine over a 60 day period, but appears to be tired of crawling, as it has not returned since July 13 - its first break in 60 days.

In the first two days, Googlebot gobbled up 250 pages and didn't return until 60 days later, but has not indexed even a single page in the 60 days since that initial crawl. But Googlebot is showing a renewed interest in crawling the site since this crawling case study article was published on several high traffic sites. Now Googlebot is looking at a few pages each day. So far it has taken no more than about 20 pages, at a decidedly lackluster pace, a true "Crawl" that will keep it occupied for years if continued that slowly.

MSNbot crawled timidly for the first 45 days, looking over 30 to 50 pages daily, but not until it found a robots.txt file. We'd neglected to post that file to the site for a week, then bobbled the ball as we changed site structure, then failed to implement robots.txt in new subdomains until day 25 - and THEN MSNbot didn't return until day 30. If little else were discovered about initial crawls and indexing, we have seen that MSNbot relies heavily on that robots.txt file, and proper implementation of that file will speed crawling.

MSNbot is now crawling with enthusiasm at anywhere from 200 to 800 pages daily. As a matter of fact, we had to use a "crawl-delay" command in the robots.txt file after MSNbot began hitting 6 pages per second last week. The MSN index now shows 4905 pages 60 days into this experiment. Cached pages change weekly. MSNbot has apparently found that it likes how we changed the page structure to include a new feature which links to questions from several other article pages.

Slurp gets strangely inactive, then alternately hyperactive, for periods of time. The Yahoo crawler will look at 40 pages one day and then 4000 the next, then simply look at the home page for a few days, and then jump back in for 3000 pages the next day and back to only reviewing robots.txt for two days. Consistency is not a curse suffered by Slurp. Yahoo now shows 6 pages in their index; one is an error page and another is an "index/of" page, as we have not posted a home page to several subdomains. But Slurp has crawled easily 15,000 pages to date.

Lessons learned in the first 60 days on a new site follow:

1) Google crawls 250 pages on first discovery of links to the site. Then they don't return until they find more links, and crawl slowly. Google has failed to index the new domain for 60 days.

2) Yahoo looks for error pages, and once they find bad links they will crawl them ceaselessly until you tell them to stop. Then they won't crawl at all for weeks, until crawling heavily one day and lightly the next in random fashion.

3) MSNbot requires robots.txt files, and once they decide they like your site, may crawl too fast, requiring "crawl-delay" instructions in that robots.txt file. Implement immediately.

4) Bad bots can strain resources and hit too many pages too quickly until you tell them to stay out. We banned 3 bots outright after they slammed our servers for a day or two. We noted that "aipbot" crawled first, then "BecomeBot" came along, and then "Pbot" from crawled heavily looking for image files we don't have. Bad bots, stay out. Best to implement robots.txt exclusions for all but top engines if their crawlers strain your server resources. We considered excluding the Chinese search engine named when they began crawling heavily early on. We don't expect much traffic from China, but why exclude one billion people? Especially since Google is rumored to be considering a possible purchase of as entry to Chinese market.

The bottom line is that we've discovered all engines seem to delay indexing of new domain names for at least thirty days. Google so far has delayed indexing THIS new domain for 60 days since first crawling it. AskJeeves has crawled thousands of pages, while indexing none of them. MSN indexes faster than all other engines but requires a robots.txt file. Yahoo's Slurp crawls on again, off again for 60 days, but indexes only six of the total 15,000 or more pages crawled to date.

We seem to have settled that there is a clear indexing delay, but whether this site specifically is "Sandboxed" and whether delays apply universally is less clear. Many webmasters claim that they have been indexed fully within 30 days of first posting a new domain. We'd love to see others track spiders through new sites following launch and document their results publicly, so that indexing and crawling behavior are proven.