60 Day Sandbox for Google & AskJeeves; MSN Indexes Quickest, Yahoo Next

The search engine listing delays that have come to be called the Google Sandbox effect are real in practice, in one form or another, at each of the four top-tier search engines. MSN, it seems, has the shortest indexing delay at 30 days. This article is the second in a series following the spiders through a brand new web site, beginning on May 11, 2005, when the site first went live under a newly purchased domain name.

Previously, in the first case study article, we looked at the first 35 days and detailed the crawling behavior of Googlebot, Teoma, MSNbot and Slurp as they traversed the pages of this new site. We discovered that each spider displays distinctly different crawling frequency and similarly differing indexing patterns.

For reference, about 15 to 20 new pages are added to the site daily, and each is linked from the home page for a day. Site structure is non-traditional, with no categories and a linking structure tied to author pages listing their articles, as well as a "related articles" index that links to relevant pages containing similar content.

So let's review where we are with each spider, look at pages crawled and compare pages indexed by engine.

The AskJeeves spider, Teoma, has crawled most of the pages on the site, yet indexes no pages 60 days later at this writing. This is clearly a site aging delay that's modeled on Google's Sandbox behavior. The Teoma spider has crawled more pages on this site than any other engine over the 60 day period, yet it appears to be tired of crawling, as it has not returned since July 13, its first break in 60 days.

In the first two days, Googlebot gobbled up 250 pages and didn't return until 60 days later, and it has not indexed even a single page in the 60 days since that initial crawl. But Googlebot is showing a renewed interest in crawling the site since the first crawling case study article was published on several high traffic sites. Now Googlebot is looking at a few pages each day. So far that is no more than about 20 pages at a decidedly lackluster pace, a true "Crawl" that will keep it occupied for years if continued that slowly.

MSNbot crawled timidly for the first 45 days, looking over 30 to 50 pages daily, but only once it found a robots.txt file. We had neglected to post that file to the site for a week, then bobbled the ball as we changed site structure, then failed to implement robots.txt in new subdomains until day 25, and THEN MSNbot didn't return until day 30. If little else were discovered about initial crawls and indexing, we have seen that MSNbot relies heavily on that robots.txt file and that proper implementation of the file will speed crawling.

MSNbot is now crawling with enthusiasm at anywhere between 200 and 800 pages daily. As a matter of fact, we had to use a "crawl-delay" command in the robots.txt file after MSNbot began hitting 6 pages per second last week. The MSN index now shows 4905 pages 60 days into this experiment. Cached pages change weekly. MSNbot has apparently found that it likes how we changed the page structure to include a new feature which links to questions from several other article pages.

Slurp gets strangely inactive, then alternately hyperactive, for periods of time. The Yahoo crawler will look at 40 pages one day and then 4000 the next, then simply look at the home page for a few days, then jump back in for 3000 pages the next day and go back to only reviewing robots.txt for two days. Consistency is not a curse suffered by Slurp. Yahoo now shows 6 pages in their index; one is an errors page and another is an "index/of" page, as we have not posted a home page to several subdomains. But Slurp has crawled easily 15,000 pages to date.

Lessons learned in the first 60 days on a new site follow:

1) Google crawls 250 pages on first discovery of links to the site. Then they don't return until they find more links, and then they crawl slowly. Google has failed to index the new domain for 60 days.

2) Yahoo looks for errors pages, and once they find bad links they will crawl them ceaselessly until you tell them to stop. Then they won't crawl at all for weeks, until they crawl heavily one day and lightly the next in random fashion.

3) MSNbot requires robots.txt files, and once they decide they like your site they may crawl too fast, requiring "crawl-delay" instructions in that robots.txt file. Implement immediately.

4) Bad bots can strain resources and hit too many pages too quickly until you tell them to stay out. We banned 3 bots outright after they slammed our servers for a day or two. We noted that "aipbot" crawled first, then "BecomeBot" came along, and then "Pbot" crawled heavily looking for image files we don't have. Bad bots, stay out. It is best to implement robots.txt exclusions for all but the top engines if their crawlers strain your server resources. We considered excluding a Chinese search engine when they began crawling heavily early on. We don't expect much traffic from China, but why exclude one billion people? Especially since Google is rumored to be considering a possible purchase of that engine as an entry to the Chinese market.

The bottom line is that all engines seem to delay indexing of new domain names for at least thirty days. Google so far has delayed indexing THIS new domain for 60 days since first crawling it. AskJeeves has crawled thousands of pages while indexing none of them. MSN indexes faster than all other engines but requires a robots.txt file. Yahoo's Slurp crawls on again, off again for 60 days, but indexes only six of the total 15,000 or more pages crawled to date.

We seem to have settled that there is a clear indexing delay, but whether this site specifically is "Sandboxed" and whether delays apply universally is less clear. Many webmasters claim that they have been indexed fully within 30 days of first posting a new domain. We'd love to see others track spiders through new sites following launch and document their results publicly so that indexing and crawling behavior are proven.
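
For readers who want to try the robots.txt lessons above, here is a minimal sketch in Python rather than the actual file used on this site. It builds a hypothetical robots.txt that slows MSNbot with a crawl-delay and shuts out the three bots named in the article, then checks the rules with the standard library's urllib.robotparser. The 10 second delay, the rule layout and the names ROBOTS_TXT and load_rules are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The article does not give the real file,
# and the 10 second crawl-delay is an assumed value.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: Pbot
Disallow: /

User-agent: *
Disallow:
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser (helper name is illustrative)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Report what each crawler is allowed to do under these rules.
for agent in ("Googlebot", "msnbot", "Slurp", "aipbot", "BecomeBot", "Pbot"):
    allowed = rules.can_fetch(agent, "/index.html")
    delay = rules.crawl_delay(agent)
    print(agent, "allowed" if allowed else "blocked", "crawl-delay:", delay)
```

Checking the file this way before posting it may help avoid the kind of robots.txt missteps described above. Note that a crawler only honors these rules if it chooses to, and each subdomain needs its own copy of the file.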


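
The daily page counts per spider reported in this case study can be gathered from server access logs. The sketch below is one possible way to do that in Python, not the method actually used for this article. It assumes logs in the Apache "combined" format, where the user agent is the last quoted field, and the sample log lines, the SPIDERS list and the count_crawls function are all made up for illustration.

```python
import re
from collections import Counter

# Substrings that identify each spider in the User-Agent field.
SPIDERS = ("Googlebot", "msnbot", "Slurp", "Teoma")

# Captures the date and the last quoted field (the user agent) of a
# combined-format log line. The log format is an assumption.
LINE_RE = re.compile(r'\[(\d{2}/\w{3}/\d{4}):[^\]]*\].*"([^"]*)"\s*$')


def count_crawls(lines):
    """Return a Counter keyed by (day, spider) with the number of requests."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        day, agent = match.groups()
        for spider in SPIDERS:
            if spider.lower() in agent.lower():
                counts[(day, spider)] += 1
                break
    return counts


# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [11/May/2005:10:00:00 -0700] "GET /a.html HTTP/1.1" 200 512 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.1 - - [11/May/2005:10:00:05 -0700] "GET /b.html HTTP/1.1" 200 734 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.2 - - [11/May/2005:11:30:00 -0700] "GET /robots.txt HTTP/1.0" 404 209 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '192.0.2.3 - - [12/May/2005:02:15:00 -0700] "GET /a.html HTTP/1.0" 200 512 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
    '192.0.2.4 - - [12/May/2005:09:00:00 -0700] "GET /a.html HTTP/1.1" 200 512 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
]

crawl_counts = count_crawls(SAMPLE_LOG)
for (day, spider), hits in sorted(crawl_counts.items()):
    print(day, spider, hits)
```

These counts cover pages crawled only. Pages indexed still have to be checked at each search engine itself, which is how the gap between crawling and indexing shows up.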
