Search engine listing delays that have come to be called the Google Sandbox effect are actually true in practice at each of the four top-tier search engines in one form or another. MSN, it seems, has the shortest indexing delay at 30 days. This article is the second in a series following the spiders through a brand new web site, beginning on May 11, 2005, when the site was first made live under a newly purchased domain name.

First Case Study Article

Previously we looked at the first 35 days and detailed the crawling behavior of Googlebot, Teoma, MSNbot and Slurp as they traversed the pages of this new site. We discovered that each robot spider displays distinctly different behavior in crawling frequency and similarly differing indexing patterns.

For reference, about 15 to 20 new pages are added to the site daily, and each is linked from the home page for a day. Site structure is non-traditional, with no categories and a linking structure tied to author pages listing their articles, as well as a "related articles" index that links to relevant pages containing similar content.

So let's review where we are with each spider, look at pages crawled, and compare pages indexed by engine.

The AskJeeves spider, Teoma, has crawled most of the pages on the site, yet indexes no pages 60 days later at this writing. This is clearly a site aging delay that's modeled on Google's Sandbox behavior. The Teoma spider has crawled more pages on this site than any other engine over a 60 day period, although it now appears to be tired of crawling, as it has not returned since July 13 - its first break in 60 days.

In the first two days, Googlebot gobbled up 250 pages and didn't return until 60 days later, and it has not indexed even a single page in the 60 days since that initial crawl. But Googlebot is showing a renewed interest in crawling the site since the first crawling case study article was published on several high traffic sites. Now Googlebot is looking at a few pages each day, so far no more than about 20 pages, at a decidedly lackluster pace: a true "crawl" that will keep it occupied for years if continued that slowly.

MSNbot crawled timidly for the first 45 days, looking over 30 to 50 pages daily, but not until it found a robots.txt file, which we'd neglected to post to the site for a week. We then bobbled the ball as we changed site structure and failed to implement robots.txt in new subdomains until day 25 - and THEN MSNbot didn't return until day 30. If little else were discovered about initial crawls and indexing, we have seen that MSNbot relies heavily on that robots.txt file and that proper implementation of that file will speed crawling.

MSNbot is now crawling with enthusiasm at anywhere between 200 and 800 pages daily. As a matter of fact, we had to use a "crawl-delay" command in the robots.txt file after MSNbot began hitting 6 pages per second last week. The MSN index now shows 4905 pages 60 days into this experiment. Cached pages change weekly. MSNbot has apparently found that it likes how we changed the page structure to include a new feature which links to questions from several other article pages.

Slurp gets strangely inactive, then alternately hyperactive, for periods of time. The Yahoo crawler will look at 40 pages one day and then 4000 the next, then simply look at the home page for a few days, then jump back in for 3000 pages the next day and go back to only reviewing robots.txt for two days. Consistency is not a curse suffered by Slurp. Yahoo now shows 6 pages in its index; one is an errors page and another is an "index/of" page, as we have not posted a home page to several subdomains. But Slurp has crawled easily 15,000 pages to date.

Lessons learned in the first 60 days on a new site follow:

1) Google crawls 250 pages on first discovery of links to the site. Then they don't return until they find more links, and then they crawl slowly. Google has failed to index the new domain for 60 days.

2) Yahoo looks for errors pages, and once they find bad links they will crawl them ceaselessly until you tell them to stop it. Then they won't crawl at all for weeks, until crawling heavily one day and lightly the next in random fashion.

3) MSNbot requires robots.txt files, and once they decide they like your site, they may crawl too fast, requiring "crawl-delay" instructions in that robots.txt file. Implement immediately.

4) Bad bots can strain resources and hit too many pages too quickly until you tell them to stay out. We banned 3 bots outright after they slammed our servers for a day or two. We noted that "aipbot" crawled first, then "BecomeBot" came along, and then "Pbot" crawled heavily looking for image files we don't have. Bad bots, stay out. It is best to implement robots.txt exclusions for all but the top engines if their crawlers strain your server resources. We considered excluding a Chinese search engine when they began crawling heavily early on. We don't expect much traffic from China, but why exclude one billion people? Especially since Google is rumored to be considering a possible purchase as its entry to the Chinese market.

The bottom line is that we've discovered all engines seem to delay indexing of new domain names for at least thirty days. Google so far has delayed indexing THIS new domain for 60 days since first crawling it. AskJeeves has crawled thousands of pages while indexing none of them. MSN indexes faster than all other engines but requires a robots.txt file. Yahoo's Slurp has crawled on again, off again for 60 days, but indexes only six of the 15,000 or more pages crawled to date.

We seem to have settled that there is a clear indexing delay, but whether this site specifically is "Sandboxed" and whether delays apply universally is less clear. Many webmasters claim that they have been indexed fully within 30 days of first posting a new domain. We'd love to see others track spiders through new sites following launch and document their results publicly so that indexing and crawling behavior are proven.
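For readers who want to act on lessons 3 and 4, here is a minimal sketch of what such a robots.txt might look like, checked with Python's standard `urllib.robotparser` module. The file contents are an assumption for illustration only: the bot names are the ones mentioned in this article, but the 10 second delay and the test path are made up, and this is not the actual file used on the site. Note also that "Crawl-delay" is a non-standard directive that only some crawlers honor.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; the article does not show the
# real file. Bot names come from the article; the 10-second delay is assumed.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10
Disallow:

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: Pbot
Disallow: /

User-agent: *
Disallow:
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text (no network access) and return the parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    # Report, per crawler, whether a sample page may be fetched and
    # which crawl delay (if any) the file requests.
    for bot in ("msnbot", "Googlebot", "aipbot", "BecomeBot", "Pbot"):
        allowed = rules.can_fetch(bot, "/articles/example.html")
        delay = rules.crawl_delay(bot)
        print(f"{bot}: allowed={allowed}, crawl_delay={delay}")
```

Checking the file locally like this before posting it is one way to avoid the kind of fumble described above. Because robots.txt applies per host, a copy would be needed at the root of each subdomain as well.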