Search engine listing delays, which have come to be called the Google Sandbox effect, are actually true in practice at each of four top tier search engines in one form or another. MSN, it seems, has the shortest indexing delay at 30 days. This article is the second in a series following the spiders through a brand new web site, beginning on May 11, 2005, when the site was first made live under a newly purchased domain name.

First Case Study Article

Previously we looked at the first 35 days and detailed the crawling behavior of Googlebot, Teoma, MSNbot and Slurp as they traversed the pages of this new site. We discovered that each robot spider displays distinctly different behavior in crawling frequency and similarly differing indexing patterns.

For reference, there are about 15 to 20 new pages added to the site daily, each of which is linked from the home page for a day. Site structure is non-traditional, with no categories and a linking structure tied to author pages listing their articles, as well as a "related articles" index that links to relevant pages containing similar content.

So let's review where we are with each spider, look at pages crawled, and compare pages indexed by engine.

The AskJeeves spider, Teoma, has crawled most of the pages on the site, yet indexes no pages 60 days later at this writing. This is clearly a site aging delay that's modeled on Google's Sandbox behavior. The Teoma spider has crawled more pages on this site than any other engine over a 60 day period, and it appears to be tired of crawling, as it has not returned since July 13, its first break in 60 days.

In the first two days, Googlebot gobbled up 250 pages and didn't return until 60 days later, and it has not indexed even a single page in the 60 days since it made that initial crawl. But Googlebot is showing a renewed interest in crawling the site since this crawling case study article was published on several high traffic sites. Now Googlebot is looking at a few pages each day, so far no more than about 20 pages at a decidedly lackluster pace, a true "Crawl" that will keep it occupied for years if continued that slowly.

MSNbot crawled timidly for the first 45 days, looking over 30 to 50 pages daily, but not until it found a robots.txt file. We had neglected to post that file to the site for a week, then bobbled the ball as we changed site structure and failed to implement robots.txt in new subdomains until day 25, and THEN MSNbot didn't return until day 30. If little else were discovered about initial crawls and indexing, we have seen that MSNbot relies heavily on that robots.txt file and that proper implementation of that file will speed crawling.

MSNbot is now crawling with enthusiasm at anywhere between 200 and 800 pages daily. As a matter of fact, we had to use a "crawl-delay" command in the robots.txt file after MSNbot began hitting 6 pages per second last week. The MSN index now shows 4905 pages 60 days into this experiment. Cached pages change weekly. MSNbot has apparently found that it likes how we changed the page structure to include a new feature which links to questions from several other article pages.

Slurp gets strangely inactive, then alternately hyperactive, for periods of time. The Yahoo crawler will look at 40 pages one day and then 4000 the next, then simply look at the home page for a few days, then jump back in for 3000 pages the next day, and then go back to only reviewing robots.txt for two days. Consistency is not a curse suffered by Slurp. Yahoo now shows 6 pages in their index; one is an errors page and another is an "index/of" page, as we have not posted a home page to several subdomains. But Slurp has crawled easily 15,000 pages to date.

Lessons learned in the first 60 days on a new site follow:

1) Google crawls 250 pages on first discovery of links to the site. Then they don't return until they find more links, and then they crawl slowly. Google has failed to index the new domain for 60 days.

2) Yahoo looks for errors pages, and once they find bad links they will crawl them ceaselessly until you tell them to stop. Then they won't crawl at all for weeks, until crawling heavily one day and lightly the next in random fashion.

3) MSNbot requires robots.txt files, and once they decide they like your site they may crawl too fast, requiring "crawl-delay" instructions in that robots.txt file. Implement immediately.

4) Bad bots can strain resources and hit too many pages too quickly until you tell them to stay out. We banned 3 bots outright after they slammed our servers for a day or two. We noted that "aipbot" crawled first, then "BecomeBot" came along, and then "Pbot" crawled heavily looking for image files we don't have. Bad bots, stay out. It is best to implement robots.txt exclusions for all but the top engines if their crawlers strain your server resources. We considered excluding a Chinese search engine when it began crawling heavily early on. We don't expect much traffic from China, but why exclude one billion people? Especially since Google is rumored to be considering a possible purchase of it as an entry to the Chinese market.

The bottom line is that we've discovered all engines seem to delay indexing of new domain names for at least thirty days. Google so far has delayed indexing THIS new domain for 60 days since first crawling it. AskJeeves has crawled thousands of pages while indexing none of them. MSN indexes faster than all other engines but requires a robots.txt file. Yahoo's Slurp has crawled on again, off again for 60 days, but indexes only six of the 15,000 or more pages crawled to date.

We seem to have settled that there is a clear indexing delay, but whether this site specifically is "Sandboxed" and whether delays apply universally is less clear. Many webmasters claim that they have been indexed fully within 30 days of first posting a new domain. We'd love to see others track spiders through new sites following launch and document their results publicly so that indexing and crawling behavior are proven.
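Lessons 3 and 4 both come down to a few lines in robots.txt, so here is a minimal sketch of what such a file might look like and how to check it before posting it. The bot names are the ones mentioned in this article; the 10 second delay, the sample paths, and the helper name `load_rules` are our own assumptions for illustration, not the values used on the site. The check relies on Python's standard `urllib.robotparser` module, which may not interpret the file exactly as each real crawler does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: slow MSNbot down, shut out the three bots
# named in the article, and leave every other crawler unrestricted.
# The Crawl-delay value of 10 is an assumption, not the site's setting.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10
Disallow:

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: Pbot
Disallow: /

User-agent: *
Disallow:
"""


def load_rules(text):
    """Parse robots.txt text (no network access) and return the parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

if __name__ == "__main__":
    for bot in ("msnbot", "Googlebot", "aipbot", "BecomeBot", "Pbot"):
        allowed = rules.can_fetch(bot, "/articles/sample.html")
        print(bot, "allowed" if allowed else "blocked",
              "delay:", rules.crawl_delay(bot))
```

Remember that each subdomain needs its own copy of the file, which is the step we missed until day 25.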
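For anyone who wants to track spiders through their own new site, the daily page counts quoted in this article can be approximated from server access logs. The sketch below is one rough way to do it, assuming logs in the common "combined" format; the sample log lines, the `CRAWLERS` list, and the function name `tally_crawls` are illustrative assumptions and not the tooling we actually used.

```python
import re
from collections import defaultdict

# Substrings to look for in the User-Agent field (case-insensitive).
CRAWLERS = ("Googlebot", "msnbot", "Slurp", "Teoma")

# Matches the date, request path, and user agent of a combined-format line.
LOG_LINE = re.compile(
    r'\[(?P<day>[^:]+):[^\]]+\] '
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def tally_crawls(lines):
    """Return {(day, crawler): number of distinct paths requested}."""
    pages = defaultdict(set)
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        for bot in CRAWLERS:
            if bot.lower() in agent:
                pages[(match.group("day"), bot)].add(match.group("path"))
                break
    return {key: len(paths) for key, paths in pages.items()}


# Made-up sample lines for illustration only.
SAMPLE = [
    '192.0.2.1 - - [11/Jul/2005:10:00:00 -0500] "GET /robots.txt HTTP/1.0" 200 120 "-" "msnbot/1.0"',
    '192.0.2.1 - - [11/Jul/2005:10:00:05 -0500] "GET /a.html HTTP/1.0" 200 900 "-" "msnbot/1.0"',
    '192.0.2.2 - - [12/Jul/2005:08:30:00 -0500] "GET /a.html HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.3 - - [12/Jul/2005:09:00:00 -0500] "GET /a.html HTTP/1.1" 200 900 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

if __name__ == "__main__":
    for (day, bot), count in sorted(tally_crawls(SAMPLE).items()):
        print(day, bot, count)
```

Counting distinct paths rather than raw hits keeps a crawler that requests the same page repeatedly from inflating its total. Matching on the user agent string alone trusts whatever the visitor claims to be, so treat the numbers as estimates.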