I tend to assume that, for all its flaws, The Economist gets its facts right – at least on technical issues. But a recent article on How Google Works in their technology section repeats a popular misconception about search. The article says, ‘Google is thought to have several complete copies of the web distributed across servers in California and Virginia’ – whatever they do have, it is nothing close to a complete copy of the web. Even if they had a complete index of the text of the first 100KB of each page on the publicly spidered web (the most they would even claim), this would still miss the huge volume of available information that is stored in web-accessible databases (like the “British Telecom phone book”:http://www2.bt.com/edq_busnamesearch).
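To make the distinction concrete, here is a minimal sketch (in Python, and purely illustrative – not how Google actually works) of why a link-following spider never sees that database-backed content: it can only queue URLs that appear as hyperlinks in pages it has already fetched, and it keeps at most the first 100KB of each page. The `fetch` function is an assumed helper that returns a page's HTML.

```python
# Illustrative sketch only: a link-following crawler misses 'invisible web'
# content because it can only reach URLs that appear as hyperlinks in pages
# it has already fetched. 'fetch' is an assumed helper returning HTML text.
from html.parser import HTMLParser
from urllib.parse import urljoin

MAX_BYTES = 100 * 1024  # roughly the per-page indexing limit cited above


class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags. <form> actions are ignored, so
    anything reachable only via a search form (e.g. a phone-book lookup)
    never enters the crawl queue."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(fetch, seeds, page_limit=100):
    """Breadth-first crawl over hyperlinks only."""
    seen, queue, index = set(), list(seeds), {}
    while queue and len(index) < page_limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)[:MAX_BYTES]   # truncate before 'indexing'
        index[url] = html               # stands in for real text indexing
        parser = LinkExtractor(url)
        parser.feed(html)
        queue.extend(parser.links)      # only hyperlinked URLs are discoverable
    return index
```

A record like a BT phone-book entry only exists as the response to a form submission, so it never appears among the extracted links and never gets indexed at all.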
I believe that a search engine that managed to do a good job of searching this ‘invisible web’ alongside the ‘surface web’ would have a good shot at the number one spot.
P.S. While on the subject of search, here’s a tip – to get a (small) discount on your next Amazon purchase, check out their new A9 search engine.
In fact, I recall reading somewhere that only about 10% of the pages on the Internet have been crawled by all the search engines combined.
Comment by iProceed.com — 8 October 2004 @ 1:56 am
Sounds like this study of the invisible web:
Bergman, M. K. (2001) “The Deep Web: Surfacing Hidden Value”, The Journal of Electronic Publishing, 7 (1). http://www.press.umich.edu/jep/07-01/bergman.html
I think the ratio is getting better, and, notwithstanding what they say, a lot of invisible web information is stuff most web searchers will never use – but the general point is worth making.
Comment by David Brake — 8 October 2004 @ 8:31 am