Weblog on the Internet and public policy, journalism, virtual community, and more from David Brake, a Canadian academic, consultant and journalist


27 November 2002
Filed under: About the Internet, Academia at 11:53 pm

I’m trying to figure out (roughly) what percentage of the WWW is covered by any one search engine and/or by a reasonable selection of several.

For starters, in a March 2002 comparison of ten search engines, half of the pages found in four sample searches were found by only one search engine and another 20% by only two, which suggests to me that each of these search engines indexes only a low proportion of the total number of pages. What we don’t know, of course, is how many “public” web pages there are out there that none of the search engines find.
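Overlap figures like these can in principle be turned into a rough size estimate using the capture-recapture (Lincoln-Petersen) approach. The sketch below is not taken from the comparison mentioned above: the function name and the sample counts are made up for illustration, and it assumes the two engines index pages independently of each other, which real search engines almost certainly do not.

```python
def lincoln_petersen(n_a: int, n_b: int, n_both: int) -> float:
    """Estimate the total number of pages from two overlapping samples.

    n_a    -- pages found by engine A
    n_b    -- pages found by engine B
    n_both -- pages found by both engines
    """
    if n_both <= 0:
        raise ValueError("need at least one page found by both engines")
    # If B finds the fraction n_both / n_a of A's pages, assume B finds
    # the same fraction of all pages: n_b / total = n_both / n_a.
    return n_a * n_b / n_both


# Hypothetical counts for illustration only, NOT real data.
estimate = lincoln_petersen(n_a=60, n_b=50, n_both=20)
print(estimate)
```

Because engines tend to pick up the same popular, well-linked pages, the overlap is likely to be larger than independence would predict, so an estimate made this way is probably best read as a lower bound.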

The OCLC indicates, based on random IP sampling, that the number of web servers has roughly tripled since 1999. I chose 1999 because, according to a refereed paper in Science that year (the only study giving a figure for the number of web pages that I trust so far), there were around 800m web pages at that time. Tripling that gives a guesstimate of 2.4bn pages now.
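The back-of-the-envelope sum above can be written out as a small sketch. The function and the `pages_per_site_growth` parameter are hypothetical names added here; the default of 1.0 encodes the assumption that the average number of pages per server has not changed since 1999.

```python
PAGES_1999 = 800_000_000  # page count from the 1999 paper mentioned above
SERVER_GROWTH = 3         # "roughly tripled", per the OCLC sampling


def extrapolate_pages(pages_then, server_growth, pages_per_site_growth=1.0):
    """Scale an old page count by growth in servers and in pages per site."""
    return pages_then * server_growth * pages_per_site_growth


estimate = extrapolate_pages(PAGES_1999, SERVER_GROWTH)
print(f"{estimate / 1e9:.1f}bn pages")  # prints "2.4bn pages"
```

Passing a value above 1.0 for `pages_per_site_growth` shows how quickly the guess moves once sites are allowed to have grown as well.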

This has to be too low, though, both because of the overlap figure and because Google says it is indexing 3bn+ pages. (Of course, the average number of pages per site has probably risen sharply, which would help explain the gap.)

Anyway, is there enough information here, or elsewhere that any of you are aware of, to give me the figure I am looking for? What sort of additional information would help you to calculate a guess? I thought this would be something someone out there would be keeping track of, but it seems not.

If you have any ideas, please contact me ASAP (or, if you prefer, post something in the comments below).

P.S. There’s lots of good search-engine-related information at searchenginewatch and searchengineshowdown, but nothing that relates to this particular issue since the 1999 piece. I am not that surprised, as it is more of a theoretical than a practical concern for most web surfers and website marketers.