As I “posted earlier”:http://blog.org/archives/001126.html, I would like to find a way to draw a random sample of home pages from the UK. As it turns out, if you search for “personal home page” and specify that you are only interested in UK pages, Google and Yahoo will give you a selection that includes lots of home pages (the UK versions of both try to tell whether sites are UK or not, though the algorithm is not perfect). But I worry a little that there are lots of home pages that do not include the text ‘home page’ prominently, and that these might tend to be a different kind of home page, so tacitly excluding them might skew the results.
I also found that the two largest ISPs in the UK (I think) – “AOL UK”:http://hometown.aol.co.uk/mt.ssp?c=9011000 and “Wanadoo”:http://www.wanadoo.co.uk/sitebuilder/search.htm (formerly Freeserve) – have pages where you can search home pages created by their members. If you search these for a common word like “the”, you also get a seemingly random sample, but it might be tainted by any demographic skew in the kind of people who choose to use those tools. What do you think of using that as a method?
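For what it is worth, here is a minimal sketch of how the sampling step itself might look once URLs have been collected from sources like the ones above. Everything in it is an assumption made for illustration: the function name @sample_home_pages@, the source labels and the example URLs are all hypothetical. It pools the URLs, drops exact duplicates and draws a simple random sample with a fixed seed so the draw can be repeated. It does nothing about the demographic skew of the sources themselves.

```python
import random


def sample_home_pages(urls_by_source, k, seed=0):
    """Pool URLs from several sources, drop duplicates, and draw k at random.

    urls_by_source: dict mapping a source label (hypothetical, e.g. "aol")
    to a list of URL strings collected from that source.
    """
    # A set removes URLs that appear in more than one source;
    # sorting makes the draw independent of set iteration order.
    pooled = sorted(
        {url.strip() for urls in urls_by_source.values() for url in urls if url.strip()}
    )
    if k > len(pooled):
        raise ValueError(f"asked for {k} pages but only {len(pooled)} unique URLs collected")
    # A dedicated, seeded generator keeps the sample reproducible.
    rng = random.Random(seed)
    return rng.sample(pooled, k)


if __name__ == "__main__":
    # Hypothetical URLs standing in for search results copied from each source.
    collected = {
        "google_uk": ["http://example.org/~alice/", "http://example.org/~bob/"],
        "aol": ["http://example.org/~bob/", "http://example.org/~carol/"],
        "wanadoo": ["http://example.org/~dave/"],
    }
    for url in sample_home_pages(collected, 2, seed=42):
        print(url)
```

Each unique URL in the pool has the same chance of being picked, but the pool is only as representative as the searches that filled it, so a page that no source returns can never be sampled.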
Are there other ways of sampling by keyword you could suggest? Any articles about web page sampling you can recommend?
P.S. I came across an “attempt at automating page classification”:http://students.iiit.ac.in/~kranthi/professional/papers/ieee_wpcds_1.shtml which the authors claim works, but unless I could somehow run it myself on a collection of UK URLs (and defend its reliability) it probably wouldn’t be of much help. I also ran across a second paper on “automated web page classification”:http://csdl.computer.org/comp/trans/tk/2004/01/k0070abs.htm but I couldn’t access it, and in any case it didn’t look as if it could help unless I was trying to build my own search engine.