As I “posted earlier”:https://blog.org/archives/001126.html I would like to find a way to draw a random sample of home pages from the UK. As it turns out, if you search for “personal home page” and specify that you are only interested in UK pages, Google and Yahoo will give you a selection that includes lots of home pages (the UK versions of both can tell whether sites are UK or not, though the algorithm is not perfect). But I worry a little that there are lots of home pages that do not include the text ‘home page’ prominently, and that they might tend to be a different kind of home page, so tacitly excluding them might skew the results.
I also found that the two largest ISPs in the UK (I think) – “AOL UK”:http://hometown.aol.co.uk/mt.ssp?c=9011000 and “Wanadoo”:http://www.wanadoo.co.uk/sitebuilder/search.htm (formerly Freeserve) – have pages where you can search home pages created by their members. If you search these for a common word like “the” you can also get a seemingly random sample, but this might be tainted by any demographic skew in the kind of people who choose to use those tools. What do you think of using that as a method?
Are there other ways of sampling by keyword you could suggest? Any articles about web page sampling you can recommend?
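One way to combine the two approaches above would be to pool the URLs from each source, remove duplicates, and then draw a fixed-size sample with a recorded seed so the draw can be repeated. The following is only a minimal sketch under assumptions: the function names and the example.co.uk addresses are made up for illustration, and collecting the URL lists from the search pages is left out. Note that a uniform draw from the pooled list is not a uniform draw from all UK home pages; each source keeps whatever skew it had.

```python
import random
from urllib.parse import urlsplit


def normalise(url):
    """Lower-case the host and drop any #fragment so duplicate URLs collapse."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme}://{parts.netloc.lower()}{path}{query}"


def pooled_sample(sources, k, seed=0):
    """Draw up to k distinct URLs from the union of all source lists.

    `sources` maps a label (e.g. "keyword search") to a list of URLs
    collected by hand or by script; the labels are only for bookkeeping.
    """
    pool = sorted({normalise(u) for urls in sources.values() for u in urls})
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    return rng.sample(pool, min(k, len(pool)))


# Hypothetical inputs: example.co.uk addresses stand in for real results.
sources = {
    "keyword search": [
        "http://www.example.co.uk/~ann/",
        "http://WWW.example.co.uk/~bob/",
    ],
    "isp directory": [
        "http://www.example.co.uk/~bob/#top",
        "http://www.example.co.uk/~cat/",
    ],
}
sample = pooled_sample(sources, k=2, seed=42)
```

Keeping the source labels alongside the URLs would also make it possible to check afterwards how many sampled pages came from each route, which is one rough way to see whether the keyword search and the ISP directories are turning up different kinds of page.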
P.S. I came across an “attempt at automating page classification”:http://students.iiit.ac.in/~kranthi/professional/papers/ieee_wpcds_1.shtml which the authors claim works, but unless I could somehow run it myself on a collection of UK URLs (and defend its reliability) it probably wouldn’t be of much help. I also ran across a second paper on “automated web page classification”:http://csdl.computer.org/comp/trans/tk/2004/01/k0070abs.htm but I couldn’t access it, and it didn’t look as if it could help in any case unless I was trying to build my own search engine.
Not quite sure what you are trying to do, but if you want to query Google for only UK sites you can try querying Google with site:.co.uk. It should give you 1000 .co.uk sites!
Comment by Jim Richards — 21 July 2004 @ 5:26 pm
If you query Google UK with “give only UK sites” set you get an even better group (because it includes .com pages that are nonetheless from the UK). The real problem is just that searching for “personal home page” cuts out (obviously) anyone whose home page doesn’t include those words!
Comment by David Brake — 21 July 2004 @ 5:35 pm
Yes, but how do you define a personal home page? Maybe you could say a page was a personal home page if it contains the words “personal home page”?
There’s a related problem with e-mail addresses: how do you define a personal e-mail address… one that ends in .me.uk? How many of those have you ever sent mail to?
Does a personal home page stop being a personal page when people stick up Google Ads and affiliate links, and turn it into a money-making enterprise?
In the e-mail arena I believe a non-.me.uk e-mail address is considered personal if it came with a subscription to a home ISP. Perhaps that definition could be extended here: personal home pages are all those pages hosted on webspace provided by an ISP with a domestic internet access account?
Comment by odd stuff at amazon — 26 July 2004 @ 4:29 pm
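The hosting-based definition suggested in the comment above could be turned into a simple test on a URL's host name. This is a rough sketch only: the host list is an assumption (hometown.aol.co.uk is taken from the search page linked in the post, and whether members' pages actually live on that host is not confirmed; the second entry is a made-up placeholder), and `is_isp_hosted` is a hypothetical name.

```python
from urllib.parse import urlsplit

# Assumed list of hosts where ISPs publish members' pages. Only the first
# appears in the post; the second is a made-up placeholder.
ISP_MEMBER_HOSTS = {"hometown.aol.co.uk", "members.example-isp.co.uk"}


def is_isp_hosted(url, isp_hosts=ISP_MEMBER_HOSTS):
    """Return True if the URL's host is an ISP member host or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in isp_hosts)
```

A rule like this would miss personal pages on their own domains and would count any commercial page that happens to sit on ISP webspace, so it only narrows the question of definition rather than settling it.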