Weblog on the Internet and public policy, journalism, virtual community, and more from David Brake, a Canadian academic, consultant and journalist

Archive for the 'Search Engines' Category

12 August 2004

I recently read (on CNet perhaps?) that unnamed people within Yahoo are promising one-stop searching of web, email, hard disk and Yahoo services – sometime. I won’t get too excited about that until it gets close to launch.

Meanwhile, “X1”:http://www.x1.com/ (which admittedly costs $75) has been improving rapidly – it now supports Boolean and proximity searching of your hard disk, contacts, email (including Eudora and other email apps alongside Outlook, I am delighted to say) and email attachments. With those improvements I am going to start trying to use it again regularly. Download their trial version and/or “enter their sweepstakes”:http://www.x1.com/sweepstakes/index.html to win up to 50 copies.
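(If you are wondering what proximity searching actually means: it finds documents where your search terms occur within a set number of words of each other. Here is a toy sketch of the idea in Python – certainly not how X1 itself is implemented, just an illustration:)

def near(tokens, term_a, term_b, max_distance=10):
    """Return True if term_a and term_b occur within max_distance words of each other."""
    positions_a = [i for i, t in enumerate(tokens) if t == term_a]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(a - b) <= max_distance
               for a in positions_a for b in positions_b)

doc = "the meeting about the x1 desktop search tool is on friday".split()
print(near(doc, "desktop", "friday", max_distance=5))  # True: 5 words apart
print(near(doc, "meeting", "friday", max_distance=5))  # False: 9 words apart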

For more on what Microsoft is up to on this see “this post”:https://blog.org/archives/cat_search_engines.html#001134 and for Google’s plans “see here”:https://blog.org/archives/cat_search_engines.html#001119.

Update: Jeremy Wagstaff, who shares my obsession with hard disk search, has just posted a “discussion”:http://loosewire.typepad.com/blog/2004/08/the_new_search_.html of the race to provide good local search and a (probably comprehensive) “list of available programs”:http://loosewire.typepad.com/blog/2004/08/a_directory_of__2.html including three I have not yet tried – all free of charge – “Tukaroo”:http://www.tukaroo.com/, “Wilbur”:http://wilbur.redtree.com/index.htm (which is also open source) and “Blinkx”:http://www.blinkx.com/.

28 July 2004

Microsoft Watch has created a Web ranking tool which brings together various publicly available ways of assessing the popularity of a given page – Google’s PageRank, Alexa’s traffic rank and a count of total external backlinks from Yahoo (which apparently reports these much more fully than Google does).

None of these is a very precise measure, but as far as I know they are the only ones available for sites that are not big enough to turn up in commercial surveys of web popularity – anyone got any better ones? Come to that, are there any easy ways to get at some of the site popularity data produced by people like “Comscore”:http://www.comscore.com/ without paying them commercial rates?
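For what it’s worth, here is a rough sketch of how one might fold three such signals into a single score. The weights and the example numbers are entirely my own invention – I don’t know what formula Microsoft Watch’s tool actually uses:

import math

def composite_score(pagerank, alexa_rank, backlinks):
    # PageRank is already on a 0-10 scale
    pr_score = pagerank / 10.0
    # Alexa traffic rank: lower is better, so invert and compress with a log
    alexa_score = 1.0 / math.log10(alexa_rank + 10)
    # Backlink counts span orders of magnitude, so compress those too
    # (about a million backlinks maps to roughly 1.0)
    link_score = math.log10(backlinks + 1) / 6.0
    return round((pr_score + alexa_score + link_score) / 3, 3)

print(composite_score(pagerank=5, alexa_rank=120000, backlinks=850))
print(composite_score(pagerank=7, alexa_rank=3500, backlinks=42000))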

21 July 2004
Filed under: Academia, Search Engines at 10:08 am

As I “posted earlier”:https://blog.org/archives/001126.html I would like to find a way to draw a random sample of home pages from the UK. As it turns out, if you search for “personal home page” and specify that you are only interested in UK pages, Google and Yahoo will give you a selection that includes lots of home pages (the UK versions of both understand whether sites are UK-based or not, though the algorithm is not perfect). But I worry a little that there are lots of home pages that do not include the text ‘home page’ prominently, and that they might tend to be a different kind of home page (so excluding them tacitly might skew the results).

I also found that the two largest ISPs in the UK (I think) – “AOL UK”:http://hometown.aol.co.uk/mt.ssp?c=9011000 and “Wanadoo”:http://www.wanadoo.co.uk/sitebuilder/search.htm (formerly Freeserve) – have pages where you can search home pages created by their members. If you search these for a common word like “the” you can also get a seemingly random sample, but this might be tainted by any demographic skew in the kind of people who choose to use those tools. What do you think of using that as a method?
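To make the procedure concrete, here is roughly the sampling I have in mind, sketched in Python. search_uk_pages() is only a placeholder for whichever search interface is used (Google UK, Yahoo UK or an ISP’s member-page search) – it is not a real API:

import random
from urllib.parse import urlparse

COMMON_WORDS = ["the", "and", "my", "home", "about", "page"]

def search_uk_pages(query, max_results=100):
    """Placeholder: return a list of result URLs for the query."""
    raise NotImplementedError("plug in a real search interface here")

def sample_home_pages(n=100, max_queries=50, seed=42):
    random.seed(seed)
    seen_hosts = set()
    sample = []
    for _ in range(max_queries):
        word = random.choice(COMMON_WORDS)
        for url in search_uk_pages(word):
            host = urlparse(url).netloc
            if host not in seen_hosts:  # one page per host to reduce skew
                seen_hosts.add(host)
                sample.append(url)
            if len(sample) >= n:
                return sample
    return sample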

Are there other ways of sampling by keyword you could suggest? Any articles about web page sampling you can recommend?

P.S. I came across an “attempt at automating page classification”:http://students.iiit.ac.in/~kranthi/professional/papers/ieee_wpcds_1.shtml which the authors claim works, but unless I could somehow run it myself on a collection of UK URLs (and defend its reliability) it probably wouldn’t be of much help. I also ran across a second paper on “automated web page classification”:http://csdl.computer.org/comp/trans/tk/2004/01/k0070abs.htm but I couldn’t access it, and it didn’t look as if it could help in any case unless I was trying to build my own search engine.

12 July 2004
Filed under: Search Engines, Software reviews at 8:18 am

I’ve heard for a while that Microsoft plans to produce a single search tool that finds data on your hard disk and on the Internet, but I had always assumed they meant to deliver it in their next operating system (Longhorn) in 2006. Now, according to Yusuf Mehdi, head of Microsoft’s MSN division, it seems this technology will be released within 12 months. Apple also plans to incorporate this kind of search in its OS, but “as with Windows”:https://blog.org/archives/cat_search_engines.html#001061 third party apps for Mac OS X are “already available”:http://www.wired.com/news/mac/0,2125,64070,00.html to search your hard disk.

8 July 2004

An academic study on “adult learners and how they search for information”:http://www.elearningeuropa.info/doc.php?lng=1&id=5075&doclng=1&p3=1 reveals much I could have guessed but some new things too.

Only in three out of the fifty scenarios performed, the participants (one different in each case) visited a second Web page of alternatives produced by the search engine. In no case did the participants check more than eight websites, and in twenty cases out of the total fifty they only checked one website.

It also backs up what I suspected/feared about search engine use – the illusion that it is easy causes most people not to bother to invest the time to learn how to do it well. As they said:

Computer programmes, like the use of search engines appear as something not worthy to make the effort of learning. An apparent intuitive handling encourages this way of thinking. However, intuition depends on what is known and with what analogies can be built. If the analogies are incorrect, then the use of software will inevitably lead to disorientation

The full report is at “SEEKS”:http://www.seeks-it.net/.

Thanks to Pandia for providing a link and a summary of the results.

4 July 2004
Filed under: Search Engines at 10:19 am

The MSN Sandbox, where Microsoft showcases its Internet content-related technologies (similar to “Google Labs”:http://labs.google.com/), now has a preview edition of Microsoft’s new “search engine”:http://techpreview.search.msn.com/. While it works, there are no special features that I can see, or advanced search syntax to try out, at the moment.

17 June 2004

Seb Paquet references an “interesting paper”:http://www.arl.org/arl/proceedings/138/guedon.html on the history of scientific publishing and the impact of ISI ranking. It points out how assigning numerical rankings to measure academic quality distorts the way that academic research is published.

What that paper doesn’t mention – at least not in ch 6, which Seb highlighted – is that because a high citation ranking translates into money, many journals end up “gaming” their impact factors by choosing the kind of papers they publish in order to maximise it, which has unintended consequences. If a journal has 10 papers that it knows will be highly cited it may, for example, limit the number of other papers it accepts so as not to ‘dilute’ its impact factor.
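To see why ‘dilution’ matters: an impact factor is, roughly, citations to a journal’s recent papers divided by the number of papers it published. A quick worked example with numbers I have made up:

# Impact factor is (roughly) citations to recent papers / papers published
highly_cited = [120, 95, 80, 75, 60, 55, 50, 45, 40, 35]  # 10 star papers
ordinary = [5, 3, 2, 2, 1, 1, 0, 0, 0, 0]  # 10 more typical papers

def impact_factor(citations):
    return sum(citations) / len(citations)

print(impact_factor(highly_cited))             # 65.5
print(impact_factor(highly_cited + ordinary))  # 33.45, the 'diluted' figure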

It’s the same with the ranking systems used by Google and by weblog ranking search engines. If there are benefits to being scored highly, human nature being what it is, people will try to maximise their scores. Yet because the ranking is ‘automatic’ it is often assumed to be value-neutral and therefore above criticism.

15 June 2004

I wish I had the time to do a proper write-up of the NotCon session I attended featuring Brewster Kahle, the man behind the Internet Archive whose mission is nothing less than to provide universal access to all human knowledge. Here is some stuff I noted instead.

Some interesting factoids from his presentation:

* There are 150,000 people using the Internet Archive per day. It stores 300-400TB of data and recently upgraded to 1Gbps of bandwidth.
* There were 300,000 to 600,000 scrolls in the Library at Alexandria. Only around eight of them are left.
* You can store the contents of the Library of Congress as plaintext (if you had scanned it all) on a machine costing $60,000.
* The bookmobile he produced, which is connected to the Internet via satellite, travels the world and produces complete bound books from a collection of 20,000 public domain works, cost just $15,000 – and that includes the van itself.
* He says that it costs him $1 to print and bind a public domain book – I assumed the books produced would be very rough and ready but he brought some along and they were almost as good as the kind you’d buy in a shop. I suspect he may be stretching the truth a bit – I believe the $1-a-book cost he quotes is for a 100-page black and white printed booklet. It’s still impressive though, especially as:
* He notes it costs US libraries $2 to issue a book. He suggests they could give people copies of public domain books for $1 instead and pay another $1 to the author to compensate them.

Like many geniuses he just doesn’t know when to stop, and thankfully he has a private income from a dotcom or two he was involved with that enables him to try out lots of projects. Aside from archiving the web, movies, books and music he’s:

* taking the US to court to try to get their boneheaded copyright laws changed
* working on mirrors of his San Francisco-based archive in Alexandria and Amsterdam (hosted there by XS4all)
* encouraging anyone to upload anything to his archive (copyright permitting), offering unlimited bandwidth indefinitely (though the site doesn’t make it very easy to figure out how you are supposed to take advantage of this generous offer), including performance recordings of bands that have given their permission.
* trying to collect and save old software (he got special dispensation from the US copyright office to do this for the next three years but can’t make it available). He does want your old software, however, so if you’ve got some he would like you to send it to him – in physical form, with manuals where available.
* even trying to provide fast, free wifi across all of San Francisco.

He’s so hyperactive my fingers get sore just typing in all of the projects he is involved with! I worry that he’s taking on too much and that some of it may fall by the wayside if something happens to him. But his enthusiasm and his optimism are infectious. I am pleased to have been able to shake his hand.

P.S. Ironically, I recorded his presentation and have it in MP3 format, but because it was 21MB I can’t serve it myself and so far nobody has stepped forward to host the file. I finally found out how to upload it but then discovered I had deleted the original file once I passed it on to someone else to upload! So I hope someone still has a copy – if it does get posted I’ll tell you where.

29 May 2004
Filed under: Interesting facts, Search Engines at 11:02 am

An article in “Knowledge Management World”:http://www.kmworld.com/publications/magazine/index.cfm?action=readarticle&Article_ID=1725&Publication_ID=108 suggests that a lot of what knowledge workers do is re-creating knowledge that is already available but that they failed to find.

It’s ironic (and infuriating) – particularly for a magazine all about knowledge management – that none of the catchy factoids, like “90% of the time that knowledge workers spend in creating new reports or other products is spent in recreating information that already exists”, comes with a citation, so there’s no way to check their methods (though I’m guessing they aren’t particularly rigorous).

Thanks to Lilia ‘Mathemagenic’ Efimova for the link. She notes, interestingly, that maybe some people prefer to re-discover things themselves because learning for yourself is more fun than researching it…

24 May 2004

I have been thinking for a little while now that something needs to change in the practice of blogrolling. People use a lengthy blogroll to indicate what other blogs they consider interesting (telling something about their own interests) and to encourage others to link back to them, but what use is such a list to the people actually reading the weblogs themselves? “BlogRolling”:http://www.blogrolling.com/members.phtml’s practice of just listing them all in a column without comment seems to me particularly pointless – who is going to look through all the blogs on someone’s list of 50 – a mix of friends, work colleagues and random interesting stuff – on the off chance that some of them will be interesting?

That’s why I have a “single link”:http://www.bloglines.com/public/derb/ on my already over-crowded right-hand bar which leads you to the 104 weblogs I am currently tracking, all sorted into categories and sometimes even with descriptions, thanks to “bloglines”:http://www.bloglines.com/. But I worry that automated tools that measure my connectedness, like “Technorati”:http://www.technorati.com/, will not capture this and that visitors may overlook the link.

So do I include a long useless list of links somewhere just so robots can read them? What do you think? Is there a better way to tackle this? Could bloglines and the blog indexing/ratings people get together somehow?
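One half-way house I can imagine – assuming I can get an OPML export of my bloglines subscriptions (the file name below is made up) – is to generate a plain HTML blogroll from that export, so the robots have something to crawl while humans keep using the categorised bloglines page:

import xml.etree.ElementTree as ET

def opml_to_html(opml_path):
    """Turn an OPML subscription export into a flat HTML link list."""
    tree = ET.parse(opml_path)
    lines = ['<ul class="blogroll">']
    for outline in tree.iter("outline"):
        url = outline.get("htmlUrl")
        title = outline.get("title") or outline.get("text") or url
        if url:
            lines.append(f'  <li><a href="{url}">{title}</a></li>')
    lines.append("</ul>")
    return "\n".join(lines)

print(opml_to_html("bloglines_export.opml"))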
