Are you fed up with wasting time on endless searches? Most people have to refine their searches several times before finding what they actually want. That's because today's search engines can only search for words; they can't search for meaning.

SemantiFind enhances your existing search engine experience to save you time and deliver more meaningful results. It eliminates ambiguity, lets you say things the way you want, and helps filter the noise out of your existing search results. And it does all this while you keep using your search engine of choice.


The Semantic Spotlight
Power to the People.

Bruce Johnson - Wed Oct 29 18:00:21 MDT 2008

We often get asked why we took the approach we did with SemantiFind and what makes it better than alternative approaches. First of all, let me say that we don’t for a minute believe that we have found the ultimate answer to all aspects of semantic search. We do believe, however, that we have a unique, practical and elegant approach that can realistically deliver more relevant results for a wide variety of queries.

In a previous post I discussed the issues facing conventional search engines – ambiguity, having to guess search terms, and noise. SemantiFind addresses these problems by using a kind of “dictionary/thesaurus” which we refer to as an ontology.


The ontology serves two purposes. First, it allows users to pick the definitions of their search terms as they enter them, which eliminates ambiguity and the problems that arise from it. Second, the ontology maps out the relationships between equivalent terms – think synonyms, abbreviations and acronyms. So, for example, it knows that “NY”, “New York” and “Big Apple” can all be used as different ways to refer to “New York City”. This allows users to say things the way they want without having to guess what their search engine expects.
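To make the idea concrete, here is a minimal sketch of how an ontology can collapse equivalent terms onto one canonical concept. This is purely illustrative – the alias table and function names are hypothetical, not SemantiFind's actual data model:

```python
# Hypothetical synonym table: every surface form maps to one canonical concept.
ALIASES = {
    "ny": "New York City",
    "new york": "New York City",
    "big apple": "New York City",
    "new york city": "New York City",
}

def canonicalize(term: str) -> str:
    """Return the canonical concept for a term, or the term itself if unknown."""
    return ALIASES.get(term.strip().lower(), term)

print(canonicalize("Big Apple"))  # New York City
print(canonicalize("NY"))         # New York City
```

With the lookup normalized this way, a search for any of the aliases can be matched against pages classified under the single concept “New York City”.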

However, as mentioned in my prior post, this still leaves the problem of semantically labeling web pages with the ontology so that the task of matching pages to search terms can be performed.

Broadly speaking, there are two ways to classify web pages: using machines or using people. We have chosen to use people for a number of reasons.

Before I get into the advantages of human classification, let me just point out one major drawback. The internet is huge – some trillion or so pages – and getting humans to classify it completely is a daunting task at best. Still, with enough users, classifying the better, more meaningful pages is well within the realm of possibility.

Additionally, the way we have designed SemantiFind, you don’t have to wait until the entire internet (or at least the best parts of it) has been classified. SemantiFind benefits users immediately because it allows them to explicitly and implicitly (automatically) classify and catalog (index) pages they like so that those pages show up in future searches. And, because SemantiFind sits on top of conventional search engines (currently Google, Yahoo! and Microsoft Live Search), end users still get their existing search engine results – they’re just augmented and filtered by SemantiFind to add additional value.
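The augment-and-filter step described above can be sketched very simply: given an ordered list of results from a conventional engine, promote the pages users have already tagged with the concept being searched for. The index, URLs and function below are hypothetical illustrations, not SemantiFind's implementation:

```python
# Hypothetical index of pages that users have tagged with ontology concepts.
tagged_index = {
    "example.com/nyc-guide": {"New York City"},
    "example.com/lionking-review": {"The Lion King (musical)"},
}

def rerank(results, concept):
    """Move results tagged with the chosen concept ahead of untagged ones.

    sorted() is stable, so the engine's original ordering is preserved
    within the tagged group and within the untagged group.
    """
    return sorted(results, key=lambda url: concept not in tagged_index.get(url, set()))

results = ["example.com/random", "example.com/nyc-guide"]
print(rerank(results, "New York City"))
# ['example.com/nyc-guide', 'example.com/random']
```

Because the untagged results are demoted rather than dropped, users still see their familiar engine's output even before much of the web has been classified.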

So, now on to the problems with computer-based classification. While it is possible to use Natural Language Processing (NLP), in conjunction with a massive crawl of the internet and an ontology, to classify pages en masse, there are several problems.

First of all, there is the fundamental issue of inferential knowledge that isn’t necessarily present in the document being classified. As a simple example, think of a page with a capsule review of “The Lion King”. Humans are quick to understand whether it is a review of the play or the film from non-specific cues in the text, like “one of the best performances of The Lion King”.

NLP algorithms don’t necessarily understand that the phrase “one of the best performances of…” implicitly means a play, since movies don’t change between performances.

And, taking this example one step further, assume a search engine is trying to answer the query “Where is the Lion King playing?” This could mean many things. The first, most obvious (to you and me) meaning would likely be, “In which theatre, in the city I am currently in or planning to visit, is the play The Lion King currently being performed?”

NLP has several obstacles to overcome here. So many inferences in this simple query conspire to make understanding it quite difficult. First of all, you need to know that “The Lion King” is a play as opposed to some errant monarch (say, Richard the Lionheart); you need to know the user isn’t referring to the movie or the book; you need to understand whether “where” means which specific theatre or which city (and, if it is the former, you need to know which city the user is currently in); and so on.

So let’s say, for argument’s sake, that all of these obstacles have been overcome by some incredibly sophisticated, nearly sapient algorithm. You still have a number of problems in applying it across the web. First of all, NLP only handles textual information, not photos, videos or audio. Secondly, there is the issue of language: grammatical rules change for each language, especially when you get into logographic writing systems like Kanji. But lastly, and most importantly, machines currently lack the kind of judgment that humans intrinsically have – so an algorithm may be able to determine what a page is about, but it won’t be able to tell you whether it is a good one.

But, in reality, NLP and other automated approaches are still very much in their infancy and so judgment is really the last thing they need to worry about – they need to worry about getting it right.

So for now, humans still trump machines. We can deal with all forms of data (video, audio, pictorial, textual), and en masse the crowds “tend towards rightness” and exercise valuable judgment. So, for now, power to the people.

Incidentally, if you’re interested in learning a bit more about NLP and machine approaches to semantic search, I highly recommend Semantic Search: The Myth and Reality written by Alex Iskold on ReadWriteWeb.