Building a better Google
March 8th, 2011 by David Bradley
Search engines are big business so it’s perhaps no surprise that the companies running them expend so much energy trying to make their search engine results pages (SERPs) as good as possible. Superficially, this is done for their users, but given that the business models rely on good results generating good advertising revenues it’s also no surprise that those same companies want their SERPs to be as worthy as possible so that they can serve up solid ads at a good price.
With every new tweak to a search engine’s ranking algorithm, however, there are winners and losers. In an effort to prevent webmasters, and the techno-marketing types known as search engine optimization (SEO) companies that many of them employ, from gaming the ranking system, the search companies carry out frequent updates. They try to weed out the spam, to de-rank the splogs, to cut the relevance of link farms and, most recently in the case of the major player in search, to eradicate so-called content farms. Unfortunately, the algorithms they employ are not perfect. Ever since the first electrons pinged across the wires there has been some kind of spam, but today the SERPs of the major search engines seem to be becoming less and less useful to us, the end users.
Yahoo has fallen out of favor with most users, Microsoft’s Bing allegedly “borrowed” Google’s results, and with Google’s latest algorithm update aimed at getting rid of content farms, it seems that solid, well-known sites have been thrown out alongside many black-hat sites. Moreover, as anyone who has Googled anything knows (and that’s pretty much everyone), there is a huge amount of redundancy in the SERPs of most search engines and not necessarily any understanding of context or variation in content.
For instance, search for the term “flash” and a search engine will throw back three or four links to Adobe’s product, Flash, and then a few thousand more pages of random stuff. It’s unlikely that you will see “flash memory” or “Federal Alliance for Safe Home” or other non-Adobe entries in the top 10 hits. When I did that search today, the first four entries pointed to Adobe’s various websites. The fifth and sixth entries were Wikipedia pages about Adobe Flash, and the next was an Adobe Flash tutorial; it wasn’t until I scrolled below the fold that non-Adobe subjects appeared. A similar problem arises with other ambiguous search terms such as apple, virgin, cougars, lions, jaguar, football, tea party, etc.
That’s not good.
Two computer scientists at Oakland University, in Rochester, Michigan, hope to change all that. Guangzhi Qu and grad student Hui Wu recognize as well as anyone that the explosive growth in information technology, knowledge and the Internet means that a good search engine is an indispensable tool. They point out that the ranking schemes used by the mainstream commercial search engines, which produce one long list of results, are perhaps not the most effective approach to finding information.
“In an ideal search experience, the search engine should respond to a query with as many relevant documents as possible in the top search results and minimize the requirement of reformulating the query for further searching,” the pair says. Additionally, duplicated or redundant entries, such as a cluster of hits for the same company page, could be culled to increase the diversity of results displayed and so give the end user more chance of finding the information they need quickly.
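The culling the researchers describe can be illustrated with a toy sketch. The paper does not spell out this particular mechanism; the function below is a hypothetical illustration that simply caps how many hits any one web domain may contribute to the results, preserving rank order otherwise.

```python
from urllib.parse import urlparse

def dedupe_by_domain(results, per_domain=1):
    """Keep at most `per_domain` hits from each web domain,
    preserving the original ranking order."""
    seen = {}   # domain -> hits already kept
    kept = []
    for url, title in results:
        domain = urlparse(url).netloc
        seen[domain] = seen.get(domain, 0) + 1
        if seen[domain] <= per_domain:
            kept.append((url, title))
    return kept

# Hypothetical SERP for the query "flash": two Adobe pages,
# two Wikipedia pages, one non-Adobe organization.
results = [
    ("https://www.adobe.com/flash", "Adobe Flash"),
    ("https://www.adobe.com/flash/download", "Download Flash Player"),
    ("https://en.wikipedia.org/wiki/Adobe_Flash", "Adobe Flash - Wikipedia"),
    ("https://en.wikipedia.org/wiki/Flash_memory", "Flash memory - Wikipedia"),
    ("https://www.flashsafehome.org", "Federal Alliance for Safe Homes"),
]
print(dedupe_by_domain(results))
```

With `per_domain=1`, the five hits collapse to three, one per domain, which is exactly the kind of pruning that would free up above-the-fold space for more diverse results.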
The researchers have now developed an adaptive topic discovering (ATD) algorithm that weighs the relevance of complicated documents in a set of search results. It is based on an initial systematic, unsupervised learning method that builds up a glossary, or in geek-speak an ontology, associated with a large sample of documents. Each entry in the results is effectively tagged behind the scenes and then ranked accordingly. The result is that whereas Google’s first page of results for “flash” is mostly Adobe Flash links, the ATD algorithm “knows” that not all flash is Adobe, so it removes redundancy and ranks non-Adobe items higher. It would work equally well in knowing that there are lions the big cats, the Detroit Lions, Lions Clubs International and the Texan rock band, The Lions.
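To get a feel for what re-ranking by topic tags looks like, here is a minimal sketch. This is not the authors’ weighted-graph ATD algorithm; it assumes the behind-the-scenes tagging has already happened and shows only the final step, a simple greedy re-ranking that prefers topics not yet shown, breaking ties by relevance score. All the topic labels and scores are invented for illustration.

```python
def diversify(results, top_k=10):
    """Greedily re-rank (score, topic, title) tuples so the top of
    the list covers as many distinct topics as possible: each pass
    takes the best remaining hit whose topic has appeared least."""
    remaining = list(results)
    counts = {}   # topic -> times already shown
    ranked = []
    while remaining and len(ranked) < top_k:
        # prefer the least-seen topic; break ties by relevance score
        best = min(remaining, key=lambda r: (counts.get(r[1], 0), -r[0]))
        remaining.remove(best)
        counts[best[1]] = counts.get(best[1], 0) + 1
        ranked.append(best)
    return ranked

# Invented hits for "flash": three near-duplicate Adobe pages
# crowd out two other senses of the word.
hits = [
    (0.99, "adobe-flash", "Adobe Flash Player"),
    (0.98, "adobe-flash", "Flash Player download"),
    (0.97, "adobe-flash", "Adobe Flash - Wikipedia"),
    (0.80, "flash-memory", "Flash memory - Wikipedia"),
    (0.60, "flood-safety", "Federal Alliance for Safe Homes"),
]
for score, topic, title in diversify(hits, top_k=3):
    print(topic, title)
```

Run on the invented hits above, the top three slots now cover three different senses of “flash” instead of three Adobe pages, which is the user-visible effect the researchers are after.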
The researchers explain that much effort has gone into studying the diversity of search engine results and that on the whole it has taken a statistical approach to devising new algorithms. The Oakland team’s approach sidesteps the need to gather statistical information and instead figures out the relationships between documents before ranking results. Whether any major search engine is likely to adopt such an approach is an open question. But if Qu and Wu can give us better search results, they could be on to a winner. Maybe someday soon we’ll be Qu-Wu’ing for information rather than Googling it.
Qu, G., & Wu, H. (2011). A weighted-graph-based approach for diversifying search results. International Journal of Knowledge and Web Intelligence, 2(1). DOI: 10.1504/IJKWI.2011.038626
A search for Qu-Wu.com brought up no results in Google but there is someone on Facebook called Qu Wu, a geologic formation called the Quwu, and a bargain shopping site with a Tokelau (.tk) domain.