Saturday, October 2, 2010

How Google, Yahoo and Bing retrieve results so quickly

The other day I was working with multi cURL in PHP to retrieve results from Google, Yahoo and Bing. It is surprising that these search engines can answer a query matching millions of documents in less than a second. Everyone thinks it's because of powerful, distributed servers, but there is also a very simple factor: all of these search engines return only the top 1000 documents or so. If you try to go beyond that, they simply don't allow it. For example, try the following URLs:

http://www.google.com.pk/search?q=nirvana&hl=en&client=firefox-a&hs=v61&rls=org.mozilla:en-US:official&prmd=vli&ei=LCinTP7ENZKiuQPAp8SADQ&start=640&sa=N

http://www.bing.com/search?q=nirvana&go=&qs=n&sk=&sc=4-7&first=1596&FORM=PERE7

http://search.yahoo.com/search;_ylt=A0oGdUnHKKdMFTkAMGhXNyoA?p=nirvana&ei=UTF-8&fr=yfp-t-963&xargs=0&pstart=1&b=1101&xa=cymfuMaNmNDApYTZxBtEDg--,1286109767


In the Google URL, the parameter "start=640" means: return documents starting from the 640th document.
In the Bing URL, the parameter "first=1596" means: return documents starting from the 1596th document.
In the Yahoo URL, the parameter "b=1101" means: return documents starting from the 1101st document.
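
To make the three offset parameters concrete, here is a minimal Python sketch that pulls the offset out of a results URL. The parameter names (start, first, b) are the ones seen in the URLs above; the function name result_offset and the idea of matching the engine by host name are my own assumptions for illustration, not anything the search engines document.

```python
from urllib.parse import urlparse, parse_qs

# Offset parameter per engine, as observed in the URLs above.
# These reflect the URL formats in this post and may not hold elsewhere.
OFFSET_PARAM = {"google": "start", "bing": "first", "yahoo": "b"}


def result_offset(url):
    """Return (engine, offset) for a search-results URL, or None if unrecognized.

    Hypothetical helper: the engine is guessed from the host name.
    """
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    for engine, param in OFFSET_PARAM.items():
        if engine in parsed.netloc and param in query:
            return engine, int(query[param][0])
    return None


if __name__ == "__main__":
    print(result_offset("http://www.bing.com/search?q=nirvana&first=1596"))
    # prints ('bing', 1596)
```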

Now, if you visit the Google URL above, you'll see that page links beyond page 65 are not even shown, and if you try to go beyond that, it simply comes back to the 65th page. Similarly, Yahoo and Bing don't show page links beyond page 100.

This shows that Google returns no more than 650 documents if you use "show 10 results per page", and 700 at most if you use "show 100 results per page", whereas Bing and Yahoo both go only up to 1000.
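
The page arithmetic behind those numbers can be sketched in a few lines of Python. This assumes a zero-based offset (so page 1 starts at offset 0, which is how page 65 at 10 per page lines up with start=640 in the Google URL above) and a hard cap of 1000 results; page_to_offset and its parameters are hypothetical names of mine.

```python
def page_to_offset(page, per_page=10, max_results=1000):
    """Zero-based offset of the first result on a 1-based page.

    Returns None when the page would start at or beyond the cap,
    mimicking an engine that refuses to serve deeper results.
    """
    offset = (page - 1) * per_page
    if offset >= max_results:
        return None
    return offset


if __name__ == "__main__":
    print(page_to_offset(65))    # prints 640
    print(page_to_offset(101))   # prints None
```

With 10 results per page, page 65 ends at result 650, and page 100 is the last page that fits under a cap of 1000.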

Another interesting thing is that Google even shows an error page if you try to go beyond 1000. Try the following URL:

http://www.google.com.pk/search?q=nirvana&num=100&hl=en&lr=&client=firefox-a&hs=Mqi&rls=org.mozilla:en-US:official&prmd=b&ei=Lj2nTPDODYTovQOI5d37DA&start=2000&sa=N

The above simply displays the error: "Sorry, Google does not serve more than 1000 results for any query. (You asked for results starting from 2000.)"


What these search engines seem to be doing is that, during indexing, they store the count of documents containing each term. At query time, all they need to do is combine those counts for the query terms (the size of the union of the documents containing the terms) and show that number at the top (the "... out of 1,554,874 results" figure), without actually retrieving all of those documents.
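
Here is a toy Python sketch of that idea: an inverted index built once at indexing time, and a match count computed from the union of the per-term document sets. It is only an illustration on a made-up three-document collection, and I am assuming a query matches any document containing at least one of its terms. The names build_index and match_count are mine, and real engines surely do something far more elaborate (and probably only estimate the count).

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def match_count(index, query):
    """Number of documents containing at least one query term."""
    matched = set()
    for term in query.lower().split():
        # .get avoids creating empty entries for unknown terms
        matched |= index.get(term, set())
    return len(matched)


if __name__ == "__main__":
    docs = {
        1: "nirvana nevermind album",
        2: "nirvana buddhism",
        3: "grunge album",
    }
    index = build_index(docs)
    print(match_count(index, "nirvana album"))  # prints 3
```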

I remember building a simple search engine on the Reuters-21578 dataset (http://www.daviddlewis.com/resources/testcollections/reuters21578/); on a 3.02 GHz 64-bit machine with 2 GB of RAM, it returned 1000 documents in 250 milliseconds. Now I wonder whether the reason behind the quick retrieval by Google, Bing and Yahoo is distributed servers and complex algorithms, or just the simple thought that humans won't even go through 100 results for a search query, so let's retrieve only the top 1000.
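
To show why "only the top 1000" can be cheap, here is a small Python sketch that keeps just the k best-scoring documents with a heap instead of sorting every match. This is not the code from my Reuters experiment and not how any of these engines actually work; top_k and the scores dictionary are hypothetical, and the scores are assumed to be already computed.

```python
import heapq


def top_k(scores, k=1000):
    """Return the k highest-scoring (doc_id, score) pairs, best first.

    heapq.nlargest keeps at most k candidates at a time, so the cost
    grows with log k rather than with a full sort of all matches.
    """
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    scores = {"a": 0.2, "b": 0.9, "c": 0.5}
    print(top_k(scores, k=2))  # prints [('b', 0.9), ('c', 0.5)]
```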

1 comment:

  1. WoW!!
    You sure have unveiled what google was able to protect for decades!
    Soo elegant and yet so simple!
