Pennyroyal: Extraction of a summary of a URL (Facebook post link feature)

I initially thought of sharing the Ruby on Rails code I wrote for this particular functionality, but since I'm not a big fan of Rails, here is a simple step-by-step description of how to achieve it instead (by the way, this isn't rocket science).

First of all, you need a DOM parser (there are a lot of parsers available in various languages; choose the one you like).
Fetch the URL and parse the page into its DOM.
Now use XPath (or whatever DOM querying mechanism the library provides) to pull out the pieces you need.
If you have ever paid a bit of attention to the summary Facebook generates for a link, you'll notice that it consists of three things: i) a title, ii) a description, and iii) an image. Let me describe each of them individually.
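The fetch-and-parse steps above can be sketched in Python. This is only a minimal illustration, assuming the standard library is enough for your purposes: `html.parser` is an event-based parser rather than a full DOM with XPath, so the sketch just records every tag it sees. The names `TagCollector`, `fetch_html`, and `parse_page` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag and the <title> text."""

    def __init__(self):
        super().__init__()
        self.tags = []            # list of (tag_name, {attribute: value})
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def fetch_html(url):
    """Download a page and decode it (falls back to UTF-8 if no charset is sent)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse_page(html):
    """Feed the HTML through the collector and return it for querying."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector
```

With a real DOM library you would run XPath queries instead of walking `collector.tags`, but the idea is the same: get the page into a structure you can query for specific tags.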

Title: This is the easiest of all. You can simply extract the contents from the <title> tag under the <head> tag.
Description: This is also very easy to achieve. You can extract the description from the content attribute of the <meta name="description" content="..."> tag.
Images: Now, here is the tricky part. You need to extract all images available on the page, then choose the largest one (by height and width) and use that as the thumbnail. If you want to provide a "choose thumbnail" feature like Facebook's, simply extract all images with a height and width greater than 50px and show them in a simple slideshow. One might think this is time consuming, and I agree. There is a small hack, but it does not apply to all links: since Facebook introduced Open Graph, the websites that have moved to it use tags like "<meta property='og:image' ...>" to tell Facebook the representative image of the page. You can simply use the image given in this tag (if the tag exists) and save yourself the trouble of going through the complete DOM.
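Here is a rough Python sketch of the three extractions, again using only the standard library. It rests on one big assumption: image sizes are read from the `width` and `height` attributes in the HTML, whereas a real implementation would probably have to download the images to measure them. `SummaryParser`, `summarize`, and `min_size` are names invented for this example.

```python
from html.parser import HTMLParser


def _to_int(value):
    """Parse a width/height attribute; missing or non-numeric values count as 0."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


class SummaryParser(HTMLParser):
    """Hypothetical helper: gathers the title, description, og:image and <img> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.og_image = None
        self.images = []          # list of (src, width, height)
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            if (attrs.get("name") or "").lower() == "description":
                self.description = attrs.get("content")
            elif attrs.get("property") == "og:image":
                self.og_image = attrs.get("content")
        elif tag == "img" and attrs.get("src"):
            self.images.append(
                (attrs["src"], _to_int(attrs.get("width")), _to_int(attrs.get("height")))
            )

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html, min_size=50):
    """Return the title, description, thumbnail and thumbnail candidates of a page."""
    parser = SummaryParser()
    parser.feed(html)
    parser.close()
    # Images larger than min_size in both dimensions feed the "choose thumbnail" list.
    candidates = [img for img in parser.images if img[1] > min_size and img[2] > min_size]
    if parser.og_image:
        thumbnail = parser.og_image          # the Open Graph shortcut
    elif candidates:
        thumbnail = max(candidates, key=lambda img: img[1] * img[2])[0]
    else:
        thumbnail = None
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "thumbnail": thumbnail,
        "candidates": [img[0] for img in candidates],
    }
```

Calling `summarize(html)` on a page's HTML gives you a dictionary with the three pieces plus the list of candidate images for a slideshow. Picking the largest image by area (width times height) is just one way to read "largest"; you could equally compare by width alone.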

After this, all you need to do is display the extracted pieces in your own style.

I will try to post PHP code for this if I get time; otherwise I might even post the Rails code.


Wednesday, February 9, 2011
