
intern challenge - 1

Problem:

Given a URL, find the location of its RSS feed and show the corresponding posts.
Bonus: Provide context by finding posts related to a term or a given query.

Steps:

1. Find the various feeds for the URL using a parser or a similar web service
2. Take a feed and show its posts
3. Finally, host the application
4. Bonus: pass in the context using the HOVER APIs
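
As a rough illustration of step 2 and the bonus, here is a minimal Python sketch that lists the posts in an RSS 2.0 feed and keeps the ones related to a term. It is not the code used for the challenge: the sample feed, its example.com URLs, and the function names list_posts and related_posts are all made up for illustration, and "related" here is only a plain case-insensitive keyword match.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, inlined so the sketch runs without network access.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>Parsing HTML in Erlang</title>
      <link>http://example.com/erlang-html</link>
      <description>Notes on Erlang parsers.</description>
    </item>
    <item>
      <title>Building a pipe</title>
      <link>http://example.com/pipes</link>
      <description>A short Yahoo Pipes walkthrough.</description>
    </item>
  </channel>
</rss>"""


def list_posts(rss_text):
    """Return one {title, link, description} dict per RSS 2.0 <item>."""
    root = ET.fromstring(rss_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return posts


def related_posts(posts, term):
    """Keep posts whose title or description mentions the term (case-insensitive)."""
    needle = term.lower()
    return [p for p in posts
            if needle in p["title"].lower() or needle in p["description"].lower()]


if __name__ == "__main__":
    for post in related_posts(list_posts(SAMPLE_RSS), "erlang"):
        print(post["title"], "-", post["link"])
```

Atom feeds use different element names (entry instead of item, plus an XML namespace), so a real application would need a second code path or a feed parsing library.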

Solutions:

· Using server side scripting (Perl/PHP/Erlang) to discover RSS / Atom feeds.
· Using Yahoo Pipes to auto discover RSS / Atom feeds.
· Using YQL to fetch the links to RSS.
· Using Google Ajax APIs to find Feeds on Zembly or Appjet or a Hoverlet.

Using PHP to discover RSS/Atom feeds

We have been using PHP to develop several of our projects, such as the WordPress plugin wp-hover, so we decided to start with PHP.

First, this is how a typical RSS link appears in a page:

<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="http://developers.hover.in/blog/feed/" />
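
To make the discovery step concrete, here is a minimal Python sketch that scans a page for link tags like the one above, using only the standard library. It is an illustration rather than the approach taken in this post: the names FeedLinkParser and find_feed_links are hypothetical, the Atom link in the sample page is invented, and the sketch assumes a feed is advertised with rel="alternate" and an RSS or Atom MIME type.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types treated as feeds in this sketch (an assumption, not an exhaustive list).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects the href of every <link rel="alternate"> tag with a feed MIME type."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> also end up here.
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        href = attr.get("href", "")
        if "alternate" in rel and attr.get("type", "").lower() in FEED_TYPES and href:
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def find_feed_links(html_text, base_url=""):
    parser = FeedLinkParser(base_url)
    parser.feed(html_text)
    return parser.feeds


# Sample page; the Atom link is made up to show a relative href.
SAMPLE_PAGE = """<html><head>
<link rel="stylesheet" type="text/css" href="/style.css" />
<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="http://developers.hover.in/blog/feed/" />
<link rel="alternate" type="application/atom+xml" title="Atom" href="/blog/feed/atom/" />
</head><body></body></html>"""

if __name__ == "__main__":
    for url in find_feed_links(SAMPLE_PAGE, "http://developers.hover.in"):
        print(url)
```

In a real application the page text would first be fetched over HTTP; that part is left out so the sketch runs offline.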

We tried to discover the feeds using Zend_Feed::findFeeds(<URL>), but it did not return the RSS links as expected. At times it printed the whole content of the body; at other times it returned DOM element objects which, when var_dumped, contained just integers and some metadata. Had that function returned the feed URL, we could easily have parsed it with Zend_Feed::import(<URL of the discovered RSS>).

The pseudocode for the solution in PHP:

// Find the feeds advertised by the page
$feedArray = Zend_Feed::findFeeds('http://developers.hover.in');
// Go through the discovered feeds
foreach ($feedArray as $item) {
    // Get the RSS feed
    $rss = Zend_Feed::import($item);
    // Loop through the items in the feed
    // and print them out
}

Since the above didn’t work as expected, we looked at other options.

Using Perl to discover RSS/Atom feeds

Similarly, in Perl, which is known for its string and regex handling, we found Feed::Find for this purpose, and it worked much better. So we set up Perl (and its long list of dependencies) as follows:

# Installing Perl with CPAN modules (Ubuntu), using apt-get:
sudo apt-get install build-essential libssl-dev libc6-dev perl yaml-mode
# followed by:
perl -MCPAN -e 'install Feed::Find'
# or start the cpan shell and run the command 'install Feed::Find'

Perl Script:

#!/usr/local/bin/perl
use strict;
use warnings;
use Feed::Find;

# Find every feed advertised by the page and print the URLs
my @feeds = Feed::Find->find('http://trak.in/');
print "@feeds";

Using the above script we were able to find all the RSS / Atom links for the given URL. By default it finds all kinds of feeds. It would have been better if the find function took a second argument, <FEED_MIME_TYPE>; currently you need to set this in the module source itself.
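
The kind of optional MIME type argument wished for above could look roughly like this. The sketch is in Python rather than Perl and is not part of Feed::Find; filter_feeds, the (MIME type, URL) pair layout and the example.com URLs are hypothetical.

```python
def filter_feeds(links, mime_type=None):
    """Return feed URLs from (mime_type, url) pairs.

    With mime_type=None every feed is returned, mirroring the
    "find all kinds of feeds" default described above.
    """
    if mime_type is None:
        return [url for _, url in links]
    wanted = mime_type.lower()
    return [url for kind, url in links if kind.lower() == wanted]


# Made-up discovery results for illustration.
DISCOVERED = [
    ("application/rss+xml", "http://example.com/feed/"),
    ("application/atom+xml", "http://example.com/feed/atom/"),
]

if __name__ == "__main__":
    print(filter_feeds(DISCOVERED))
    print(filter_feeds(DISCOVERED, "application/atom+xml"))
```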

Since we are interested in a web application, the Perl code can be run from a web server as a CGI script.

Using Erlang to discover RSS/Atom feeds

Since Erlang is predominantly used at hover.in, we explored the same in Erlang. MochiWeb is an Erlang library for building lightweight HTTP servers. It is an open source project that has been used in big projects like CouchDB and ErlyWeb, so we were encouraged to test out its parsing capabilities.

XPath is a language for addressing parts of an XML document, which it models as a tree of element nodes, attribute nodes and text nodes.

Reading through a post on pplov’s blog, we learned that he had contributed an XPath parser for MochiWeb, which can be downloaded from the MochiWeb Google group.

Code:

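-module(find).
-export([feeds/1]).

%% Return the href of the first RSS <link> found on the page at Url.
feeds(Url) ->
    %% http:request/1 needs inets to be started (httpc:request/1 on newer OTP releases)
    {ok, {_, _Headers, Body}} = http:request(Url),
    Tree = mochiweb_html:parse(Body),
    Xpath = "//link[@type='application/rss+xml']",
    [{_Tag1, Attributes1, _Content1} | _Rest] = mochiweb_xpath:execute(Xpath, Tree),
    %% Pick the href attribute out of the tag's attribute list
    BinUrl = lists:foldl(
         fun({<<"href">>, Href}, _Prev) -> Href;
            (_Else, Prev) -> Prev
         end, error, Attributes1),
    binary_to_list(BinUrl).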

To run the above code:

1> application:start(inets).
ok
2> find:feeds("http://developers.hover.in").
"http://developers.hover.in/blog/feed/"
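
For readers without an Erlang setup, the same XPath idea can be tried in Python with the standard library's ElementTree, which supports a small XPath subset. This is only a sketch under assumptions: the sample page is made up, the function name feeds simply mirrors the Erlang one, and unlike mochiweb_html the ElementTree parser only accepts well-formed XML, so it would fail on much real-world HTML. It also returns every match rather than just the first.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page so the sketch runs offline.
SAMPLE_PAGE = """<html>
  <head>
    <title>hover.in developers</title>
    <link rel="stylesheet" type="text/css" href="style.css" />
    <link rel="alternate" type="application/rss+xml" title="RSS 2.0"
          href="http://developers.hover.in/blog/feed/" />
  </head>
  <body><p>Hello</p></body>
</html>"""


def feeds(page_text):
    """Return the href of every <link> whose type is application/rss+xml."""
    root = ET.fromstring(page_text)
    # ElementTree wants a relative path, hence the leading "." before "//link".
    xpath = ".//link[@type='application/rss+xml']"
    return [link.get("href") for link in root.findall(xpath)]


if __name__ == "__main__":
    print(feeds(SAMPLE_PAGE))
```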

Conclusion:
PHP, Perl, Erlang, or similar modules from other languages are all definitely viable options, but we decided to also check out solutions that have come up more recently, such as Pipes, YQL and the Google AJAX APIs, which could be hosted on environments like Appjet (AppJet is a website that allows users to create web-based applications in their web browser), Zembly or Hoverlets, hover.in’s own hovering widget hosting environment. But more on the application hosting later. First let’s try other methods to discover feeds from a webpage.

Using Yahoo Pipes to auto discover RSS/Atom feeds

Yahoo! Pipes is a web application from Yahoo! that provides a graphical user interface for building data mashups that aggregate web feeds, web pages, and other services. hover.in has always been a big fan of Pipes and of showcasing it.

[Screenshot: findfeeds]
The above snapshot shows how simple it is to find all the RSS and Atom feeds for a given URL. You can run the findFeeds pipe at http://pipes.yahoo.com/pipes/pipe.info?_id=00ff6bb493d2785b7594eea76e55c988.

[Screenshot: showfeeds]

The above snapshot is a clone of the first pipe, extended to show the posts of an RSS feed for any given URL. You can run the showRelated pipe at http://pipes.yahoo.com/pipes/pipe.info?_id=_iMHbOhG3hGjsadmgQSecQ as well as view its source.

To see how far we could depart from traditional implementations, the final pipe was hosted on Appjet and hence aptly called pipes-on-a-jet ; ) This is how it looks in the Appjet IDE:
[Screenshot: appjet1]
Examples of using pipesOnAJet:

  1. search for kolkata knight riders on trak.in
  2. search for demo on techcrunch

Using Yahoo! Query Language to discover RSS/Atom feeds

YQL (Yahoo! Query Language) is an expressive SQL-like language that lets you query, filter, and join data across Web services.

Running “select * from data where URL='http://developers.hover.in'” gave the content of the body tag of that page. One piece of feedback we have is that we were not able to get the content of the head element; we look forward to this feature in upcoming builds. So we tried another query with the help of Open Data Tables (Open Data Tables enable developers to add tables for any data on the Web to our stable of API-specific tables), which returned all the link tags in the given URL. But this was not sufficient, since we had to find the links to all the RSS feeds. Playing around with it, we found that a clever hack was to provide a minimal CSS selector, and we got this!
[Screenshot: yql1]
Final statement:

use 'http://yqlblog.net/samples/data.html.cssselect.xml' as data.html.cssselect;
select * from data.html.cssselect where url="<URL>" and css="link"

You can execute this here:

Using Google AJAX APIs to discover RSS/Atom feeds

Google’s AJAX APIs let you implement rich, dynamic web sites entirely in JavaScript and HTML. You can add a map to your site, a dynamic search box, or download feeds with just a few lines of JavaScript. Unlike most javascript libraries out there - this one focuses more on data and less on the typical UI capabilities.

[Screenshot: google12]
Google Ajax libraries snippet to discover RSS and show the post

google.load("feeds", "1");

function OnLoad() {
   var query;
   // Query for finding posts related to erlang on the dev blog
   query = 'site:http://developers.hover.in/ erlang';

   // OR Query to find related posts to the hovered word within a hoverlet
   // query = 'site:' + HOVER.site +' '+ HOVER.kw;

   google.feeds.findFeeds(query, findDone);
}

function findDone(result) {
   //traverse and print out
}

google.setOnLoadCallback(OnLoad);

You can find the above code running live here. It has been used for a “related posts from your blog” hoverlet and is being used on sites like trak.in to show posts related to Kolkata Knight Riders, and basically any word that the hover.in user specifies in his dashboard.

And finally, here’s the result of a couple of days of hacking around with perl, php, erlang+yaws, cgi, y!pipes, y!ql, appjet and google apis. To top it all, I think it’s safe to say that it took longer to edit this post though! ; )

[Screenshot: hoverlet]

You can look forward to more posts that deal with web apps, APIs and hosting environments, apart from the official HOVER API documentation, to be announced soon, which will enable you to build your own contextual applications. Sign up, get in touch with us for more, or follow hover.in on twitter.

~
for hover.in

Kanchan, Ravi, Zeeshan
( o8-o9 hover.in developer interns from Symbiosis, Pune)

4 Comments

  1. Arun — May 25, 2009 #

    Hey Guys…excellent work. The hoverlet created by you is going to be a boon to bloggers. Keep up the good work !

  2. Eben — May 29, 2009 #

    Nice Post!! Very Informative and educative!!

  3. Sam — May 29, 2009 #

    You can use YQL to get that element by using the xpath key:

    select * from html where url='http://developers.hover.in' and xpath='//head/link[contains(@type, "rss")]'

    Sam

  4. Prasad — June 3, 2009 #

    Hi guys,

    I would like to meet you regarding this. Have sent you the mail address and reply to that.
