Nutch…too much Nutch


Yesterday I spent the whole day trying to go through the Nutch source code. Chris and Ashish helped me out, along with this link: Dissecting the Nutch Crawler. This showed me that:
The file Fetcher.java has a reference to the “content” variable (which is of type Content). I found that initially only the URLs are stored during the crawl, and then a request is sent. Based on the MIME type of the content returned, the ParserFactory class creates a parser (HTML parser, PDF parser, etc.). The code for these parsers can be found at nutch-0.6/src/plugin/. These plugins do the parsing and return the content as a “Parse” object. Using the Parse.getText() method (which we also felt was interesting), we can get the text content of any page!
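To make the flow above concrete for myself, here is a minimal sketch of the idea: pick a parser based on the MIME type, parse, then ask the result for its text. This is not the actual Nutch code. The Parse and Parser interfaces below are simplified stand-ins I made up, and ParserRegistry, defaultRegistry() and extractText() are hypothetical names. The "HTML parser" is just a crude tag-stripping regex, nothing like a real plugin.

```java
import java.util.HashMap;
import java.util.Map;

public class MimeDispatchSketch {

    /** Hypothetical stand-in for a parse result; models only getText(). */
    interface Parse {
        String getText();
    }

    /** Hypothetical stand-in for a parser plugin. */
    interface Parser {
        Parse parse(String rawContent);
    }

    /** Toy registry that maps a MIME type to a parser (an assumption, not Nutch's ParserFactory). */
    static class ParserRegistry {
        private final Map<String, Parser> parsers = new HashMap<>();

        void register(String mimeType, Parser parser) {
            parsers.put(mimeType, parser);
        }

        Parser forMimeType(String mimeType) {
            Parser parser = parsers.get(mimeType);
            if (parser == null) {
                throw new IllegalArgumentException("No parser for " + mimeType);
            }
            return parser;
        }
    }

    /** Builds a registry with two toy parsers. */
    static ParserRegistry defaultRegistry() {
        ParserRegistry registry = new ParserRegistry();
        // Crude HTML "parser": replace tags with spaces, then collapse whitespace.
        registry.register("text/html", raw -> {
            String text = raw.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
            return () -> text;
        });
        // Plain text needs no real parsing.
        registry.register("text/plain", raw -> () -> raw.trim());
        return registry;
    }

    /** Chooses a parser by MIME type, parses, and returns the extracted text. */
    static String extractText(String mimeType, String rawContent) {
        return defaultRegistry().forMimeType(mimeType).parse(rawContent).getText();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>Nutch</b></p></body></html>";
        System.out.println(extractText("text/html", html));
    }
}
```

The point of the registry is that the fetching side never needs to know which parser it is talking to; it only hands over the MIME type and the raw content, which seems to be the same reason the real parsers live in separate plugins.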


Nutching Nutching Nutching


I spent the whole day today analyzing the Nutch source code with Anshul. It is almost 8:00 pm now and nothing has been done yet! I have received an e-mail from Chris Mattman and Ashish Vaidya giving some pointers. Hopefully, it’s gonna help!
I had problems compiling the code as well. It’s strange that when I installed the j2sdk from the Sun Java site, I did not get javac in the /usr/java/jre1.5.0_02/bin directory. Since I did not have enough time to look for the files, I simply downloaded NetBeans and, at Daddu’s suggestion, got the thing compiled!
Will blog later once I have some success with Nutch.
Rajat’s Abode