Nutch…too much Nutch

- Rajat

Posted on Apr 24, 2005 in Programming


I spent all of yesterday going through the Nutch source code. Chris and Ashish helped me out, along with this link:
Dissecting the Nutch Crawler. Here is what I learned:
The file Fetcher.java has a reference to a “content” variable (of type Content). I found that initially only the URLs are stored during the crawl; a request is then sent for each of them. Based on the MIME type of the content that comes back, the ParserFactory class creates a parser (HTML parser, PDF parser, etc.). The code for these parsers can be found at nutch-0.6/src/plugin/. These plugins do the parsing and return the content as a “Parse” object. Using the Parse.getText() method (which we also found interesting) we can get the text content of any page!
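To make the flow above concrete, here is a rough sketch of the idea: pick a parser from the MIME type, parse the raw bytes, then ask the result for its text. Everything in it is my own stand-in, not the real Nutch code. The Parse and Parser interfaces, SimpleParserFactory, and extractText are hypothetical names that only mimic the shape described above, and the tag-stripping "HTML parser" is a toy, nothing like the actual plugins under nutch-0.6/src/plugin/.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: these types mimic the flow described in the post and are
// NOT the real Nutch classes or signatures.
public class ParserDispatchSketch {

    /** Stand-in for the parsed result; only the text accessor is modelled. */
    interface Parse {
        String getText();
    }

    /** Stand-in for a parser plugin: raw bytes in, Parse out. */
    interface Parser {
        Parse parse(byte[] rawContent);
    }

    /** Hypothetical factory that picks a parser from the MIME type. */
    static class SimpleParserFactory {
        private final Map<String, Parser> parsersByMimeType = new HashMap<>();

        void register(String mimeType, Parser parser) {
            parsersByMimeType.put(mimeType, parser);
        }

        Parser getParser(String mimeType) {
            Parser parser = parsersByMimeType.get(mimeType);
            if (parser == null) {
                throw new IllegalArgumentException("No parser for " + mimeType);
            }
            return parser;
        }
    }

    /** Builds a factory with two toy parsers registered. */
    static SimpleParserFactory defaultFactory() {
        SimpleParserFactory factory = new SimpleParserFactory();
        // Toy HTML "parser": replaces anything that looks like a tag with a
        // space, then collapses whitespace.
        factory.register("text/html", raw -> {
            String html = new String(raw, StandardCharsets.UTF_8);
            String text = html.replaceAll("<[^>]*>", " ")
                              .replaceAll("\\s+", " ")
                              .trim();
            return () -> text;
        });
        // Plain text passes through unchanged.
        factory.register("text/plain", raw -> {
            String text = new String(raw, StandardCharsets.UTF_8);
            return () -> text;
        });
        return factory;
    }

    /** MIME type -> parser -> Parse -> text, as in the paragraph above. */
    public static String extractText(String mimeType, byte[] rawContent) {
        Parser parser = defaultFactory().getParser(mimeType);
        Parse parse = parser.parse(rawContent);
        return parse.getText();
    }

    public static void main(String[] args) {
        byte[] page = "<html><body><h1>Nutch</h1><p>too much Nutch</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(extractText("text/html", page));
    }
}
```

The point of keeping the factory separate is that the fetching side never needs to know which formats exist; supporting a new format just means registering one more parser for its MIME type, which seems to be roughly what the plugin directory is for.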
