Friday, July 17, 2009

Nutch: Getting my Feet Wet

My motivation for learning Nutch is twofold. First, we are using Nutch for a number of our more recent crawls, so I figured it is something I should know about. Second, Nutch uses Hadoop Map-Reduce, so I figured I would pick up some Map-Reduce programming tips by looking at the Nutch sources.

This post describes my attempt to crawl this blog using Nutch and index it. It also describes a very simple plugin to filter URLs by pattern at index time.

Command line Usage

Nutch can be used as a search appliance that bundles crawling, indexing and search. Such bundling can be very useful if you just want to use Nutch with what it ships with (and it ships with plenty for most people), but I was more interested in writing some plugins for it, so my approach was to crawl once and then reuse the downloaded pages as input data to test my plugins.

Nutch's "crawl" subcommand is actually a combination of other subcommands, as detailed on this page. Before I ran a crawl, however, I needed to set up two things.

  1. Create Seed List.
  2. Customize Nutch Configuration.

Create Seed List

This is a flat file consisting of top level URL(s) for the site being crawled. In my case, I just have one URL as shown below.

http://sujitpal.blogspot.com/

Customize Nutch Configuration

Nutch's main configuration file is $NUTCH_HOME/conf/nutch-default.xml, where $NUTCH_HOME is the directory where you installed Nutch - in my case /opt/nutch-1.0. The recommended approach is to override the entries you need in conf/nutch-site.xml, which is what I did. Here is what my nutch-site.xml file looks like:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>testnutch</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>testnutch,*</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Nutch testing</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://sujitpal.blogspot.com</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>me@myprovider.net</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>
      myplugins|protocol-http|urlfilter-regex|parse-(text|html\
      |js)|index-(basic|anchor)|query-(basic|site|url)|\
      response-(json|xml)|summary-basic|scoring-opic\
      |urlnormalizer-(pass|regex|basic)
    </value>
  </property>
</configuration>

Here is the crawl command to do an initial crawl, all the way to the index, in one fell swoop. I figured I would run the individual subcommands separately the next time round; a rough sketch of that sequence appears after the crawl command below. You may have to experiment with the depth a bit - my depth of 4 was based on the depth of the navigation path to a post via the left navigation bar.

sujit@sirocco:~$ CRAWL_DIR=/home/sujit/tmp
sujit@sirocco:~$ cd /opt/nutch-1.0
sujit@sirocco:/opt/nutch-1.0$ bin/nutch crawl $CRAWL_DIR/seeds.txt \
  -dir $CRAWL_DIR/data -depth 4 2>&1 | tee $CRAWL_DIR/crawl.log
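
For reference, the "crawl" command is just a driver around individual subcommands. A rough sketch of the equivalent step-by-step sequence, adapted from the Nutch tutorial, is shown below - the paths follow my setup above, and only one generate/fetch/updatedb cycle is spelled out (a depth of 4 means four such cycles):

sujit@sirocco:/opt/nutch-1.0$ bin/nutch inject \
  $CRAWL_DIR/data/crawldb $CRAWL_DIR/seeds.txt
sujit@sirocco:/opt/nutch-1.0$ bin/nutch generate \
  $CRAWL_DIR/data/crawldb $CRAWL_DIR/data/segments
sujit@sirocco:/opt/nutch-1.0$ s1=`ls -d $CRAWL_DIR/data/segments/2* | tail -1`
sujit@sirocco:/opt/nutch-1.0$ bin/nutch fetch $s1
sujit@sirocco:/opt/nutch-1.0$ bin/nutch updatedb $CRAWL_DIR/data/crawldb $s1
# ... repeat generate/fetch/updatedb for each additional level of depth ...
sujit@sirocco:/opt/nutch-1.0$ bin/nutch invertlinks \
  $CRAWL_DIR/data/linkdb $CRAWL_DIR/data/segments/*
sujit@sirocco:/opt/nutch-1.0$ bin/nutch index \
  $CRAWL_DIR/data/indexes $CRAWL_DIR/data/crawldb \
  $CRAWL_DIR/data/linkdb $CRAWL_DIR/data/segments/*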

Once the crawl is done, $CRAWL_DIR/data looks something like the listing below.
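
Something like this - the timestamped segment directory names below are made-up examples (your crawl will produce its own):

sujit@sirocco:~/tmp$ ls -1 data
crawldb
index
indexes
linkdb
segments
sujit@sirocco:~/tmp$ ls -1 data/segments
20090717093001
20090717093210
20090717093655
20090717094820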

The crawldb contains information about the various URLs that were crawled, the linkdb stores the incoming links for each URL, segments is where the raw page contents are stored, indexes is where the initial partial indexes are created, and index is where the contents of indexes are merged into the final index.

You can quickly see how the crawl did by looking at its statistics. Here db_fetched is the number of URLs that were fetched successfully, db_unfetched indicates how many were discovered but not fetched for some reason (more on that later), and db_gone means the page could not be found at all (i.e., HTTP 404).

sujit@sirocco:/opt/nutch-1.0$ bin/nutch readdb \
  $CRAWL_DIR/data/crawldb -stats
CrawlDb statistics start: /home/sujit/tmp/data/crawldb
Statistics for CrawlDb: /home/sujit/tmp/data/crawldb
TOTAL urls: 446
retry 0: 446
min score: 0.0
avg score: 0.006143498
max score: 1.159
status 1 (db_unfetched): 115
status 2 (db_fetched): 222
status 3 (db_gone): 109
CrawlDb statistics: done

You can also get more detailed information by dumping out the crawldb and processing it. For example, I wanted to see which URLs were retrieved in the crawl, with a view to suppressing the non-blog-post pages. Here is what I did.

# dump out the crawldb to a structured text format
sujit@sirocco:/opt/nutch-1.0$ bin/nutch readdb \
  $CRAWL_DIR/data/crawldb -dump $CRAWL_DIR/reports_crawl
# see how many URLs were crawled
sujit@sirocco:~/tmp/reports_crawl$ cat part-00000 | \
  cut -f1 -d"\t" | \            ## \t is a ctrl-V ctrl-I
  grep -E "^http://" | wc -l
446
# see how many URLs there will be if we remove the ones we don't want
sujit@sirocco:~/tmp/reports_crawl$ cat part-00000 | \
  cut -f1 -d"\t" | \            ## \t is a ctrl-V ctrl-I
  grep -E "^http://" | \
  grep -vE "archive|label|feeds" | \
  grep ".html" | wc -l
146

Similar to the crawldb, one could dump out the linkdb as well. The linkdb contains information about the incoming links for each URL, which is used for link analysis. I didn't have a use for the linkdb results (not yet, at least). The command to dump the linkdb is:

sujit@sirocco:/opt/nutch-1.0$ bin/nutch readlinkdb \
  $CRAWL_DIR/data/linkdb -dump $CRAWL_DIR/reports_link
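
The raw page content itself lives in the segments. If you want to eyeball what was actually fetched, the readseg subcommand can dump a segment to text. A sketch, assuming the usual timestamped segment directory names (readseg writes a file named dump into the output directory):

sujit@sirocco:/opt/nutch-1.0$ s=`ls -d $CRAWL_DIR/data/segments/2* | tail -1`
sujit@sirocco:/opt/nutch-1.0$ bin/nutch readseg -dump $s $CRAWL_DIR/reports_seg
sujit@sirocco:/opt/nutch-1.0$ less $CRAWL_DIR/reports_seg/dump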

To run the indexing process separately from the crawl (perhaps to regenerate the index after changing or adding a plugin), you need to run the index, dedup and merge subcommands. The index subcommand writes to the indexes subdirectory, dedup removes duplicates there in place, and merge copies the merged version into the index subdirectory.

sujit@sirocco:/opt/nutch-1.0$ bin/nutch index \
  $CRAWL_DIR/data/indexes $CRAWL_DIR/data/crawldb \
  $CRAWL_DIR/data/linkdb $CRAWL_DIR/data/segments/*
sujit@sirocco:/opt/nutch-1.0$ bin/nutch dedup \
  $CRAWL_DIR/data/indexes
sujit@sirocco:/opt/nutch-1.0$ bin/nutch merge \
  -workingdir $CRAWL_DIR/data/work $CRAWL_DIR/data/index \
  $CRAWL_DIR/data/indexes

A Simple Plugin

Nutch's power comes from its plugin based architecture. Its core is quite small, but user-written plugin code can be plugged in to its various extension points. As this page says:

Since everybody can write a plugin, hopefully in future there will be a large set of plugins to choose from. At that point Nutch administrators will each be able to assemble their own search engine based on her/his particular needs by installing the plugins he or she is interested in... Each developer is focused on his/her own context. The core developers are able to write code for the nutch core engine and provide a described interface - a plug. A plugin developer is able to focus on the functionality of their specific plugin without worrying about how the system as a whole works. They only need to know what kind of data the plug and the plugin exchange. Since both sides are encapsulated nobody needs to take care of the integration of the other side...

To get familiar with how to write such a plugin, I decided on a really simple one that throws away pages that are not blog posts - these include the top page, all the archive pages, and the comment feeds. I had initially planned to make it a URL filter, but that would have meant that these pages would never be crawled, and I would then miss the post pages that are only reachable from them. So the plugin I describe filters out pages with these specific URL patterns at indexing time, i.e., it is an IndexingFilter.

I followed this step-by-step guide to writing a Nutch plugin. It is written for version 0.9, but as far as I can see it is accurate for version 1.0 as well. For development, I found this page very helpful for getting Nutch set up with all its plugins inside Eclipse - a starting point for writing my own.

The only extra JAR files I had to add were jai_core.jar and jai_codec.jar, which are listed in the Nutch 1.0 distribution's README.txt.

To hold my plugins, I created a plugin source directory called myplugins. The plugin consists of three files: an Ant build.xml file, a plugin.xml file that describes the plugin to Nutch, and the Java code for the plugin itself. The directory structure is the one recommended by the step-by-step guide.
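
For reference, here is the layout, reconstructed from the source path in the Java file's header comment and the guide's conventions:

src/plugin/myplugins/
  build.xml
  plugin.xml
  src/java/com/mycompany/nutch/indexing/InvalidUrlIndexFilter.java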

The build.xml just delegates to the plugin framework's build-plugin.xml, so it's basically a one-liner, as shown below:

<?xml version="1.0" encoding="UTF-8"?>
<project name="myplugins" default="jar">
  <import file="../build-plugin.xml"/>
</project>

The code for the InvalidUrlIndexFilter.java is shown below.

// Source: src/plugin/myplugins/src/java/com/mycompany/nutch/indexing/InvalidUrlIndexFilter.java
package com.mycompany.nutch.indexing;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.log4j.Logger;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

/**
 * This indexing filter removes "invalid" urls that have been crawled
 * (out of necessity, since they lead to valid pages), but need to be
 * removed from the index. The invalid urls contain the string 
 * "archive" (for archive pages which contain full text and links to
 * individual blog pages), "label" (tag based search result page with
 * full text of blogs labelled with the tag, and links to the individual
 * blog pages), and "feeds" (for RSS/Atom feeds, which we don't care
 * about, since they are duplicates of our blog pages). We also don't
 * care about the urls that are not suffixed with a .html extension.
 * @author Sujit Pal
 * @version $Revision$
 */
public class InvalidUrlIndexFilter implements IndexingFilter {

  private static final Logger LOGGER = 
    Logger.getLogger(InvalidUrlIndexFilter.class);
  
  private Configuration conf;
  
  public void addIndexBackendOptions(Configuration conf) {
    // NOOP
    return;
  }

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    if (url == null) {
      return null;
    }
    if (url.find("archive") > -1 ||
        url.find("label") > -1 ||
        url.find("feeds") > -1) {
      // filter out if url contains "archive", "label" or "feeds"
      LOGGER.debug("Skipping URL: " + url.toString());
      return null;
    }
    if (url.find(".html") == -1) {
      // filter out if url does not have a .html extension
      LOGGER.debug("Skipping URL: " + url.toString());
      return null;
    }
    // otherwise, return the document
    return doc;
  }

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}

The plugin.xml file that describes the filter to Nutch looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="myplugins" name="My test plugins for Nutch"
    version="0.0.1" provider-name="mycompany.com">

   <runtime>
     <library name="myplugins.jar">
       <export name="*"/>
     </library>
   </runtime>

   <extension id="com.mycompany.nutch.indexing.InvalidUrlIndexFilter"
       name="Invalid URL Filter"
       point="org.apache.nutch.indexer.IndexingFilter">
     <implementation id="myplugins-invalidurlfilter"
         class="com.mycompany.nutch.indexing.InvalidUrlIndexFilter"/>
   </extension>
</plugin>

To compile the plugin, I run ant in the myplugins directory, then go up a level and run ant again. This builds myplugins.jar and copies it, along with plugin.xml, to the build/plugins/myplugins directory.
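
In terms of commands, this is roughly what that amounts to - the path to the Nutch source tree is illustrative, substitute wherever you keep the Nutch sources:

sujit@sirocco:~/src/nutch-1.0/src/plugin/myplugins$ ant
sujit@sirocco:~/src/nutch-1.0/src/plugin/myplugins$ cd ..
sujit@sirocco:~/src/nutch-1.0/src/plugin$ ant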

I run Nutch from a different installation than the one I develop against, so I created a directory plugins/myplugins under /opt/nutch-1.0 (the install I run from) and copied these two files there.
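
In other words, something along these lines (again, the source-tree path is illustrative):

sujit@sirocco:~$ mkdir -p /opt/nutch-1.0/plugins/myplugins
sujit@sirocco:~$ cp ~/src/nutch-1.0/build/plugins/myplugins/myplugins.jar \
    ~/src/nutch-1.0/build/plugins/myplugins/plugin.xml \
    /opt/nutch-1.0/plugins/myplugins/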

The one other change I made was to turn on DEBUG level logging for my package in Nutch's conf/log4j.properties.

# Logging for development
log4j.logger.com.mycompany.nutch=DEBUG

To make Nutch recognize the new plugin, the next step is to add it to the plugin.includes property in nutch-site.xml. My nutch-site.xml file shown above already contains this change - essentially, I added the myplugins string to the regular expression in the value tag.

To test it, I rerun the index, dedup and merge commands, and I can see from the log files that URLs are indeed getting skipped. A quick check on the index generated before the plugin showed 166 records, while the index generated after the plugin has only 124. I also had a query (title:april) that used to return 4 archive pages for April and now returns no results, so my filter is doing its job.
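
As an aside, a quick command line spot check of the merged index is also possible with the NutchBean class - this assumes the searcher.dir property in nutch-site.xml points at $CRAWL_DIR/data (or that you run it from a directory whose crawl subdirectory is the crawl output); the query below is just an illustration:

sujit@sirocco:/opt/nutch-1.0$ bin/nutch org.apache.nutch.searcher.NutchBean april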

This was a fairly simple plugin, but now that I have it working, I have some more plugin ideas that I want to implement, which I will probably write about next week if I get them done.

27 comments (moderated to prevent spam):

Ami Titash said...

Thanks for the post. It finally got me started as well and helped me cover a lot of ground very fast. btw, I did some thinking on this whole thing and decided I want to investigate apache droids first, rather than go the plugin way for nutch. Somehow it seems to make more sense for my requirements - a strong crawler but with more control on the indexing.

With droids, I am encouraged on reading this http://incubator.apache.org/droids/#Why+was+it+created%3F


If it proves to be very early stage or limited, I will come back to nutch.

:)

~ Titash

Sujit Pal said...

Hi Titash, thanks for the pointer to the droids project, looks interesting, although I think the main draw of Nutch for me is its Hadoop core - so Nutch can perhaps be used as a standalone crawler, and the artifacts it generates can be used to build custom indexing strategies using map-reduce.

Anonymous said...

hi , i tried to compile your plugin but i have the following error :

E:\nutch-svn\trunk\src\plugin\myplugins>ant
Buildfile: E:\nutch-svn\trunk\src\plugin\myplugins\build.xml

init:
[mkdir] Created dir: E:\nutch-svn\trunk\build\myplugins
[mkdir] Created dir: E:\nutch-svn\trunk\build\myplugins\classes
[mkdir] Created dir: E:\nutch-svn\trunk\build\myplugins\test

init-plugin:

deps-jar:

compile:
[echo] Compiling plugin: myplugins
[javac] E:\nutch-svn\trunk\src\plugin\build-plugin.xml:111: warning: 'includ
eantruntime' was not set, defaulting to build.sysclasspath=last; set to false fo
r repeatable builds
[javac] Compiling 1 source file to E:\nutch-svn\trunk\build\myplugins\classe
s
[javac] E:\nutch-svn\trunk\src\plugin\myplugins\src\java\com\mycompany\nutch
\indexing\InvalidUrl\InvalidUrlIndexFilter.java:6: package org.apache.nutch.craw
l does not exist
[javac] import org.apache.nutch.crawl.CrawlDatum;
[javac] ^
[javac] E:\nutch-svn\trunk\src\plugin\myplugins\src\java\com\mycompany\nutch
\indexing\InvalidUrl\InvalidUrlIndexFilter.java:7: package org.apache.nutch.craw
l does not exist
[javac] import org.apache.nutch.crawl.Inlinks;
[javac] ^
[javac] E:\nutch-svn\trunk\src\plugin\myplugins\src\java\com\mycompany\nutch
\indexing\InvalidUrl\InvalidUrlIndexFilter.java:8: package org.apache.nutch.inde
xer does not exist
[javac] import org.apache.nutch.indexer.IndexingException;
[javac] ^
[javac] E:\nutch-svn\trunk\src\plugin\myplugins\src\java\com\mycompany\nutch
\indexing\InvalidUrl\InvalidUrlIndexFilter.java:9: package org.apache.nutch.inde
xer does not exist
[javac] import org.apache.nutch.indexer.IndexingFilter;
[javac] ^
[javac] E:\nutch-svn\trunk\src\plugin\myplugins\src\java\com\mycompany\nutch
\indexing\InvalidUrl\InvalidUrlIndexFilter.java:10: package org.apache.nutch.ind
exer does not exist
[javac] import org.apache.nutch.indexer.NutchDocument;
[javac] ^
[javac] E:\nutch-svn\trunk\src\plugin\myplugins\src\java\com\mycompany\nutch
\indexing\InvalidUrl\InvalidUrlIndexFilter.java:11: package org.apache.nutch.par
se does not exist
[javac] import org.apache.nutch.parse.Parse;
^
^
[javac] 13 errors

BUILD FAILED
E:\nutch-svn\trunk\src\plugin\build-plugin.xml:111: Compile failed; see the comp
iler error output for details.

Total time: 1 second

Sujit Pal said...

Hi, it looks like the imports are failing - I worked against Nutch 0.9; perhaps the code has been reorganized in the SVN version. I would check there.

Jonathan said...

Sujit, I encountered the same issue as described by the last user. Do you have any specific solution that can help me out? I thought you were using 1.0 (same here) not 0.9.

In addition, I used your codes in eclipse, and the codes look fine in the IDE.

Thanks.

Sujit Pal said...

Hi Jonathan, yes, looking back at the post and the directories on my computer, you are right, I am using Nutch 1.0. Going by your description of the problem (i.e., working fine in Eclipse, but not on the command line with Ant), I am guessing that nutch.jar is not getting into the classpath because of some build.xml issue. However, I don't remember having to change any build files for this, although it's been a while, so my memory may not be accurate. Can you take a look at the output of "ant -v ..." and see if that pinpoints the problem? If yes, I would appreciate you posting back here with the solution.

Jitendra said...

Hi,
Your blog was really useful and I could get my crawler up and running.

But log files are not being generated.

I am running job jar shipped with nutch distribution on EC2 with Hadoop AMI. Command I used is

hadoop jar .job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 1

Could you please help me, whats wrong here.

Thanks

Sujit Pal said...

Thanks Jitendra. For logging, you should take a look at Nutch's log4j.properties file - that controls the level and the logging destination.

Pratik K said...

Sir, How does the Nutch stores all the fetched files ie in which format?

Sujit Pal said...

Hi pratikk90, Nutch stores its output as Segment files - there is some information about it here.

Anonymous said...

Hi Sujit,
Thanks for the post. Regarding the plugin development I am trying to create a plugin based on http://wiki.apache.org/nutch/WritingPluginExample-0.9

I have successfully compiled that plugin but I think the Indexer doesn't get called. Then I looked at Indexer.java which gets called when you say /bin/nutch index. The index() method in this class don't call any external indexer. So how does my indexer is called? Am I missing something?

Sujit Pal said...

It's been a while since I trawled through the Nutch code in my IDE; basically the Indexer calls your plugin through a hook which gets populated from the contents of your plugin configuration.

Алексей said...

Hi Sujit!
When i try run ant, i got fallowing error:

root@alexey:/var/nutch/src/plugin/tasix-distinction# ant
Buildfile: build.xml

BUILD FAILED
/var/nutch/src/plugin/tasix-distinction/build.xml:3: The following error occurred while executing this line:
/var/nutch/src/plugin/build-plugin.xml:46: Problem: failed to create task or type antlib:org.apache.ivy.ant:settings
Cause: The name is undefined.
Action: Check the spelling.
Action: Check that any custom tasks/types have been declared.
Action: Check that any <presetdef>/<macrodef> declarations have taken place.
No types or tasks have been defined in this namespace yet

This appears to be an antlib declaration.
Action: Check that the implementing library exists in one of:
-/usr/share/ant/lib
-/root/.ant/lib
-a directory added on the command line with the -lib argument

Anonymous said...

Thank you!
I downloaded the latest version of Nutch and did everything in your article. The plugin builds, but it doesn't seem to take effect - pages are processed as usual, as if the plugin were not connected.

Sujit Pal said...

@alex_root: I looked at line 46 of build-plugin.xml in my own installation (Nutch 1.0), and it looks like it is just a path id which is set by individual plugins to add specific JAR dependencies for that plugin (JARs that the plugin needs and which are not supplied by the Nutch classpath). I could not find any plugin called tasix-distinction, so I am guessing it is a third-party or your own custom plugin. If third-party, my guess would be that the ant/ivy error is a red herring - it is probably trying to get the specific JARs from some ivy repository and not finding them. Although I don't know much about ivy, so I can't say for sure.

@Anonymous: Given that your plugin doesn't get called, could it be that nutch-site.xml doesn't have your plugin in its plugin.includes? Another place to look would be the class names in your plugin.xml for your plugin.

Unknown said...

Hello Dear:
your post is useful and great
i want to ask you :what should i write in the plugin to remove certain word from the index or crawl words so it doesn't appear when search on it??
what is the java code??

Sujit Pal said...

Thanks Hala. My post here (next one to this one) may provide you with a possible approach. Basically create a custom ParseFilter that gets the body using Content.getContent(), applies a regular expression with your word(s) and replaces with spaces or 'xxx' or something, then sets the body back into Content using Content.setContent().

Unknown said...

I'm writing Java code for a Nutch plugin to remove the movements (diacritics) from Arabic words in the indexer. I used the same code you posted, but I don't know what the error in it is. This is the code:
package com.mycompany.nutch.indexing;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.log4j.Logger;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
//import org.apache.nutch.parsedData.parsedData;


public class InvalidUrlIndexFilter implements IndexingFilter {

private static final Logger LOGGER =
Logger.getLogger(InvalidUrlIndexFilter.class);

private Configuration conf;

public void addIndexBackendOptions(Configuration conf) {
// NOOP
return;
}

public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
CrawlDatum datum, Inlinks inlinks) throws IndexingException {
if (url == null) {
return null;
}


string parsedData =parse;
char[] parsedData = input.trim().toCharArray();
for(int p=0;p<parsedData.length;p++)
if(!(parsedData[p]=='َ'||parsedData[p]=='ً'||parsedData[p]=='ُ'||parsedData[p]=='ِ'||parsedData[p]=='ٍ'||parsedData[p]=='ٌ' ||parsedData[p]=='ّ'||parsedData[p]=='ْ' ||parsedData[p]=='"' ))
new String.append(parsedData[p]);

return doc.add("value",parsedData);
}

public Configuration getConf() {
return conf;
}

public void setConf(Configuration conf) {
this.conf = conf;
}
}
if(!(parse.getData()[p]=='َ'||parse.getData()[p]=='ً'||parse.getData()[p]=='ُ'||parse.getData()[p]=='ِ'||parse.getData()[p]=='ٍ'||parse.getData()[p]=='ٌ' ||parse.getData()[p]=='ّ'||parse.getData()[p]=='ْ' ||parse.getData()[p]=='"' ))
new String.append(parse.getData()[p]);

return doc;
}

public Configuration getConf() {
return conf;
}

public void setConf(Configuration conf) {
this.conf = conf;
}
}

I think that the error is in using string parsedData =parse; but I don't know what I should use instead of it?
please sujit help me you are greate

Anonymous said...

please can you tel us how to create a plugin for the wordnet library
in jaws api to make nutch work with the wordnet

it like query-wordnet

src/plugin/query-wordnet

thanks

Sujit Pal said...

Hi Hala, apologies for the late reply, blogger thought your comment was spam (I do check both the spam and the input queues now). In any case, I think it may be good to use an HtmlParseFilter to get at the Content object and parse out the characters there, then stuff this into the metadata where it can be picked up by your IndexingFilter. Take a look at this post for more information.

Sujit Pal said...

Hi Anonymous, I haven't done much with nutch query filters. I take it you want to implement synonym expansion with this right? You may want to take a look at the Writing Plugin Example wiki page for some pointers.

David said...

Hi Sujit,
Do you have any idea on how to deal with deadlinks in nutch? say for example I have crawled a blog site today and all the documents are indexed to solr. Tomorrow one of the blog in the above blog site is deleted which mean that one of the URL indexed yesterday is no more working today! So how do I update solr indexes such that this particular blog doesn’t come in search results? Recrawling the site didn’t delete this record in solr is what I observed. I am using nutch 1.5.1 binary version.

Thanks
David

Sujit Pal said...

Nutch periodically recrawls sites from scratch (30 days is the default setting, and if you enable adaptive fetching the interval will gradually move higher or lower depending on how static the page is). So when this happens, pages which come back with a 404 should be removed (although I have never had a need for this myself, so I can't say for sure whether it happens).

Another approach we used (this was before Nutch) was to go through our index and check each page with a simple HEAD request. Each time the status came back as a 404 we would mark a database table - after 3 marks the page would be blacklisted so it didn't show up in the search results. But you would still not catch dead pages on sites that return a 200 status with a custom "not found" page.

Manali said...

Hello Sujit,

I am currently trying crawl the web using nutch 1.11 trunk version from https://github.com/apache/nutch

I am trying to use a particular property from the nutch-default.xml named:

<property>
  <name>http.agent.rotate</name>
  <value>false</value>
  <description>
    If true, instead of http.agent.name, alternating agent names are
    chosen from a list provided via http.agent.rotate.file.
  </description>
</property>

<property>
  <name>http.agent.rotate.file</name>
  <value>agents.txt</value>
  <description>
    File containing alternative user agent names to be used instead of
    http.agent.name on a rotating basis if http.agent.rotate is true.
    Each line of the file should contain exactly one agent
    specification including name, version, description, URL, etc.
  </description>
</property>

This is how I have modified my nutch-site.xml (not including other basic properties)


<property>
  <name>http.agent.rotate</name>
  <value>true</value>
  <description>
    If true, instead of http.agent.name, alternating agent names are
    chosen from a list provided via http.agent.rotate.file.
  </description>
</property>

<property>
  <name>http.agent.rotate.file</name>
  <value>agents.txt</value>
  <description>
    File containing alternative user agent names to be used instead of
    http.agent.name on a rotating basis if http.agent.rotate is true.
    Each line of the file should contain exactly one agent
    specification including name, version, description, URL, etc.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library. Set parsefilter-naivebayes for classification based focused crawler.
  </description>
</property>

This is how my agents.txt file looks:
NutchTry1
NutchTry2
NutchTry3
NutchTry4
NutchTry5

and it is stored inside the runtime/local/conf folder.

But when I check my logs, the agent name doesn't seem to change, even though protocol-http is activated via the plugin.includes property.

Could you please suggest what changes I could try or correct something that I may have configured incorrectly. I couldn't find any documentation.

Thanks,
Manali

Sujit Pal said...

Hi Manali, nothing jumps out at me here - things look correct. However, it's been a while since I used Nutch and I am not familiar with the rotating agent functionality, so me not seeing anything wrong probably doesn't mean much. You will very likely get much better answers if you post this message (unchanged, actually, since the report is so detailed) to the nutch-users list, maybe even from the author of protocol-http :-).

Nikhil said...

Hi Sujit,
I want to know if we can perform lexical analysis on crawled contents of nutch. I want to extract data from particular tags.

Sujit Pal said...

Hi Nikhil, metadata extraction from tags in the page could be done inline. I have a post here that does something similar, maybe that might help you.