Nutch is open-source web-search software. It builds on Lucene and Solr, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.
Maven coordinates: org.apache.nutch:nutch:2.0-dev