Google Hadoop: A Study of Nutch Parameters

Administrator · posted 2013-9-13 14:20:52

Nutch lets you crawl a site or a collection of sites. If your objective is simply to crawl the content once, it is fairly easy. But if you want to continuously monitor a site and crawl updates, it can be harder, mainly because the Nutch documentation does not have many details about that.

After a bit of digging, I found that Nutch offers an AdaptiveFetchSchedule class that can be used for that purpose. To understand how this class works, let's recap how Nutch manages a crawl.

Nutch maintains a record on file of all the URLs that it has encountered while crawling. This record is called the crawl db. Initially, the crawl db is built from a list of URLs provided by the user using the inject command. An important concept in Nutch is the generate/fetch/update process. The generate command looks up in the crawl db all the URLs due for fetch and groups them into a segment. A URL is due for fetch if it is either a new URL or if it is time to re-crawl it. More on that later. The fetch command will, well, fetch from the web all the URLs of the segment. After that, the update command adds the results of the crawl (stored in the segment) to the crawl db. Each crawled URL is updated to indicate its fetch time and its next scheduled fetch. Newly discovered URLs are also added and marked as not fetched.
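The generate/fetch/update cycle described above can be sketched in a few lines of Python. This is only a toy model, not Nutch code: the `CrawlEntry` class, the function names, the in-memory `fake_web` dictionary and the fixed 30-day interval are all assumptions made for illustration.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds in a day


@dataclass
class CrawlEntry:
    fetched: bool = False    # new URLs start as "not fetched"
    fetch_time: float = 0.0  # time of the last fetch
    next_fetch: float = 0.0  # time of the next scheduled fetch


def inject(crawl_db, seed_urls):
    """Seed the crawl db with the user's URL list."""
    for url in seed_urls:
        crawl_db.setdefault(url, CrawlEntry())


def generate(crawl_db, now):
    """Build a segment: every URL that is new or due for a re-crawl."""
    return [url for url, entry in crawl_db.items()
            if not entry.fetched or entry.next_fetch <= now]


def fetch(segment, fake_web):
    """Stand-in for the real fetch: return the outlinks of each URL."""
    return {url: fake_web.get(url, []) for url in segment}


def update(crawl_db, results, now, interval=30 * DAY):
    """Merge the segment results back into the crawl db."""
    for url, outlinks in results.items():
        entry = crawl_db[url]
        entry.fetched = True
        entry.fetch_time = now
        entry.next_fetch = now + interval  # assumed constant interval
        for link in outlinks:
            crawl_db.setdefault(link, CrawlEntry())  # new, not fetched yet


# Hypothetical one-page site with a single outlink.
fake_web = {"http://example.com/": ["http://example.com/a"]}
crawl_db = {}
inject(crawl_db, ["http://example.com/"])

first_segment = generate(crawl_db, now=0)
update(crawl_db, fetch(first_segment, fake_web), now=0)
second_segment = generate(crawl_db, now=1)
print(first_segment, second_segment)
```

In this model the first segment contains only the seed, and the second contains only the page discovered through it, since the seed is not due again until its interval has elapsed.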

By default, Nutch sets the next scheduled fetch of a page to the fetch time plus a constant interval. The default value is 30 days, but it can be changed in the file nutch-site.xml via the db.fetch.interval.default property. On a later generate call, if the time has come, the URL is added to a segment and re-crawled. This default behavior can be acceptable if roughly all pages of a site change at approximately the same rhythm. But if the site being crawled contains a lot of pages that almost never change, you would probably want Nutch to visit these pages less often and concentrate on the ones that change frequently. That is not possible with the default fetch schedule, which uses the same constant interval for each URL.

Enter the adaptive fetch schedule. This fetch schedule adapts to the rhythm of changes of a page and sets the next fetch time accordingly. When a new URL is added to the crawl db, it is initially set to be re-fetched at the default interval. The next time the page is visited, the adaptive fetch schedule increases the interval before the next fetch if the page has not changed and decreases it if the page has changed. Note that a maximum and a minimum interval are defined in the configuration. The interval will never be longer than the maximum or shorter than the minimum. So after a while, the pages that change often will tend to be visited more than the ones that do not.
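As a rough sketch of that adjustment, assuming the interval is multiplied by (1 + inc_rate) when the page is unchanged and by (1 - dec_rate) when it has changed, then clamped to the configured bounds. The function name and the default values used here (0.4, 0.2, 60 seconds, 365 days) are assumptions for illustration; check your own Nutch configuration for the real values.

```python
def next_interval(interval, modified,
                  inc_rate=0.4, dec_rate=0.2,
                  min_interval=60.0, max_interval=365 * 24 * 3600.0):
    """Return the adjusted re-fetch interval in seconds.

    All default values are illustrative assumptions.
    """
    if modified:
        interval *= (1.0 - dec_rate)   # page changed: come back sooner
    else:
        interval *= (1.0 + inc_rate)   # page unchanged: wait longer
    # Never go below the minimum or above the maximum interval.
    return max(min_interval, min(max_interval, interval))


# A page that never changes drifts toward the maximum interval.
interval = 30 * 24 * 3600.0
for _ in range(20):
    interval = next_interval(interval, modified=False)
print(interval == 365 * 24 * 3600.0)
```

With these assumed rates, an unchanged page sees its interval grow by 40% per visit until it hits the ceiling, while a page that changes at every visit shrinks by 20% per visit down to the floor.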

db.fetch.schedule.class

The fetch schedule implementation class to use

db.fetch.interval.default

The default number of seconds between re-fetches of a page

db.fetch.schedule.adaptive.min_interval

The minimum number of seconds between re-fetches of a page

db.fetch.schedule.adaptive.max_interval

The maximum number of seconds between re-fetches of a page

db.fetch.schedule.adaptive.inc_rate

If a page is unmodified, the interval before the next fetch will be increased by this rate

db.fetch.schedule.adaptive.dec_rate

If a page is modified, the interval before the next fetch will be decreased by this rate

db.fetch.schedule.adaptive.sync_delta

If true, try to synchronize with the time of page change by shifting the next fetchTime by a fraction (sync_rate) of the difference between the last modification time and the last fetch time
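To show how these properties fit together, here is a small Python sketch that generates a nutch-site.xml fragment. The property names are the ones listed above and `org.apache.nutch.crawl.AdaptiveFetchSchedule` is the class that ships with Nutch, but the helper function and every value are illustrative assumptions, not recommendations.

```python
import xml.etree.ElementTree as ET


def build_nutch_site(properties):
    """Build a <configuration> element from a name -> value mapping."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root


# Illustrative values only; tune them for your own crawl.
settings = {
    "db.fetch.schedule.class": "org.apache.nutch.crawl.AdaptiveFetchSchedule",
    "db.fetch.interval.default": "2592000",                   # 30 days
    "db.fetch.schedule.adaptive.min_interval": "3600",        # 1 hour
    "db.fetch.schedule.adaptive.max_interval": "31536000",    # 365 days
    "db.fetch.schedule.adaptive.inc_rate": "0.4",
    "db.fetch.schedule.adaptive.dec_rate": "0.2",
    "db.fetch.schedule.adaptive.sync_delta": "false",
}

root = build_nutch_site(settings)
if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9+
    ET.indent(root)
print(ET.tostring(root, encoding="unicode"))
```

The printed `<property>` elements can be pasted inside the `<configuration>` element of nutch-site.xml.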

If a page was modified, the adaptive fetch schedule stores the last fetch time as the last modification time. Nutch uses that information in the If-Modified-Since header of the HTTP request for the next fetch. If the web server supports this and the page has not changed since, it returns only a 304 status code. Note that there is a bug in Nutch 1.0 that prevents this from working properly. I have reported the bug and it will be fixed for Nutch 1.1. You can use the trunk in the meantime.
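The conditional request itself is plain HTTP and can be sketched outside Nutch. The example below uses Python's standard library; the function names are hypothetical, and `fetch_if_modified` performs a real network request, so it is only defined here, not called.

```python
from email.utils import formatdate
import urllib.error
import urllib.request


def build_conditional_request(url, last_modified_epoch):
    """Build a GET request carrying an If-Modified-Since header."""
    request = urllib.request.Request(url)
    # HTTP dates are expressed in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    request.add_header("If-Modified-Since",
                       formatdate(last_modified_epoch, usegmt=True))
    return request


def fetch_if_modified(url, last_modified_epoch):
    """Return (modified, body); body is None when the server answers 304."""
    request = build_conditional_request(url, last_modified_epoch)
    try:
        with urllib.request.urlopen(request) as response:
            return True, response.read()
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified: nothing to download
            return False, None
        raise


request = build_conditional_request("http://example.com/", 0)
print(request.get_header("If-modified-since"))
```

A 304 answer carries no body, which is what saves bandwidth on pages that rarely change.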

How can Nutch detect whether a page has changed? Each time a page is fetched, Nutch computes a signature for the page. At the next fetch, if the signature is the same (or if a 304 is returned by the web server because of the If-Modified-Since header), Nutch knows that the page was not modified. By default the signature of a page is built not only from its content, but also from the HTTP headers returned with the page. So even if the content of a page has not changed, the signature changes if an HTTP header is not the same (like an ETag or a date). To solve that problem, there is the TextProfileSignature class. It is designed to look only at the text content of a page to build the signature. To use it, you need to set the db.signature.class property to org.apache.nutch.crawl.TextProfileSignature.
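The problem described above can be illustrated with two toy signature functions. Neither is Nutch's actual implementation: `signature_with_headers` simply hashes headers plus body to reproduce the issue, and `text_only_signature` is a much simplified stand-in for the idea behind TextProfileSignature (strip markup, normalize the words, hash the result).

```python
import hashlib
import re


def signature_with_headers(headers, body):
    """Hash the headers together with the body (illustrates the problem)."""
    digest = hashlib.md5()
    for name in sorted(headers):
        digest.update(f"{name}: {headers[name]}\n".encode("utf-8"))
    digest.update(body.encode("utf-8"))
    return digest.hexdigest()


def text_only_signature(body):
    """Hash only the normalized text content (simplified stand-in)."""
    text = re.sub(r"<[^>]+>", " ", body)      # drop HTML tags
    words = re.findall(r"\w+", text.lower())  # normalize case and spacing
    return hashlib.md5(" ".join(words).encode("utf-8")).hexdigest()


body = "<html><body><p>Hello, Nutch!</p></body></html>"
first = {"Date": "Mon, 01 Jan 2024 00:00:00 GMT"}
second = {"Date": "Tue, 02 Jan 2024 00:00:00 GMT"}

# Same content, different Date header: only the text signature is stable.
print(signature_with_headers(first, body) == signature_with_headers(second, body))
print(text_only_signature(body) == text_only_signature(body))
```

A signature that ignores volatile headers is what lets the adaptive schedule see an unchanged page as unchanged.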

A word about the setting db.fetch.schedule.adaptive.sync_delta. I set it to false for my crawls because I have not been able to really understand what it is good for. As I described earlier, the next fetch time is computed by adding a dynamic interval to the last fetch time. But with this setting set to true, the interval is applied to a reference time located between the last fetch time and the last modification time. If someone can enlighten me about the usefulness of this, please do!
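My reading of that behavior, as a sketch: the reference time is moved back from the last fetch time toward the last modification time by a fraction of the gap between them. The function name, the parameter names and the 0.3 default for the fraction are assumptions; this is not a copy of the Nutch code.

```python
def next_fetch_time(fetch_time, modified_time, interval,
                    sync_delta=False, sync_rate=0.3):
    """Return the next scheduled fetch time (all values in seconds).

    sync_rate is the assumed fraction of the gap between the last
    modification and the last fetch used to shift the reference time.
    """
    reference_time = fetch_time
    if sync_delta:
        delta = fetch_time - modified_time
        # Shift the reference back toward the modification time.
        reference_time = fetch_time - delta * sync_rate
    return reference_time + interval


# Page modified at t=0, fetched at t=1000, current interval 500 seconds.
print(next_fetch_time(1000, 0, 500))
print(next_fetch_time(1000, 0, 500, sync_delta=True, sync_rate=0.5))
```

With sync_delta off, the next fetch is simply the last fetch time plus the interval. With it on, the next fetch is scheduled earlier, by the chosen fraction of the time elapsed since the page last changed.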

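Nutch can crawl a single site or a collection of sites. If you only want to crawl the content once, it is fairly easy. If you want to monitor a site continuously and crawl its updates, it is harder, because the Nutch documentation does not explain this in detail.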

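After some digging, I found that Nutch provides an adaptive fetch schedule class for this purpose. To understand how this class works, let us first review how Nutch manages a crawl.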

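Nutch keeps a record of every URL it has encountered while crawling. This record is called the crawl db. Initially, the crawl db is built from a list of URLs supplied by the user. The generate/fetch/update process is an important concept in Nutch. The generate command looks up all the URLs in the crawl db that are due for fetch and groups them into a segment. A URL is due for fetch if it has never been crawled or if it is time to re-crawl it. More on that later. The fetch command then fetches all the URLs of the segment from the web. After that, the update command adds the crawl results (stored in the segment) to the crawl db. Each crawled URL is updated with its fetch time and its next scheduled fetch. Newly discovered URLs are also added to the crawl db and marked as not yet fetched.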

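By default, Nutch sets the next scheduled fetch of a page to the fetch time plus a fixed interval. The default value is 30 days, but it can be changed in nutch-site.xml through the db.fetch.interval.default property. On a later generate call, if the time has come, the URL is added to a segment and re-crawled. This default behavior is acceptable if all the pages of a site change at roughly the same rhythm. But if the site contains many pages that almost never change, you would want Nutch to visit those pages less often and concentrate on the pages that change frequently. That is not possible with the default schedule, which uses the same interval for every URL.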

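This is where the adaptive fetch schedule comes in. It adapts to how often a page changes and sets the next fetch time accordingly. When a new URL is added to the crawl db, it is first scheduled for re-fetch at the default interval. The next time the page is visited, the adaptive fetch schedule increases the interval before the next fetch if the page has not changed, and decreases it if the page has changed. Note that a maximum and a minimum interval are defined in the configuration. The interval never goes above the maximum or below the minimum. So over time, pages that change often are visited more than pages that do not.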

db.fetch.schedule.class

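The fetch schedule implementation class

db.fetch.interval.default

The default interval between re-fetches of a page

db.fetch.schedule.adaptive.min_interval

The adaptive schedule never uses an interval shorter than this minimum

db.fetch.schedule.adaptive.max_interval

The adaptive schedule never uses an interval longer than this maximum

db.fetch.schedule.adaptive.inc_rate

The rate by which the interval is increased if the page is unchanged

db.fetch.schedule.adaptive.dec_rate

The rate by which the interval is decreased if the page has changed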