NAME
Gungho - High-Performance Web Crawler Framework
SYNOPSIS
use Gungho;
Gungho->run($config);
DESCRIPTION
Gungho is a high-performance web crawler framework. It is developed with the goal of handling HTTP quickly while keeping a flexible structure that is easy to extend.
Gungho is currently in beta. Its features and specifications are becoming relatively stable, but please note that internal APIs may still change significantly.
Installing Gungho automatically gives you access to the following features:
- Event-driven asynchronous engine
Gungho crawls using asynchronous engines based on POE, Danga::Socket, IO::Async, and others. Choose the engine that fits your needs.
- Asynchronous DNS resolution
Since HTTP communication is asynchronous, DNS lookups can of course be asynchronous too. Gungho keeps processing other work without blocking while a DNS lookup is in progress.
- Automatic robots.txt handling
Every crawler should process robots.txt correctly and stay away from URLs it disallows. Gungho takes care of this relatively tedious robots.txt handling automatically. Combined with memcached, it can also be used in a distributed environment.
- Robot directives in META tags
Robot directives are control instructions for robots embedded in HTML META tags. Gungho parses these directives automatically and makes them available to the user.
- Throttling
Overloading the sites you crawl and bringing them down defeats the whole purpose. With the throttling modules, Gungho can limit the number of requests it sends.
- Blocking private IP addresses
If the DNS of a site you crawl is misconfigured, or if someone deliberately embeds such URLs, requests can end up pointing at IP addresses inside your own internal network and may cause a DoS. Gungho watches for this security risk.
- Caching
If you want a cache like the one in Catalyst, simply using the Cache component lets you work with caches from within your program.
- Web::Scraper support
Web::Scraper can be used easily from within Gungho (this feature is not yet stable).
- Request logging
With the RequestLog plugin, the URLs being fetched can be logged automatically.
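The private-IP blocking listed above comes down to refusing requests whose resolved address is loopback or falls in an RFC 1918 private range. The following is a minimal sketch of that address check alone, in plain Perl. The function name is_private_ipv4 is hypothetical, and this is not the code of Gungho::Component::BlockPrivateIP; consult that component's documentation for its actual behavior.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Gungho): returns 1 if a dotted-quad
# IPv4 address is loopback or in an RFC 1918 private range, 0 otherwise.
sub is_private_ipv4 {
    my ($ip) = @_;

    # Split into four numeric octets; anything else is not an IPv4 address.
    my @octets = $ip =~ /\A(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/;
    return 0 unless @octets == 4;
    return 0 if grep { $_ > 255 } @octets;

    return 1 if $octets[0] == 10;                                  # 10.0.0.0/8
    return 1 if $octets[0] == 127;                                 # 127.0.0.0/8 (loopback)
    return 1 if $octets[0] == 172
             && $octets[1] >= 16 && $octets[1] <= 31;              # 172.16.0.0/12
    return 1 if $octets[0] == 192 && $octets[1] == 168;            # 192.168.0.0/16
    return 0;
}

# Show the decision a crawler might make for a few resolved addresses.
for my $ip (qw(10.1.2.3 172.20.0.1 192.168.0.10 93.184.216.34)) {
    printf "%-15s %s\n", $ip, is_private_ipv4($ip) ? 'blocked' : 'allowed';
}
```

A check like this has to run on the address returned by DNS, not on the hostname in the URL, because the risk described above comes from public hostnames that resolve to internal addresses.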
HISTORY
First there were a bunch of scripts that were used to scrape a bunch of RSS feeds. Then I got tired of writing scripts, so I decided a framework was the way to go, and Xango was born.
Xango was my first attempt at harnessing the full power of an event-based framework. It was fast. It wasn't fun to extend. It had a nightmarish way of dealing with robots.txt.
A couple more attempts later, with inspiration and lessons learned from Catalyst, Plagger, and DBIx::Class, Gungho was born.
Since its inception, Gungho has been used successfully in crawlers that fetch anywhere from hundreds of thousands to a few million URLs per day.
PLEASE READ BEFORE USE
Gungho is designed to handle massive amounts of traffic. If you're careful enough with your Provider and Handler implementation, you can in fact hit millions of URLs with this crawler.
So PLEASE DO NOT LET IT LOOSE. DO NOT OVERLOAD your crawl targets. You are STRONGLY advised to use Gungho::Component::Throttle to throttle your fetches.
Also PLEASE CHANGE THE USER AGENT NAME OF YOUR CRAWLER. If you hit your targets hard with the default name (Gungho/VERSION X.XXXX), it will look as though a service called Gungho is hitting their site, which really isn't the case. Whatever your crawler is, please specify at least a simple user agent in your config.
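As a rough illustration of the two requests above, the config hash below names the crawler and loads a throttling component. It is a sketch under assumptions: the user_agent key is assumed to be the setting that controls the crawler's name, the agent string and URL are placeholders, and the provider, handler, and throttle settings a real crawl needs are left out.

```perl
use strict;
use warnings;

# ASSUMPTION: 'user_agent' is the config key for the crawler's name.
# The agent string and contact URL are placeholders; use your own.
my $config = {
    user_agent => 'MyCrawler/0.01 (+http://example.com/crawler.html)',

    # Component name taken from the COMPONENTS list in this manual.
    components => [ 'Throttle::Simple' ],

    # Throttle settings belong under a 'throttle' key; see the
    # component's own documentation for the supported options.
};

# With Gungho installed (and a provider and handler configured),
# the crawl would be started with:
#   use Gungho;
#   Gungho->run($config);
print "crawling as: $config->{user_agent}\n";
```

Including a contact URL in the agent string gives site operators a way to reach you if your crawler misbehaves.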
STRUCTURE
Gungho consists of three parts: a Provider, which supplies Gungho with requests to process; a Handler, which handles each fetched page; and an Engine, which controls the entire process.
There are also "hooks". These hooks can be registered from anywhere by invoking the register_hook() method. They are run at particular points, which are specified when you call register_hook().
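The register-then-run pattern behind hooks can be sketched with a toy registry. Everything in the ToyCrawler package below is hypothetical and stands in for Gungho only to show the mechanism; the arguments Gungho actually passes to hook callbacks are not specified here and are assumed for illustration. The hook names are the ones listed under HOOKS.

```perl
use strict;
use warnings;

# Toy stand-in for the crawler class. NOT Gungho's implementation.
package ToyCrawler;

my %hooks;    # hook name => array ref of callbacks, in registration order

# Accepts name => callback pairs and appends each callback to its hook.
sub register_hook {
    my ($class, @pairs) = @_;
    while (my ($name, $code) = splice(@pairs, 0, 2)) {
        push @{ $hooks{$name} }, $code;
    }
}

# Calls every callback registered under $name, in order.
sub run_hook {
    my ($class, $name, @args) = @_;
    $_->($class, @args) for @{ $hooks{$name} || [] };
}

package main;

my @seen;
ToyCrawler->register_hook(
    'engine.send_request'    => sub { my ($c, $url) = @_; push @seen, "send $url" },
    'engine.handle_response' => sub { my ($c, $url) = @_; push @seen, "done $url" },
);

# An engine would fire these around each fetch.
ToyCrawler->run_hook('engine.send_request',    'http://example.com/');
ToyCrawler->run_hook('engine.handle_response', 'http://example.com/');

print "$_\n" for @seen;
```

Keeping callbacks in a per-name list is what lets several independent pieces of code attach to the same point without knowing about each other.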
All components (engine, provider, handler) are overridable and switchable. However, if you plan on customizing them, be aware that Gungho uses Class::C3 extensively, and hence you may see warnings about the code you use.
HOW NOT TO USE GUNGHO
Gungho is designed to fetch a massive number of URLs on a continuous basis. If you intend to use Gungho against a single URL or a single host, some caution is in order.
In that kind of setup Gungho will most likely not deliver adequate performance, and you may well be better off with a module such as LWP::UserAgent.
Of course, you may still want to use Gungho for features that LWP::UserAgent does not have, but be aware that tuning will be necessary.
GLOBAL CONFIGURATION OPTIONS
- debug
---
debug: 1
Setting debug to a non-zero value will trigger debug messages to be displayed.
COMPONENTS
Components add new functionality to Gungho. They are loaded at startup from the config file or hash passed to Gungho.
Gungho->run({
    components => [
        'Throttle::Simple'
    ],
    throttle => {
        max_interval => ...,
    }
});
Components modify Gungho's inheritance structure at run time to add extra functionality to Gungho, and therefore should only be loaded before starting the engine.
Please refer to each component's documentation for details:
- Gungho::Component::Authentication::Basic
- Gungho::Component::BlockPrivateIP
- Gungho::Component::Cache
- Gungho::Component::RobotRules
- Gungho::Component::RobotsMETA
- Gungho::Component::Scraper
- Gungho::Component::Throttle::Domain
- Gungho::Component::Throttle::Simple
INLINE
If you only need a simple crawler, you may want to look at Gungho::Inline:
Gungho::Inline->run({
    provider => sub { ... },
    handler  => sub { ... }
});
See the manual for Gungho::Inline for details.
PLUGINS
Plugins are different from components in that, whereas components require the developer to explicitly call the methods, plugins are loaded and are not touched afterwards.
Please refer to the documentation of each plugin for details.
- RequestLog
- Statistics
HOOKS
Currently available hooks are:
engine.send_request
engine.handle_response
METHODS
component_base_class
Used for Class::C3::Componentised
CODE
The code is hosted on Google Code. The repository is available at the following URL:
http://gungho-crawler.googlecode.com/svn/trunk
AUTHOR
Copyright (c) 2007 Daisuke Maki <daisuke@endeworks.jp>
CONTRIBUTORS
LICENSE
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
See http://www.perl.com/perl/misc/Artistic.html