Keywords: distributed crawler; task scheduling; load balancing; network measurement; global network positioning
GNP-based scheduling strategy for distributed crawling
LIU Shuang1, JIANG Chun-xiang2, ZHANG Wei-zhe1, LI Dong1, ZHANG Hong3
(1. School of Computer Science & Technology, Harbin Institute of Technology, Harbin 150001, China; 2. Heilongjiang Branch of National Computer Network Emergency Response Technical Team/Coordination Center of China, Harbin 150001, China; 3. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China)
Abstract: To address task scheduling and load balancing in distributed search engines, this paper proposes a GNP-based scheduling strategy for distributed crawling together with a load balancing method. The strategy uses an Internet distance estimation mechanism in place of large-scale network distance measurement, which both improves the system's response time and reduces the WAN load that the system generates. A distributed search engine was built by deploying crawling nodes across WANs, and several scheduling strategies were implemented on it. Online experiments show a substantial improvement in system performance. ......
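As a rough illustration of the idea summarized above, the following minimal sketch assigns crawl targets to crawling nodes using distances estimated from coordinates rather than direct measurement. It is not the authors' implementation. It assumes that every node and target already has a coordinate in a shared geometric space (the GNP embedding step itself is not shown), that estimated distance is Euclidean, and that load balancing is a simple per-node task cap. The names `estimated_distance`, `assign_targets`, and `capacity` are hypothetical.

```python
import math


def estimated_distance(a, b):
    """Estimate network distance as the Euclidean distance between two
    coordinate tuples (assumed to come from a prior GNP-style embedding)."""
    return math.dist(a, b)


def assign_targets(node_coords, target_coords, capacity):
    """Greedily assign each target to the nearest node that still has
    spare capacity. `capacity` is a hypothetical per-node task limit
    standing in for a load balancing policy."""
    load = {node: 0 for node in node_coords}
    assignment = {}
    for target, t_xy in target_coords.items():
        # Only nodes below the cap are eligible for this target.
        candidates = [n for n in node_coords if load[n] < capacity]
        if not candidates:
            raise ValueError("total node capacity exceeded")
        best = min(
            candidates,
            key=lambda n: estimated_distance(node_coords[n], t_xy),
        )
        assignment[target] = best
        load[best] += 1
    return assignment


if __name__ == "__main__":
    # Illustrative coordinates only; not data from the paper.
    nodes = {"A": (0.0, 0.0), "B": (10.0, 0.0)}
    targets = {"t1": (1.0, 0.0), "t2": (2.0, 0.0), "t3": (9.0, 0.0)}
    print(assign_targets(nodes, targets, capacity=2))
```

With a lower cap, a target that is nearest to a full node falls through to the next-nearest node, which is the simplest way to trade estimated proximity against load in this sketch.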