Keywords: ancient Chinese; cascaded conditional random fields; data sparseness; sentence segmentation; punctuation tagging
Method of sentence segmentation and punctuation for ancient Chinese literature based on cascaded CRF
ZHANG He, WANG Xiao-dong, YANG Jian-yu, ZHOU Wei-dong3
(1. College of Computer & Information Technology, Henan Normal University, Xinxiang Henan 453007, China; 2. Beijing d-Ear Technologies Co., Ltd., Beijing 100085, China; 3. Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China)
Abstract: Data sparseness is a primary challenge when natural language processing technology is applied to sentence segmentation and punctuation of ancient Chinese literature. To overcome this difficulty, a 6-tag set was designed and a method based on cascaded conditional random fields was proposed. The main idea is as follows: based on the 6-tag set, a low-level model determines sentence boundaries from the observation sequence, and a high-level model punctuates the sentences by considering both the observation sequence and the low-level model's results. A closed test and an open test were conducted on a mixed corpus of approximately 5M. The F-measures for sentence segmentation and punctuation were 96.48% and 91.35%, respectively, in the closed test, and 71.42% and 67.67%, respectively, in the open test.......
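The cascade described in the abstract can be illustrated with a minimal Python sketch of its data flow. This is not the authors' implementation. The two-label boundary scheme (`B`/`N`) is a simplified stand-in for the paper's 6-tag set, which the abstract does not spell out. The punctuation inventory, the feature names, and the helper functions are all hypothetical. The two models are passed in as plain callables, so any sequence labeller (for instance a trained CRF) could be substituted.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical punctuation inventory; the abstract does not list the marks handled.
PUNCTUATION = set(",。?!;:、")

Features = Dict[str, str]
Model = Callable[[List[Features]], List[str]]


def strip_punctuation(text: str) -> Tuple[List[str], List[str], List[str]]:
    """Split punctuated text into characters plus two label layers.

    boundary[i] is "B" if a punctuation mark follows character i, else "N"
    (a simplified stand-in for the paper's 6-tag set).
    marks[i] is that punctuation mark, or "O" if none follows.
    If several marks follow one character, the last one wins.
    """
    chars: List[str] = []
    boundary: List[str] = []
    marks: List[str] = []
    for ch in text:
        if ch in PUNCTUATION:
            if chars:  # attach the mark to the preceding character
                boundary[-1] = "B"
                marks[-1] = ch
        else:
            chars.append(ch)
            boundary.append("N")
            marks.append("O")
    return chars, boundary, marks


def low_level_features(chars: List[str], i: int) -> Features:
    """Features from the observation sequence only (assumed context window of 1)."""
    return {
        "char": chars[i],
        "prev": chars[i - 1] if i > 0 else "<s>",
        "next": chars[i + 1] if i + 1 < len(chars) else "</s>",
    }


def high_level_features(chars: List[str], boundary: List[str], i: int) -> Features:
    """Observation features plus the low-level model's boundary decision."""
    feats = low_level_features(chars, i)
    feats["boundary"] = boundary[i]
    return feats


def cascade(chars: List[str], low_model: Model, high_model: Model) -> Tuple[List[str], List[str]]:
    """Run the low-level model, then feed its output to the high-level model."""
    n = len(chars)
    boundary = low_model([low_level_features(chars, i) for i in range(n)])
    marks = high_model([high_level_features(chars, boundary, i) for i in range(n)])
    return boundary, marks
```

In training, `strip_punctuation` would supply the gold `boundary` labels for the low-level model and the gold `marks` labels for the high-level model. At prediction time only the raw characters are available, and the high-level model sees the low-level model's predicted boundaries rather than gold ones.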