Journal "Software Engineering"
a journal on theoretical and applied science and technology
ISSN 2220-3397

Issue No. 6, 2015

Person Name Recognition in Newswire Text
I. V. Trofimov, Senior Research Assistant, e-mail: itrofimov@gmail.com, Ailamazyan Program Systems Institute of Russian Academy of Sciences, Pereslavl-Zalessky

Named entity recognition is an important part of present-day natural language processing. This paper addresses the recognition of person mentions (in the form of proper names) in Russian-language newswire text. The goal of the research is to determine what F-measure values can be achieved for Russian with fairly simple dictionary-and-heuristic methods, so that the results can serve as a starting point for further research. The method relies on a morphological dictionary of first names (more than 12,000 names) and on regular expressions that describe name structure independently of context. It was evaluated on a corpus of Russian-language newswire text (more than 2,000 documents). The study showed that two simple regular expressions for full-name recognition, together with a procedure that recognizes reoccurrences of stand-alone surnames, can yield an F-measure in excess of 94 for European names. For non-European names, the F-measure never came close to 80 because of low recall, even though we used lists of common Muslim name parts and vocabularies of Chinese and Korean surnames, and took into account the inverted order (family name followed by given name) typical of Chinese and Korean person names. The low recall is due to the incompleteness of the first-name vocabulary, inconsistent name transliteration, and some other factors. We believe that higher F-measure scores (especially for Asian names) can still be achieved through vocabulary extension alone.
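To make the dictionary-and-heuristic idea concrete, here is a minimal Python sketch under stated assumptions. It is not the paper's implementation: the three-entry `FIRST_NAMES` set stands in for the 12,000-name morphological dictionary, the single `FULL_NAME` pattern is an illustrative guess at what a context-independent name pattern might look like, and the function name `find_person_mentions` is hypothetical. Only nominative forms are handled here, whereas a morphological dictionary would also cover inflected forms.

```python
import re

# Hypothetical toy stand-in for the first-name dictionary
# (the paper's dictionary holds more than 12,000 names).
FIRST_NAMES = {"Иван", "Анна", "Сергей"}

# A capitalized Cyrillic word, optionally hyphenated (assumed surname shape).
WORD = r"[А-ЯЁ][а-яё]+(?:-[А-ЯЁ][а-яё]+)?"

# Illustrative full-name pattern: a dictionary first name followed by a surname.
FULL_NAME = re.compile(
    r"\b(" + "|".join(map(re.escape, sorted(FIRST_NAMES))) + r") (" + WORD + r")\b"
)


def find_person_mentions(text):
    """Return sorted (start, end, mention) tuples found in text."""
    mentions = []
    surnames = set()

    # Pass 1: full names anchored on a first name from the dictionary.
    for m in FULL_NAME.finditer(text):
        mentions.append((m.start(), m.end(), m.group(0)))
        surnames.add(m.group(2))

    # Pass 2: stand-alone reoccurrences of surnames seen in pass 1,
    # skipping positions already covered by a full-name match.
    covered = [(start, end) for start, end, _ in mentions]
    for m in re.finditer(r"\b" + WORD + r"\b", text):
        inside_full_name = any(s <= m.start() < e for s, e in covered)
        if m.group(0) in surnames and not inside_full_name:
            mentions.append((m.start(), m.end(), m.group(0)))

    return sorted(mentions)


sample = (
    "Иван Петров прибыл в Москву. "
    "Позже Петров встретил журналистов. "
    "Анна Смирнова задала вопрос."
)
for start, end, mention in find_person_mentions(sample):
    print(start, end, mention)
```

Building the pattern from the dictionary keeps the first-name check inside the regular expression itself, so a capitalized sentence-initial word such as "Позже" cannot consume the start of a name. A first name missing from the dictionary is simply never matched, which is one way incomplete vocabulary lowers recall.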

The proposed approach can be applied to tasks such as hypertext content enrichment (content linking, semantic enrichment, name highlighting) and person-based document indexing, or used as part of an integrated text analysis system.

Keywords: natural language processing, information extraction, named entity recognition, person name recognition, vocabulary of first names, vocabulary of family names, information extraction rules, annotated corpus, F-measure.
pp. 41–47