Динамический Конкорданс к тексту А. С. Пушкина

Introduction

«Конкорданс к тексту А. С. Пушкина» (КТП) is a dynamic generator of KWIC (Key Word in Context) concordance tables for the complete text of Aleksandr Pushkin. A KWIC concordance is useful for analysing how words are used.

Our corpus of Pushkin's works (the source of the concordance) is based on Русская Виртуальная Библиотека (РВБ), http://www.rvb.ru/pushkin/, i.e., the digital text of the edition: Пушкин, А. С., Полное собрание сочинений в 10-ти томах. М. 1959.

КТП's features:

generates KWIC on lemmatized forms (the database keys are lemmas produced automatically by the Russian Lemmatizer program). A non-lemmatized form (the database keys are the words as they appear in Pushkin's text) is also available.
allows flexible KWIC specification with regular expressions.
supports the Adjacent Search operation.
lets you view the passage of Pushkin's text that contains your hit word, using the Position data of the corpus.
has a built-in Russian Pseudo Keyboard for users without a Cyrillic keyboard. See the tab «Русская Клавиатура».
offers high performance thanks to a UNIX shared-memory database and an Ajax asynchronous web communication platform.

КТП is written in C++ using the C++ class libraries Wt C++ Web Toolkit, UTF-8 Russian Lemmatizer, Boost (Regex, Interprocess, Locale), and ICU (International Components for Unicode). It has been tested on FreeBSD 8.2-RELEASE and Mac OS X Snow Leopard, and it runs with the Apache22 FastCGI module on FreeBSD 8.2-RELEASE.

Expression

Enter expressions (character strings that state the condition words must meet for KWIC generation) in the text field next to Выражения and click поиск. A KWIC table for the specified expressions appears in the tab «KWIC».

Fig. 1: Text-box for expressions

Expressions in our concordance differ in concept from the key words of search engines. For a search engine, key words select “documents”. In the concordance, expressions select the “words” that match the specified condition.

Regular expressions may be used in expressions; see http://en.wikipedia.org/wiki/Regular_expression. However, an expression that contains none of the special or meta characters (.*+?^$[-]()) is not treated as a regular expression: it is matched exactly, which is faster than regular expression matching.
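The rule above can be sketched in Python as follows. This is only an illustration, not КТП's C++ code: the names META_CHARS, is_regex and find_keys are hypothetical, the index is a toy dictionary with invented counts, and the regular-expression branch uses a plain re.search because the exact matching mode of the real engine is not stated here.

```python
import re

# Special characters listed in the text above. An expression that
# contains none of them is treated as a literal database key.
META_CHARS = set(".*+?^$[-]()")


def is_regex(expression):
    """Return True if the expression contains any listed special character."""
    return any(ch in META_CHARS for ch in expression)


def find_keys(expression, index):
    """Return the keys of `index` selected by `expression` (hypothetical helper)."""
    if not is_regex(expression):
        # Exact matching: one dictionary lookup, no scan over the keys.
        return [expression] if expression in index else []
    # Regular expression: every key has to be tested in turn.
    pattern = re.compile(expression)
    return [key for key in index if pattern.search(key)]


# Toy index: key -> number of occurrences (invented values).
index = {"краска": 3, "красный": 7, "покраснеть": 2}

print(find_keys("краска", index))    # exact lookup
print(find_keys("^краска$", index))  # regular expression, same single key
print(find_keys("красн", index))     # no special characters, so no partial match
```

The last call returns an empty list: without special characters the expression has to equal a key, so a prefix such as красн selects nothing unless it is written as ^красн.*.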

Here are examples of expressions and their matching words:

крас.*: красавица, краска, красный, покраснеть, прекрасный,..

^красн.* (that means “words with красн at the beginning”): красный, краснеть, красноватый,..

^крас.*а$: красавица, краска,..

^краска$: краска (only)

краска (exact matching): краска (faster than ^краска$)

^п[её]стр.*: пестрый, пёстрый, пестреть, пестрота,..

^п[её]стр[^о]*: пестрый, пёстрый, пестреть,.. (does not match пестрота because of the [^о] specification: “^” inside “[]” means NOT)
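The examples above can be reproduced with a short Python sketch. It rests on an assumption: a match may begin anywhere in the word but has to run to the end of the word. That reading is consistent with every example listed (крас.* finds покраснеть, while ^п[её]стр[^о]* rejects пестрота), but how КТП's engine actually applies the pattern is not documented here. The function name matching_words and the word list are illustrative only.

```python
import re

WORDS = [
    "красавица", "краска", "красный", "покраснеть", "прекрасный",
    "краснеть", "красноватый", "пестрый", "пёстрый", "пестреть", "пестрота",
]


def matching_words(expression, words):
    """Return the words selected by `expression`.

    Assumption: the match may start anywhere but must reach the end of
    the word, so the expression is wrapped as (?:expression)$.
    """
    pattern = re.compile(f"(?:{expression})$")
    return [word for word in words if pattern.search(word)]


for expression in ["крас.*", "^красн.*", "^крас.*а$", "^краска$",
                   "^п[её]стр.*", "^п[её]стр[^о]*"]:
    print(expression, "->", ", ".join(matching_words(expression, WORDS)))
```

Under a plain unanchored search, ^п[её]стр[^о]* would also select пестрота, because the prefix пестр alone satisfies the pattern; the end-of-word requirement is what excludes it in this sketch.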

Adjacent Search operation

The Adjacent Search operation generates the KWIC of words that occur within a specified distance, counted in words or in lines, of a paired word. Format: TargetExpression<W|Ln>ComparedExpression. W: word operation. L: line operation. n: the distance between TargetExpression and ComparedExpression as a decimal value (0 < n < 1000). A<W5>B generates a KWIC table of the words matching expression A that occur within 5 words of a word matching expression B. C<L3>D generates the KWIC of expression C within 3 lines of expression D. A line means a verse line (стих) in poetry and a paragraph in prose. Distance 0 is not allowed. The maximum distance is 1000.

For instance, the expression ^п[её]стр.*<W3>при.* gives a KWIC table of the words that match ^п[её]стр.* and occur within 3 words of a word matching при.*. Fig. 2 shows the KWIC for this expression.
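As an illustration of the format, the following Python sketch splits an Adjacent Search expression into its parts. The names AdjacentQuery and parse_adjacent are hypothetical and this is not КТП's parser. The text gives both 0 < n < 1000 and a maximum of 1000; the sketch assumes that 1000 itself is accepted.

```python
import re
from typing import NamedTuple


class AdjacentQuery(NamedTuple):
    target: str     # TargetExpression
    unit: str       # "W" (words) or "L" (lines)
    distance: int   # n
    compared: str   # ComparedExpression


# The operator: "<", then W or L, then decimal digits, then ">".
_OPERATOR = re.compile(r"<([WL])([0-9]+)>")


def parse_adjacent(expression):
    """Split 'Target<Wn>Compared' or 'Target<Ln>Compared' into its parts."""
    found = _OPERATOR.search(expression)
    if found is None:
        raise ValueError("no <Wn> or <Ln> operator in expression")
    distance = int(found.group(2))
    if not 1 <= distance <= 1000:  # assumption: 1000 is allowed
        raise ValueError("distance must be between 1 and 1000")
    target = expression[:found.start()]
    compared = expression[found.end():]
    if not target or not compared:
        raise ValueError("both expressions are required")
    return AdjacentQuery(target, found.group(1), distance, compared)


print(parse_adjacent("^п[её]стр.*<W3>при.*"))
```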

Fig. 2: Adjacent Search operation, example 1.

Cf. the result of the inverted condition: при.*<W3>^п[её]стр.*. See Fig. 3.

Fig. 3: Adjacent Search operation, example 2.
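The word operation can be pictured with a small Python sketch. It assumes that "within n words" means the positions of the two words in the text differ by at most n, and it uses a plain re.search on each token; both are assumptions, and adjacent_hits is a hypothetical name. The token list is an invented word sequence, not a quotation from Pushkin.

```python
import re


def adjacent_hits(tokens, target, compared, distance):
    """Return the positions of tokens that match `target` and lie within
    `distance` tokens of a different token that matches `compared`."""
    target_re = re.compile(target)
    compared_re = re.compile(compared)
    compared_positions = [
        i for i, token in enumerate(tokens) if compared_re.search(token)
    ]
    hits = []
    for i, token in enumerate(tokens):
        if not target_re.search(token):
            continue
        if any(j != i and abs(i - j) <= distance for j in compared_positions):
            hits.append(i)
    return hits


# Invented sample sequence, for illustration only.
tokens = ["пестрый", "ковер", "лежал", "при", "входе", "в", "дом"]

print(adjacent_hits(tokens, "^п[её]стр.*", "при.*", 3))  # target side
print(adjacent_hits(tokens, "при.*", "^п[её]стр.*", 3))  # inverted condition
print(adjacent_hits(tokens, "^п[её]стр.*", "при.*", 2))  # distance too small
```

Swapping the two expressions changes which word of each pair is reported as the hit, which is the difference between the two examples in Fig. 2 and Fig. 3.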

Limitations of natural language processing

Be aware that our lemmatization is not always correct.

For automatic lemmatization of Russian words, КТП uses the UTF-8 Lemmatizer library (see: Лемматизатор европейских языков, http://lemmatizer.org/).

Automatic processing of natural language often selects the wrong lemma. Our lemmatizer confuses, for instance, сам and самый. And how would one decide which lemma of какая is the right one: какать or какой?

From the candidate forms we select a lemma according to the following priority of word classes (parts of speech):

priority 5 (highest): Существительное (noun), Местоимение (pronoun)
priority 4: Прилагательное (adjective), Числительное (numeral), Порядковое Числительное (ordinal numeral)
priority 3: Глагол (verb), Инфинитив (infinitive), Краткое Прилагательное (short adjective), Наречие (adverb), Предикатив (predicative)
priority 2: Деепричастие (adverbial participle), Краткое Причастие (short participle), Причастие (participle)
priority 1: Междометие (interjection), Предлог (preposition), Союз (conjunction), Частица (particle)

According to this priority rule, we select какой (priority 5) rather than какать (priority 3) as the lemma of какая.

Between candidates of the same priority, the shorter one is selected.
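The selection rule can be written down as a short Python sketch. The table repeats the priorities listed above; the function name select_lemma and the candidate format (lemma, part of speech) are assumptions, and the part-of-speech tags attached to the sample candidates are illustrative, not output of the real lemmatizer.

```python
# Priority of each word class, as listed above (5 is the highest).
PRIORITY = {
    "Существительное": 5, "Местоимение": 5,
    "Прилагательное": 4, "Числительное": 4, "Порядковое Числительное": 4,
    "Глагол": 3, "Инфинитив": 3, "Краткое Прилагательное": 3,
    "Наречие": 3, "Предикатив": 3,
    "Деепричастие": 2, "Краткое Причастие": 2, "Причастие": 2,
    "Междометие": 1, "Предлог": 1, "Союз": 1, "Частица": 1,
}


def select_lemma(candidates):
    """Pick one lemma from (lemma, part_of_speech) candidates.

    Higher priority wins; between equal priorities the shorter lemma wins.
    """
    best = max(candidates, key=lambda c: (PRIORITY[c[1]], -len(c[0])))
    return best[0]


# какая: the pronoun outranks the verb.
print(select_lemma([("какать", "Глагол"), ("какой", "Местоимение")]))

# Equal priority (illustrative tags): the shorter candidate is chosen.
print(select_lemma([("самый", "Местоимение"), ("сам", "Местоимение")]))
```

The tie-break on length shows one way in which two lemmas such as сам and самый can end up merged under the shorter form.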

Options

In the tab «Опции» you can choose the literary genres used for KWIC generation. See Fig. 4 and the tab «Жанровые Разделы». The corpus includes not only Pushkin's literary works but also documents from his private life, such as notes and letters. Variants are not included.

Fig. 4: Genres option

You can select the word database: Lemmatized form (the database keys are lemmatized) or Non Lemmatized form (the database keys are left as they appear in Pushkin's texts). See Fig. 5.

Fig. 5: Word Database selection.

When Lemmatized form is selected, you can also choose whether your input expressions are lemmatized (see Fig. 6):

ON: the input люблю is looked up under the DB key любить (lemmatized by the system)

OFF: the input люблю is looked up under the DB key люблю (as entered)

Fig. 6: Input lemmatizing option.

KWIC information

KWIC first shows your expression, the hit word, and the number of times the hit word occurs.

Context shows the line containing the hit word together with the preceding and following lines. A rose background marks the hit line. Here too, a line means a verse line (стих) in poetry and a paragraph in prose. Hit words are marked with emphasis tags. Because contexts are often very long, an option is available to limit how much text is shown before and after the key word. See Fig. 7.

The numeric information on the right is the Position of the hit line: genre (2 digits):number of the work (4 digits):line number of the hit word (6 digits). When the mouse pointer is placed over the position text, the title of the piece is shown. See Fig. 8.

The Position links to the Web page of Pushkin's text and points to the hit line, so that users can check the relevant passage. See Fig. 9.
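The Position format can be read and written with a few lines of Python. This sketch assumes the three fields are zero-padded and separated by colons, as the description suggests; Position, parse_position and format_position are hypothetical names, and the sample value is invented.

```python
import re
from typing import NamedTuple


class Position(NamedTuple):
    genre: int  # 2 digits
    work: int   # 4 digits
    line: int   # 6 digits


# Assumed layout: GG:WWWW:LLLLLL
_POSITION = re.compile(r"([0-9]{2}):([0-9]{4}):([0-9]{6})")


def parse_position(text):
    """Convert a position string into its three numeric fields."""
    found = _POSITION.fullmatch(text)
    if found is None:
        raise ValueError("not a position: " + repr(text))
    return Position(*(int(group) for group in found.groups()))


def format_position(position):
    """Convert a Position back into its zero-padded text form."""
    return f"{position.genre:02d}:{position.work:04d}:{position.line:06d}"


sample = "01:0002:000345"  # invented value, for illustration only
position = parse_position(sample)
print(position)
print(format_position(position))
```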

Fig. 7: Context size limit option.

Fig. 8: Hit line position.

Fig. 9: Linked Pushkin's text.

Russian Pseudo Keyboard

A Russian Pseudo Keyboard is available for users who have no means of entering Russian characters. See the tab «Русская Клавиатура» and Fig. 10. It offers 2 keyboard layouts: ЙЦУКЕН (standard Cyrillic layout) and ЯВЕРТЫ (phonetic layout).

Fig. 10: Russian Pseudo Keyboard.

Policy of Corpus editing

Our corpus of Pushkin's text is based on the РВБ Web contents. We corrected misprints in which Latin and Cyrillic characters are mixed within one word, such as словo (the second o is a Latin character).

The following text structures are included in the corpus: Main text, Preface, Afterword, Argument, Citation by Pushkin and its author's name, Stage direction, Person's name (role) in drama, Commentary by Pushkin.

The following structures are not included: Title, Subtitle, Chapter name, Stanza name, Text by editors (commentary or translation of foreign languages).

Contemporary editions of Pushkin's text generally follow post-revolutionary Russian orthography. However, Pushkin's historical works contain Old Slavonic or Ukrainian fragments such as «Сей камень возопіетъ о насъ ти вѣщати,..» (Примечания Пушкина к поэме «Полтава»). In these cases we kept the old characters ѣ and і (ѵ and ѳ do not occur in the edition of Pushkin's works).
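Misprints of this kind can be found mechanically. The Python sketch below flags words that mix Cyrillic and Latin letters and replaces a few Latin look-alikes; it is an illustration under stated assumptions, not the procedure actually used for the corpus. The names scripts_in, is_mixed and repair are hypothetical, and the look-alike table is a small assumed subset.

```python
import unicodedata


def scripts_in(word):
    """Return the set of scripts (Cyrillic, Latin) used by the letters of `word`."""
    scripts = set()
    for ch in word:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            scripts.add("Cyrillic")
        elif name.startswith("LATIN"):
            scripts.add("Latin")
    return scripts


def is_mixed(word):
    """True if the word contains both Cyrillic and Latin letters."""
    return scripts_in(word) == {"Cyrillic", "Latin"}


# Assumed subset of Latin letters that look like Cyrillic ones.
LOOKALIKES = str.maketrans({
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
})


def repair(word):
    """Replace Latin look-alikes, but only inside mixed-script words."""
    return word.translate(LOOKALIKES) if is_mixed(word) else word


misprint = "\u0441\u043b\u043e\u0432" + "o"  # Cyrillic "слов" + Latin "o"
print(is_mixed(misprint))
print(repair(misprint) == "\u0441\u043b\u043e\u0432\u043e")
```

Purely Latin words are left untouched, and old Cyrillic letters such as ѣ and і count as Cyrillic, so fragments in the old orthography are not flagged.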

Bug report

If you encounter any bugs in КТП, please report them by e-mail to isao@yasuda.homeip.net.

History

May 1, 2012: Initial test version. Corpus «Евгений Онегин» only.
May 4, 2012: Corpus of Pushkin's Complete Works.
May 13, 2012: Russian Pseudo Keyboard support.
May 19, 2012: Tab interface.
May 28, 2012: Corpus bugs fixed: (1) Argument texts were included. (2) Cyrillic old characters: ѣ, і were restored.
June 2, 2012: Rebuilt Title DB.
Nov. 3, 2013: Supported short report in context.