
Introduction

«Конкорданс к тексту А. С. Пушкина» (КТП) is a dynamic generator of KWIC (Key Word in Context) concordance tables for the complete works of Aleksandr Pushkin. A KWIC concordance is useful for analyzing how words are used.

Our corpus of Pushkin's works (the source of the concordance) is based on Русская Виртуальная Библиотека (РВБ): http://www.rvb.ru/pushkin/, i.e., the digital data of the edition: А. С. Пушкин - Полное собрание сочинений в 10-ти томах. М. 1959.

КТП's features:

  1. generates KWIC on lemmatized forms (the database keys are lemmas produced automatically by a Russian lemmatizer program). A non-lemmatized form (the database keys are the words exactly as they appear in Pushkin's text) is also available.
  2. allows flexible specification of KWIC conditions with regular expressions.
  3. supports the Adjacent Search operation.
  4. lets you view Pushkin's text containing your hit word, using the position data of the corpus.
  5. has a built-in Russian Pseudo Keyboard for users without a Cyrillic keyboard. See the tab «Русская Клавиатура».
  6. achieves high performance through a UNIX shared-memory database system and an Ajax asynchronous web communication platform.

КТП is written in C++, compiled with Clang++ / LLVM, and uses the following C++ class libraries: Wt C++ Web Toolkit, UTF-8 Russian Lemmatizer, Boost (Regex, Interprocess, Locale), and ICU (International Components for Unicode). It has been tested on FreeBSD 10.1-RELEASE and Mac OS X Mavericks, and it runs under the Apache24 FastCGI module on FreeBSD 10.1-RELEASE.

Expression

Enter expressions (character strings that specify the condition words must satisfy for KWIC generation) in the text field next to Выражения and click поиск. A KWIC table for the specified expressions will appear in the tab «KWIC».

Fig. 1: Text-box for expressions

Expressions in our concordance differ in concept from the keywords used by search engines. Search-engine keywords select “documents”, whereas concordance expressions select the “words” that match the specified condition.

Regular expressions can be used in expressions. For regular expressions, see http://en.wikipedia.org/wiki/Regular_expression. However, an expression that contains no special or meta characters (.*+?^$[-]()) is not treated as a regular expression; it is matched exactly (“perfect matching”), which is faster than regular-expression matching.

Here are examples of expressions and their matching words:

  • крас.*: красавица, краска, красный, покраснеть, прекрасный,..
  • ^красн.* (that means “words with красн at the beginning”): красный, краснеть, красноватый,..
  • ^крас.*а$: красавица, краска,..
  • ^краска$: краска (only)
  • краска (perfect matching): краска (faster than ^краска$)
  • ^п[её]стр.*: пестрый, пёстрый, пестреть, пестрота,..
  • ^п[её]стр[^о]*: пестрый, пёстрый, пестреть,.. (does not match пестрота because of [^о]: a “^” inside “[]” means NOT)
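
The matching rules illustrated by these examples can be sketched in a few lines of Python. This is only an illustration, not КТП's actual C++ implementation (which uses Boost.Regex). The function names and the sample word list are hypothetical. One assumption is inferred from the examples rather than stated in the text: the start of a word is free unless the expression begins with ^, while the expression must match through to the end of the word (otherwise ^п[её]стр[^о]* would also accept пестрота).

```python
import re

# Characters the help text lists as special/meta: . * + ? ^ $ [ - ] ( )
META_CHARS = set(".*+?^$[-]()")

# Hypothetical sample of dictionary words, taken from the examples above.
WORDS = [
    "красавица", "краска", "красный", "покраснеть", "прекрасный",
    "краснеть", "красноватый", "пестрый", "пёстрый", "пестреть", "пестрота",
]

def matches(expression: str, word: str) -> bool:
    """Return True if `word` satisfies `expression`."""
    if not (set(expression) & META_CHARS):
        # No meta characters: plain string comparison ("perfect matching").
        return expression == word
    # ASSUMPTION (inferred from the examples): the match may start anywhere
    # in the word unless anchored with ^, but it must reach the word's end.
    return re.search(f"(?:{expression})\\Z", word) is not None

def select(expression: str, words: list) -> list:
    """Return the words that satisfy `expression`, in their original order."""
    return [w for w in words if matches(expression, w)]

if __name__ == "__main__":
    for expr in ["крас.*", "^красн.*", "^крас.*а$", "краска", "^п[её]стр[^о]*"]:
        print(expr, "->", ", ".join(select(expr, WORDS)))
```

Skipping the regular-expression engine for plain strings is what lets the exact comparison run faster than ^краска$.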

Adjacent Search operation

The Adjacent Search operation generates a KWIC of the words that lie within a specified distance, counted in words or in lines, of their paired words. The format of an Adjacent Search specification is:

        TargetExpression<W|Ln>ComparedExpression
   where
   - TargetExpression: Expression to be retrieved for KWIC.
   - ComparedExpression: Expression to be measured in distance from TargetExpression.
   - W: Word operation.
   - L: Line operation.
   - n: distance between TargetExpression and ComparedExpression, given as a decimal integer (0<n<1000).

A<W5>B generates a KWIC table of the words matching expression A that lie within 5 words of a word matching expression B. C<L3>D generates a KWIC of the words matching expression C that lie within 3 lines of a word matching expression D. A line means a стих (verse line) in verse and a paragraph in prose. A distance of 0 is not allowed; the maximum distance is 999.

For instance, the expression ^п[её]стр.*<W3>при.* gives a KWIC table of the words that match ^п[её]стр.* and occur within 3 words of a word matching при.*. Fig. 2 shows the KWIC for the expression ^п[её]стр.*<W3>при.*

Fig. 2: Adjacent Search operation, example 1.

Cf. the result of the inverted condition: при.*<W3>^п[её]стр.*. See Fig. 3.

Fig. 3: Adjacent Search operation, example 2.
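
The specification format TargetExpression<W|Ln>ComparedExpression can be parsed as in the following Python sketch. It is not КТП's code; the names AdjacentQuery, parse_adjacent and hits_within are hypothetical. The helper hits_within also assumes that distance is the absolute difference between word (or line) positions, counted in both directions, which the text above does not state explicitly.

```python
import re
from typing import NamedTuple, Optional

class AdjacentQuery(NamedTuple):
    target: str    # expression whose matches are shown in the KWIC
    unit: str      # "W" = distance in words, "L" = distance in lines
    distance: int  # 1..999
    compared: str  # expression the distance is measured from

_ADJACENT = re.compile(
    r"(?P<target>.+?)<(?P<unit>[WL])(?P<n>[0-9]+)>(?P<compared>.+)"
)

def parse_adjacent(expr: str) -> Optional[AdjacentQuery]:
    """Parse an Adjacent Search specification; return None for a plain expression."""
    m = _ADJACENT.fullmatch(expr)
    if m is None:
        return None
    n = int(m.group("n"))
    if not 0 < n < 1000:
        raise ValueError("distance must satisfy 0 < n < 1000")
    return AdjacentQuery(m.group("target"), m.group("unit"), n, m.group("compared"))

def hits_within(target_pos: list, compared_pos: list, n: int) -> list:
    """Keep target positions that lie within n units of some compared position.

    ASSUMPTION: distance is |target - compared|, in either direction.
    """
    return [t for t in target_pos
            if any(0 < abs(t - c) <= n for c in compared_pos)]

if __name__ == "__main__":
    print(parse_adjacent("^п[её]стр.*<W3>при.*"))
```

Swapping the two expressions changes which words are reported as hits, which is why the inverted condition gives a different table.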

Limitations of natural language processing

Be aware that our lemmatization is not always correct.

For automatic lemmatization of Russian words, КТП uses the UTF-8 Lemmatizer Library (see: Лемматизатор европейских языков, http://lemmatizer.jooko.net/).

Automatic processing of natural language often selects the wrong lemma. Our lemmatizer confuses, for instance, сам and самый. And how would one decide whether the right lemma of какая is какать or какой?

From the candidate forms we select a lemma according to the following priority of word classes (parts of speech):

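  1. priority 5 (highest): Существительное (noun), Местоимение (pronoun)
  2. priority 4: Прилагательное (adjective), Числительное (numeral), Порядковое Числительное (ordinal numeral)
  3. priority 3: Глагол (verb), Инфинитив (infinitive), Краткое Прилагательное (short adjective), Наречие (adverb), Предикатив (predicative)
  4. priority 2: Деепричастие (adverbial participle), Краткое Причастие (short participle), Причастие (participle)
  5. priority 1: Междометие (interjection), Предлог (preposition), Союз (conjunction), Частица (particle)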

According to this priority rule, we select какой (priority 5), not какать (priority 3), as the lemma of какая.

Among candidates of the same priority, the shorter one is selected.
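
The selection rule can be expressed as a short Python sketch. It is an illustration only, not КТП's code: the function name select_lemma is hypothetical, and the part-of-speech tags given to the candidates in the example are assumptions chosen to agree with the priorities stated above. How a tie in both priority and length is resolved is not specified in the text; here the first such candidate wins.

```python
# Priority of each word class, as listed above (5 = highest).
PRIORITY = {
    "Существительное": 5, "Местоимение": 5,
    "Прилагательное": 4, "Числительное": 4, "Порядковое Числительное": 4,
    "Глагол": 3, "Инфинитив": 3, "Краткое Прилагательное": 3,
    "Наречие": 3, "Предикатив": 3,
    "Деепричастие": 2, "Краткое Причастие": 2, "Причастие": 2,
    "Междометие": 1, "Предлог": 1, "Союз": 1, "Частица": 1,
}

def select_lemma(candidates):
    """Pick one lemma from (lemma, word_class) pairs.

    Higher word-class priority wins; among equal priorities the shorter
    lemma wins. ASSUMPTION: a remaining tie goes to the first candidate.
    """
    best = min(candidates, key=lambda c: (-PRIORITY[c[1]], len(c[0])))
    return best[0]

if __name__ == "__main__":
    # Tags are assumed for illustration only.
    print(select_lemma([("какать", "Инфинитив"), ("какой", "Местоимение")]))
```

Because the rule looks only at word class and length, it cannot use context, which is one reason the chosen lemma is sometimes wrong.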

Options

In the tab «Опции» you can choose the literary genres to be used for KWIC generation. See Fig. 4 and the tab «Жанровые Разделы». The corpus includes not only Pushkin's literary works but also documents from his private life, such as notes and letters. Variants are not included.

Fig. 4: Genres option

Word Database selection: you can choose the lemmatized form (the database keys are lemmatized) or the non-lemmatized form (the database keys are left as they appear in Pushkin's texts). See Fig. 5.

Fig. 5: Word Database selection.

When the lemmatized form is selected, you can choose whether or not your input expressions are lemmatized (see Fig. 6):

  • ON: the user's input люблю is looked up under the DB key любить (lemmatized by the system)
  • OFF: the user's input люблю is looked up under the DB key люблю (as entered)
Fig. 6: Input lemmatizing option.

KWIC information

  • KWIC first shows your expression, the hit word, and the number of times the hit word occurs.
  • Context shows the line containing the hit word together with the preceding and following lines. A rose background marks the hit line. A line means a стих (verse line) in verse and a paragraph in prose. Hit words are marked with emphasis tags. Because contexts are often very long, an option is available to limit the amount of text shown before and after the key word. See Fig. 7.
  • The numeric data on the right side (e.g. 03:0836:000011) is the Position Indicator of the hit line. Its format is Genre ID [2 digits] : Number ID of the work [4 digits] : Line number of the hit word [6 digits]. When you place the mouse pointer over the Position Indicator, the title of the work is shown. See Fig. 8.
  • The Position Indicator links to the Web page of Pushkin's text and points to the hit line, so that users can check the relevant fragment. See Fig. 9.
Fig. 7: Context size limit option.
Fig. 8: Hit line position.
Fig. 9: Linked Pushkin's text.
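
For readers who want to process Position Indicators copied from a KWIC table, the format described above (2-digit genre ID, 4-digit work ID, 6-digit line number, separated by colons) can be decoded as in this Python sketch. The names Position, parse_position and format_position are hypothetical and not part of КТП.

```python
import re
from typing import NamedTuple

class Position(NamedTuple):
    genre_id: int  # 2 digits in the indicator
    work_id: int   # 4 digits
    line_no: int   # 6 digits; a line is a стих in verse, a paragraph in prose

_POSITION = re.compile(r"([0-9]{2}):([0-9]{4}):([0-9]{6})")

def parse_position(text: str) -> Position:
    """Decode an indicator such as '03:0836:000011'."""
    m = _POSITION.fullmatch(text)
    if m is None:
        raise ValueError(f"not a Position Indicator: {text!r}")
    return Position(*(int(g) for g in m.groups()))

def format_position(pos: Position) -> str:
    """Encode a Position back into the zero-padded indicator form."""
    return f"{pos.genre_id:02d}:{pos.work_id:04d}:{pos.line_no:06d}"

if __name__ == "__main__":
    print(parse_position("03:0836:000011"))
```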

Russian Pseudo Keyboard

For users who have no means of entering Russian characters, a Russian Pseudo Keyboard is available. See the tab «Русская Клавиатура» and Fig. 10. It offers two keyboard layouts: ЙЦУКЕН (standard Cyrillic layout) and ЯВЕРТЫ (phonetic layout).

Fig. 10: Russian Pseudo Keyboard.

Policy of Corpus editing

Our corpus of Pushkin's text is based on the РВБ Web contents. We corrected misprints in which Latin and Cyrillic characters are mixed within a single word, such as словo (where the second o is a Latin character).
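
Mixed-script misprints of this kind are invisible to the eye but easy to find mechanically. The following Python sketch shows one possible check based on Unicode character names; it is not necessarily how the corpus was actually corrected, and the function names are hypothetical.

```python
import unicodedata

def scripts_in(word: str) -> set:
    """Return the script names (first word of the Unicode name) of the letters in `word`."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER O" -> "CYRILLIC"
            name = unicodedata.name(ch, "")
            scripts.add(name.split(" ", 1)[0])
    return scripts

def has_mixed_script(word: str) -> bool:
    """True if the word contains both Latin and Cyrillic letters."""
    found = scripts_in(word)
    return "LATIN" in found and "CYRILLIC" in found

if __name__ == "__main__":
    suspicious = "слов\u006f"  # final letter is LATIN SMALL LETTER O
    clean = "слов\u043e"       # final letter is CYRILLIC SMALL LETTER O
    print(has_mixed_script(suspicious), has_mixed_script(clean))
```

Old Cyrillic letters such as ѣ and і are classified as Cyrillic by this check, so words that contain them are not flagged.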

The following text structures are included in the corpus: main text, preface, afterword, argument, citations by Pushkin together with their authors' names, stage directions, names of persons (roles) in dramas, and commentary by Pushkin.

The following structures are not included: title, subtitle, chapter name, stanza name, and text by editors (commentary or translations of foreign-language passages).

Generally, contemporary editions of Pushkin's text follow post-revolutionary Russian orthography. However, Pushkin's historical works contain Old Slavonic or Ukrainian fragments such as «Сей камень возопіетъ о насъ ти вѣщати,..» (Примечания Пушкина к поэме «Полтава»). In these cases we kept the old characters ѣ and і (ѵ and ѳ do not occur in this edition of Pushkin's works).

Bug report

If you encounter any bugs in КТП, please report them by e-mail to isao@yasuda.homeip.net.

History

May 1, 2012
Ver. 1.0rc. Initial test version. Corpus «Евгений Онегин» only.
May 4, 2012
Corpus of Pushkin's Complete Works.
May 13, 2012
Russian Pseudo Keyboard support.
May 19, 2012
Tab interface.
May 28, 2012
Corpus bugs fixed: (1) Argument texts were included. (2) Cyrillic old characters: ѣ, і were restored.
June 2, 2012
Rebuilt Title DB.
Nov. 3, 2013
Supported short report in context.
June 6, 2014
Ver. 1.0. Rebuilt on FreeBSD 9.2-RELEASE.
Jan. 31, 2015
Revised the link to “Lemmatizer.org”.
Feb. 18, 2015
Recompiled with Clang++ / LLVM 3.3, Wt C++ Web Toolkit 3.3.3.
July 19, 2015
Rebuilt on FreeBSD 10.1-RELEASE amd64. Recompiled with Clang++ / LLVM 3.4.1, Wt C++ Web Toolkit 3.3.4.