«Конкорданс к тексту А. С. Пушкина» (КТП, "Concordance to the Text of A. S. Pushkin") dynamically generates KWIC (Key Word in Context) concordance tables for the complete text of Aleksandr Pushkin. A KWIC concordance is useful for analyzing how words are used.
Our corpus of Pushkin's works (the source of the concordance) is based on Русская Виртуальная Библиотека (РВБ, the Russian Virtual Library): http:
КТП's features:
КТП is written in C++, compiled with Clang++ / LLVM, and uses the following C++ class libraries: Wt C++ Web Toolkit, UTF-8 Russian Lemmatizer, Boost (Regex, Interprocess, Locale), and ICU (International Components for Unicode). It has been tested on FreeBSD 10.1-RELEASE and Mac OS X Mavericks. КТП runs under the Apache24 FastCGI module on FreeBSD 10.1-RELEASE.
Enter expressions (character strings that state the condition words must meet for KWIC generation) in the text field next to Выражения ("Expressions") and click поиск ("search"). A KWIC table for the specified expressions appears in the «KWIC» tab.
Expressions in our concordance differ in concept from the keywords used by search engines. Search-engine keywords select documents, whereas our expressions select the words that match the specified condition.
Regular expressions can be used in expressions. For more about regular expressions, see http:
Here are examples of expressions and their matching words:
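As a rough illustration of word-level matching, the Python sketch below applies an expression to every word of a line and builds a small KWIC row for each match. It is not КТП's C++ implementation: the function names `tokenize` and `kwic`, the context width, and the use of partial (search-style) matching are all assumptions made for this example.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (runs of letters)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def kwic(text, expression, width=2):
    """Return (left context, keyword, right context) for each word matching expression.

    Assumption: the expression may match anywhere inside a word unless it is
    anchored with ^ or $.
    """
    words = tokenize(text)
    pattern = re.compile(expression)
    rows = []
    for i, word in enumerate(words):
        if pattern.search(word):
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            rows.append((left, word, right))
    return rows

if __name__ == "__main__":
    line = "Я помню чудное мгновенье: передо мной явилась ты"
    for left, keyword, right in kwic(line, "^п", width=1):
        print(f"{left:>12} | {keyword} | {right}")
```

Here the expression `^п` selects the words помню and передо, each shown with one word of context on either side.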
The Adjacent Search operation shows the KWIC for words that lie within a specified distance, counted in words or lines, of their paired words.
Specification format of the Adjacent Search operation:
TargetExpression<Wn>ComparedExpression or TargetExpression<Ln>ComparedExpression
where
- TargetExpression: Expression to be retrieved for KWIC.
- ComparedExpression: Expression whose distance from TargetExpression is measured.
- W: Word operation.
- L: Line operation.
- n: decimal distance between TargetExpression and ComparedExpression (0<n<1000).
A<W5>B generates a KWIC table of the words matching expression A that lie within 5 words of a word matching expression B. C<L3>D gives the KWIC of expression C within 3 lines of expression D. We treat lines as стихи (verse lines) in verse and as paragraphs in prose. A distance of 0 is not allowed, and the maximum distance is 999.
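The format described above can be parsed along the following lines. This is a minimal Python sketch, not КТП's own parser: the name `parse_adjacent` and the choice to return `None` for a plain (non-adjacent) expression are assumptions made for illustration.

```python
import re

# TargetExpression, then <W n> or <L n>, then ComparedExpression.
ADJACENT = re.compile(r"^(?P<target>.+)<(?P<op>[WL])(?P<n>[0-9]+)>(?P<compared>.+)$")

def parse_adjacent(expression):
    """Split an Adjacent Search expression into (target, op, n, compared).

    Returns None when the expression contains no <Wn> or <Ln> operator.
    Raises ValueError when n is outside the allowed range 0 < n < 1000.
    """
    match = ADJACENT.match(expression)
    if match is None:
        return None
    n = int(match.group("n"))
    if not 0 < n < 1000:
        raise ValueError(f"distance must satisfy 0 < n < 1000, got {n}")
    return match.group("target"), match.group("op"), n, match.group("compared")

if __name__ == "__main__":
    print(parse_adjacent("A<W5>B"))
    print(parse_adjacent("C<L3>D"))
```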
For instance, the expression ^п[её]стр.*<W3>при.* gives a KWIC table of the words that match ^п[её]стр.* and lie within 3 words of a word matching при.*. Fig. 2 shows the KWIC for this expression.
Cf. the result of the inverted condition: при.*<W3>^п[её]стр.*. See Fig. 3.
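The difference between an expression and its inversion can be pictured with a small Python sketch of the word operation. It rests on several assumptions that the text above does not confirm: distance is measured in both directions, "within n words" means an index difference of at most n, and a word is never paired with itself. The function name `adjacent_word_matches` and the sample sentence are invented for this example, and the line operation is not covered.

```python
import re

def adjacent_word_matches(words, target, compared, n):
    """Return (index, word) for words matching target within n words of a compared match."""
    target_re = re.compile(target)
    compared_re = re.compile(compared)
    # Positions of every word that matches the compared expression.
    compared_positions = [i for i, w in enumerate(words) if compared_re.search(w)]
    hits = []
    for i, word in enumerate(words):
        if not target_re.search(word):
            continue
        # Assumption: distance is symmetric and a word does not pair with itself.
        if any(0 < abs(i - j) <= n for j in compared_positions):
            hits.append((i, word))
    return hits

if __name__ == "__main__":
    # Toy sentence, not a quotation from the corpus.
    words = "пестрый ковер лежал при входе в дом".split()
    print(adjacent_word_matches(words, "^п[её]стр.*", "при.*", 3))
    print(adjacent_word_matches(words, "при.*", "^п[её]стр.*", 3))
```

The first call reports the word matching ^п[её]стр.*, while the inverted call reports the word matching при.*. The pair of words is the same, but the keyword shown in the KWIC table changes.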
Be aware that our lemmatization is not always right.
For automatic lemmatization of Russian words, КТП uses the UTF-8 Lemmatizer Library. See Лемматизатор европейских языков (Lemmatizer for European Languages), http:
Automatic processing of natural language often makes mistakes in selecting the lemma. Our lemmatizer confuses, for instance, сам and самый. And how would you decide which is the right lemma of какая: какать or какой?
From a number of candidate forms we select a lemma in accordance with the following priority in word class (part of speech):
According to this priority rule, the lemma we select for какая is какой (priority 5), not какать (priority 3).
Between candidates of the same priority the shorter one is selected.
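The selection rule can be sketched in Python as follows. The sketch assumes that a larger priority number wins, as the какой (5) versus какать (3) example suggests, and that the priority of each candidate is already known. The function name `select_lemma` and the priority value given to сам and самый are hypothetical.

```python
def select_lemma(candidates):
    """Pick a lemma from (lemma, priority) pairs.

    Assumption: the highest priority number wins; among candidates of equal
    priority the shorter lemma is selected.
    """
    lemma, _priority = min(candidates, key=lambda c: (-c[1], len(c[0])))
    return lemma

if __name__ == "__main__":
    # Priorities 3 and 5 come from the example in the text.
    print(select_lemma([("какать", 3), ("какой", 5)]))
    # Hypothetical equal priorities, to show the tie-break on length.
    print(select_lemma([("самый", 5), ("сам", 5)]))
```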
You can choose which literary genres are used for KWIC generation in the «Опции» tab. See Fig. 4 and the «Жанровые Разделы» tab. The corpus includes not only Pushkin's literary works but also documents from his private life, such as notes and letters. Variants are not included.
Two word databases are available: the lemmatized form (the database keys are lemmatized) and the non-lemmatized form (the database keys are left as they appear in Pushkin's texts). See Fig. 5.
When the lemmatized form is selected, you can also choose whether or not to lemmatize your input expressions (see Fig. 6).
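The effect of these two choices can be pictured with a toy index in Python. This is only a sketch: the lemma table `TOY_LEMMAS` and the functions `build_index` and `lookup` are hypothetical stand-ins, and КТП itself relies on a lemmatizer library rather than a fixed table.

```python
from collections import defaultdict

# Hypothetical lemma table used only for this example.
TOY_LEMMAS = {"друзья": "друг", "друга": "друг", "какая": "какой"}

def lemmatize(word):
    """Return the toy lemma of word, or the word itself if unknown."""
    return TOY_LEMMAS.get(word, word)

def build_index(words, lemmatized):
    """Map each key (lemma or surface form) to the positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(words):
        key = lemmatize(word) if lemmatized else word
        index[key].append(position)
    return dict(index)

def lookup(index, query, lemmatize_query):
    """Look up a query word, optionally lemmatizing it first."""
    key = lemmatize(query) if lemmatize_query else query
    return index.get(key, [])

if __name__ == "__main__":
    words = ["друзья", "друга", "какая"]
    lemma_index = build_index(words, lemmatized=True)
    surface_index = build_index(words, lemmatized=False)
    print(lookup(lemma_index, "друзья", lemmatize_query=True))
    print(lookup(surface_index, "друзья", lemmatize_query=False))
```

With a lemmatized database, a query that is itself lemmatized finds every inflected form of the word. With the non-lemmatized database, only the exact form is found.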
For users who have no means of typing Russian characters, a Russian pseudo-keyboard is available. See the «Русская Клавиатура» tab and Fig. 10. It offers two keyboard layouts: ЙЦУКЕН (the standard Cyrillic layout) and ЯВЕРТЫ (a phonetic layout).
Our corpus of Pushkin's text is based on the РВБ web contents. We corrected misprints caused by mixing Latin and Cyrillic characters within a single word, as in словo (where the second o is a Latin character).
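Misprints of this kind can be found mechanically. The Python sketch below flags words that contain both Latin and Cyrillic letters by looking at Unicode character names. It illustrates one possible check, not the procedure we actually used; the function names `scripts_in` and `is_mixed` are made up for this example.

```python
import unicodedata

def scripts_in(word):
    """Return the set of scripts (Cyrillic, Latin) used by the letters of word."""
    scripts = set()
    for ch in word:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            scripts.add("Cyrillic")
        elif name.startswith("LATIN"):
            scripts.add("Latin")
    return scripts

def is_mixed(word):
    """True when a word mixes Latin and Cyrillic letters."""
    return scripts_in(word) == {"Cyrillic", "Latin"}

if __name__ == "__main__":
    suspicious = "слов" + "\u006f"   # final letter is LATIN SMALL LETTER O
    correct = "слов" + "\u043e"      # final letter is CYRILLIC SMALL LETTER O
    print(is_mixed(suspicious), is_mixed(correct))
```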
The following text structures are included in the corpus: main text, preface, afterword, argument, citations by Pushkin and the names of their authors, stage directions, names of dramatic characters (roles), and commentary by Pushkin.
The following structures are not included: titles, subtitles, chapter names, stanza names, and text by editors (commentary or translations from foreign languages).
Generally, contemporary editions of Pushkin's text follow the post-revolutionary Russian orthography. However, Pushkin's historical works contain Old Slavonic or Ukrainian fragments such as «Сей камень возопіетъ о насъ ти вѣщати,..» (Примечания Пушкина к поэме «Полтава»). In these cases we kept the old characters ѣ and і (ѵ and ѳ are not found in the edition of Pushkin's works).
If you encounter any bugs in КТП, please let us know by e-mail at isao@yasuda.homeip.net.