NSU LabADT & RNC: Morpheme Segmentation

Morpheme Segmentation Models of the Russian National Corpus

On this page you can test the morpheme segmentation models developed by the team of the Laboratory of Applied Digital Technologies (Novosibirsk State University) in collaboration with the Russian National Corpus.

Try segmentation online!

Belarusian Czech Russian

Word

Please enter the word's lemma, without spaces, punctuation marks, or numbers.

Segmentation: разбор

How does it work?

First, we split each word into letters and assigned each letter a BMES label (Begin, Middle, End, or Single, marking the letter's position within its morpheme) combined with the morpheme type.
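As an illustration of this labeling step, here is a minimal Python sketch. The label format `B-PREF`, the type names `PREF` and `ROOT`, the function name `bmes_labels`, and the segmentation of "разбор" into "раз" + "бор" are assumptions made for the example; the models' actual label inventory may differ.

```python
from typing import List, Tuple


def bmes_labels(morphemes: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn (morpheme, type) pairs into one (letter, label) pair per letter.

    The "<position>-<type>" label format is an assumption for illustration.
    """
    labeled = []
    for text, mtype in morphemes:
        if len(text) == 1:
            # A one-letter morpheme gets the Single tag.
            positions = ["S"]
        else:
            # Begin, zero or more Middle letters, then End.
            positions = ["B"] + ["M"] * (len(text) - 2) + ["E"]
        for letter, position in zip(text, positions):
            labeled.append((letter, f"{position}-{mtype}"))
    return labeled


# Assumed segmentation used only as an example: prefix "раз" + root "бор".
example = bmes_labels([("раз", "PREF"), ("бор", "ROOT")])
for letter, label in example:
    print(letter, label)
```

Each letter carries both its position in the morpheme and the morpheme type, so segmentation becomes a per-character classification task.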

Then we fine-tuned pretrained BERT-like models for character-level annotation for 30 epochs: roberta-small-belarusian for Belarusian, Czert-B-base-cased for Czech, and RuRoberta-large for Russian.
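Once a model has predicted one label per letter, the labels have to be turned back into a segmentation. The sketch below shows one plausible way to do that in Python. It is not the authors' code: the function name `decode_bmes`, the `B-PREF`-style label strings, and the "/" separator are assumptions for illustration.

```python
from typing import List, Tuple


def decode_bmes(letters: str, labels: List[str]) -> List[Tuple[str, str]]:
    """Group letters into (morpheme, type) pairs from per-letter BMES labels.

    Assumes labels look like "B-ROOT" (position tag, hyphen, morpheme type).
    """
    if len(letters) != len(labels):
        raise ValueError("expected exactly one label per letter")
    morphemes = []
    current, current_type = "", ""
    for letter, label in zip(letters, labels):
        position, mtype = label.split("-", 1)
        # A B or S tag starts a new morpheme, so close any unfinished one.
        if position in ("B", "S") and current:
            morphemes.append((current, current_type))
            current = ""
        current += letter
        current_type = mtype
        # An E or S tag ends the current morpheme.
        if position in ("E", "S"):
            morphemes.append((current, current_type))
            current = ""
    if current:
        # Tolerate a prediction that never closes its last morpheme.
        morphemes.append((current, current_type))
    return morphemes


# Hypothetical model output for the lemma "разбор".
predicted = ["B-PREF", "M-PREF", "E-PREF", "B-ROOT", "M-ROOT", "E-ROOT"]
segmentation = decode_bmes("разбор", predicted)
rendered = "/".join(morpheme for morpheme, _ in segmentation)
print(rendered)
```

Model predictions are not guaranteed to form well-formed B/M/E sequences, so the decoder closes an open morpheme whenever a new B or S tag appears instead of raising an error.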

Find out more about our research:

BERT-like Models for Slavic Morpheme Segmentation. Morozov et al. (2025) In this study, we compared convolutional neural networks and pre-trained BERT-like models on morpheme segmentation for three Slavic languages: Belarusian, Czech, and Russian. For Russian and Czech, the pre-trained models improved segmentation quality and recognized out-of-vocabulary roots better.
Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts? Morozov et al. (2024) In this study, we compared different morpheme segmentation models (GBDT, LSTM, CNN, Transformer) for Russian with each other and with experts. The convolutional networks outperform the other approaches and, on average, segment words at the expert level. However, none of the architectures cope with out-of-vocabulary morphemes.
Generalization Ability of CNN-Based Morpheme Segmentation. Garipov et al. (2023) In this study, we tested how the quality of automatic segmentation of Russian words with convolutional neural networks changes under three conditions: a reduced training sample, out-of-vocabulary roots, and transfer learning between different dictionaries.

How to cite

Dmitry Morozov, Lizaveta Astapenka, Anna Glazkova, Timur Garipov, and Olga Lyashevskaya. 2025. BERT-like Models for Slavic Morpheme Segmentation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6795–6815, Vienna, Austria. Association for Computational Linguistics.