RuLexNorm: A Lexical Normalization Dataset for Russian Social Media Text

RuLexNorm is the first open dataset for lexical normalization of Russian-language social media texts.

Disclaimer: This dataset contains real comments with explicit or potentially sensitive content.

Description

Dataset

The data/RuLexNorm.norm file contains the core dataset: over 6,000 Russian text pairs collected from Twitter (now X), each mapping an informal utterance to its normalized equivalent.

Model

The provided baseline model for lexical normalization is implemented in PyTorch using the Hugging Face transformers library. It is based on a fine-tuned version of the Qwen2.5-3B model, adapted for Russian.

The model is saved in the LexNorm/model/ directory.

Data Format

The corpus is organized into entries separated by blank lines. Each entry consists of one or more lines representing a single text pair.

Each line within an entry follows this structure:

Original token [TAB] Normalized form
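As a rough illustration of the layout described above, the following sketch reads the corpus into a list of entries, each a list of (original, normalized) pairs. The function names `parse_entries` and `read_corpus` are hypothetical and not part of the repository; the sketch also assumes the file is UTF-8 encoded, which the README does not state.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (original token, normalized form)


def parse_entries(lines: Iterable[str]) -> List[List[Pair]]:
    """Group tab-separated lines into entries split on blank lines."""
    entries: List[List[Pair]] = []
    current: List[Pair] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current entry, if there is one.
            if current:
                entries.append(current)
                current = []
            continue
        # Split on the first TAB only; a missing or empty right-hand side
        # yields an empty normalized form.
        token, _, norm = line.partition("\t")
        current.append((token, norm))
    if current:
        entries.append(current)
    return entries


def read_corpus(path: str) -> List[List[Pair]]:
    """Hypothetical helper: load a .norm file (UTF-8 is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return parse_entries(handle)


# Small in-memory sample in the same layout as the corpus file.
sample = "Прив!\tПривет!\n\nЧуть\tЧуть-чуть\nчуть\t\n"
for entry in parse_entries(sample.splitlines()):
    print(entry)
```

With the repository checked out, `read_corpus("data/RuLexNorm.norm")` would be the corresponding call for the full dataset.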

The normalization can be of three types:

One-to-One (1-1): A single token is normalized to a single word.

Прив! -> Привет!

One-to-Many (1-N): A single token is normalized to multiple words, separated by a space.

Чд? -> Что делаешь?

Many-to-One (N-1): Multiple consecutive tokens are merged into a single normalized form. The normalization is provided for the first token, and subsequent tokens in the sequence have an empty string as their normalized form.

Чуть -> Чуть-чуть
чуть ->
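The three cases above can be told apart mechanically from the pairs of a single entry. The sketch below is one possible way to do that, not the authors' tooling: `label_entry` and `join_sides` are hypothetical names, and the code assumes that an empty normalized form always marks a merged token in an N-1 sequence, as the description above says. Tokens that are left unchanged are simply labelled 1-1.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (original token, normalized form)


def label_entry(entry: List[Pair]) -> List[str]:
    """Label each line of an entry as 1-1, 1-N, or part of an N-1 merge."""
    labels: List[str] = []
    for i, (_, norm) in enumerate(entry):
        next_is_empty = i + 1 < len(entry) and entry[i + 1][1] == ""
        if norm == "":
            # Merged into the normalization given on an earlier line.
            labels.append("N-1 (merged)")
        elif next_is_empty:
            # First token of a merged sequence; carries the normalization.
            labels.append("N-1")
        elif len(norm.split()) > 1:
            labels.append("1-N")
        else:
            labels.append("1-1")
    return labels


def join_sides(entry: List[Pair]) -> Tuple[str, str]:
    """Rebuild the informal and normalized texts as space-joined strings."""
    source = " ".join(token for token, _ in entry)
    target = " ".join(norm for _, norm in entry if norm)
    return source, target


# Illustrative entry combining the 1-N and N-1 examples from this section.
entry = [("Чд?", "Что делаешь?"), ("Чуть", "Чуть-чуть"), ("чуть", "")]
print(label_entry(entry))
print(join_sides(entry))
```

Joining with single spaces is a simplification; it is enough to produce sentence-level source and target strings for training or evaluation.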

Authors

Irina Koliaskina
Dmitry Sholomov

License

This work is licensed under the Creative Commons Attribution-ShareAlike 2.5 Generic License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.5/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

For any questions or suggestions, please contact Irina Koliaskina ([email protected])

