Complete implementation of an English-to-Russian neural machine translation system using attention mechanisms and transformer architectures.
This project develops and evaluates a neural machine translation (NMT) system for English-to-Russian translation using modern deep learning approaches. The system fine-tunes the Helsinki-NLP/opus-mt-en-ru pre-trained model on the OPUS Books (opus_books) dataset.
- Transformer-based architecture with attention mechanisms
- Fine-tuned Helsinki-NLP/opus-mt-en-ru model
- Training on 10,000+ sentence pairs from OPUS Books (opus_books)
- Evaluation using multiple metrics (BLEU, chrF, TER)
- Handling of Russian morphological complexity
- Integration with Hugging Face Transformers library
- SacreBLEU Score: 22.6
- chrF Score: 48.5
- Training: Kaggle T4 GPU environment
- Dataset: OPUS Books corpus (opus_books, EN-RU)
pip install torch transformers datasets evaluate

Dataset
- Source: OPUS Books parallel corpus (opus_books)
- Language Pair: English → Russian
- Training Samples: 10,000+
- Format: JSON with 'src' and 'tgt' fields

Architecture
- Model: Sequence-to-Sequence with Transformer encoder-decoder
- Encoder: Multi-head self-attention (12 heads)
- Decoder: Masked multi-head attention with cross-attention
- Vocabulary: BPE tokenization (32k tokens)

Related Publications
- Development and Evaluation of an English-to-Russian Neural Machine Translation System
- Développement et Évaluation d'un Système de Traduction Automatique Neuronale (Development and Evaluation of a Neural Machine Translation System)
- Système de traduction automatique neuronale du russe vers l'anglais (Russian-to-English Neural Machine Translation System) (Conference)
See: https://huggingface.co/Helsinki-NLP/opus-mt-en-ru
MIT License - See LICENSE file for details
Dominique S. Loyer
- ORCID: https://orcid.org/0009-0003-9713-7109
- GitHub: https://github.com/DominiqueLoyer
- Affiliation: Université du Québec à Montréal (UQAM)
- Solution: Fine-tuning on morphologically rich dataset
- Pre-trained model handles inflectional suffixes
- Subword tokenization captures Russian morphology
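How subword tokenization captures Russian morphology can be illustrated with the model's own SentencePiece tokenizer. This is an assumed demonstration, not project code: inflected case forms of the same noun are split into pieces that share a common stem, so the model generalizes across endings.

```python
# Illustrative sketch (assumed behavior, not project code): the model's
# SentencePiece tokenizer splits inflected Russian forms into subword
# pieces, so different case endings of the same noun share a stem.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-ru")

# Three case forms of "book": nominative, genitive, dative.
for word in ["книга", "книги", "книге"]:
    # text_target= encodes with the target-language (Russian) model.
    ids = tokenizer(text_target=word)["input_ids"]
    pieces = tokenizer.convert_ids_to_tokens(ids)
    print(word, pieces)
```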
- Solution: Filtering of low-quality alignments
- Validation on in-domain test set
- Manual review of translation outputs
- Multilingual support (French, German)
- Bidirectional translation (RU-EN)
- Domain-specific fine-tuning
- Integration with back-translation for data augmentation
- Evaluation on news domain benchmark
Contributions welcome! Areas of interest:
- Back-translation data augmentation
- Domain adaptation experiments
- Evaluation on standard benchmarks
Last Updated: January 6, 2026
Status: Ready for production use