Askapro

Large-Scale Text Corpus Deduplication and Dataset Enhancement

  • Developed and deployed text corpus deduplication using a suffix array algorithm on MapReduce, boosting assessor F1-score from 0.77 to 0.82.
  • Trained a classifier to improve benchmark coverage, enhancing dataset relevance for pretraining tasks.
  • Created a dataset augmentation pipeline with Back Translation, increasing pretraining robustness and generalization.
  • Enhanced document parsing quality, resulting in 3% faster model convergence and improved resource efficiency.
Andrey worked on this case as an ML engineer at Yandex GPT.
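
The suffix-array deduplication named in the first bullet can be illustrated with a small single-machine sketch. This is not the MapReduce implementation from the case, whose details are not given here. It only shows the core idea: sort all suffixes, compare lexicographic neighbours, and treat any shared prefix of at least `min_len` characters as a repeated span. The function names, the character-level granularity, and the `min_len` threshold are all assumptions made for illustration, and the naive construction is only suitable for short inputs.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return suffix start positions sorted lexicographically (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of the suffixes starting at i and j."""
    n = len(text)
    k = 0
    while i + k < n and j + k < n and text[i + k] == text[j + k]:
        k += 1
    return k


def duplicated_mask(text: str, min_len: int) -> list[bool]:
    """Mark characters that belong to a later copy of a span repeated for >= min_len chars."""
    sa = build_suffix_array(text)
    mask = [False] * len(text)
    # Suffixes sharing a long prefix end up adjacent in the suffix array.
    for a, b in zip(sa, sa[1:]):
        k = common_prefix_len(text, a, b)
        if k >= min_len:
            later = max(a, b)  # mark the copy that starts later in the text
            for p in range(later, later + k):
                mask[p] = True
    return mask


def drop_duplicates(text: str, min_len: int) -> str:
    """Remove every character marked as part of a repeated span."""
    mask = duplicated_mask(text, min_len)
    return "".join(ch for ch, dup in zip(text, mask) if not dup)


if __name__ == "__main__":
    sample = "abcdefghij-XYZ-abcdefghij"
    print(drop_duplicates(sample, min_len=10))
```

In a real corpus the same comparison would typically run over token IDs, with a linear-time suffix array construction and the work sharded across workers. Overlapping repeats (for example long runs of one character) are handled only crudely in this sketch.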
Global
Git
Internal product
ML engineer
Natural Language Processing
Large Language Models (LLM)
Benchmark Coverage Optimisation
Ranking Optimisation
Back Translation
AI
Research and Development
LLM Pretraining Pipeline
Data Pipeline
Python
Pandas
NumPy
MapReduce
TensorBoard

Andrey

Machine Learning Engineer at Yandex Zen
