Large-Scale Text Corpus Deduplication and Dataset Enhancement
- Developed and deployed suffix-array-based text corpus deduplication on MapReduce, raising assessor F1 score from 0.77 to 0.82.
- Trained a classifier to improve benchmark coverage, enhancing dataset relevance for pretraining tasks.
- Created a dataset augmentation pipeline based on back-translation, increasing pretraining robustness and generalization.
- Improved document parsing quality, resulting in 3% faster model convergence and better resource efficiency.