Original title: Oasis: Data Curation and Assessment System for Pretraining of Large Language Models
Authors: Tong Zhou, Yubo Chen, Pengfei Cao, Kang Liu, Jun Zhao, Shengping Liu
This article introduces Oasis, a system for curating and assessing pretraining data for large language models. Prior systems offered little support for customizing data curation pipelines or optimizing them iteratively. Oasis addresses both needs, providing data curation and quality assessment through interactive interfaces. Its interactive rule filter refines filtering rules based on user feedback; its debiased neural filter identifies and mitigates biases in neural quality filtering; and its adaptive document deduplication handles deduplication at scale. Together, these components form a customizable data curation module. A complementary assessment module evaluates corpora from both local and global perspectives, combining human evaluation, GPT-4 judgments, and heuristic metrics. The article walks through Oasis's complete workflow for building and evaluating pretraining data, and the authors release an 800GB bilingual corpus curated with the system.
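To make the pipeline concrete, here is a minimal sketch of the two curation stages described above: rule-based filtering followed by document deduplication. All rules, function names, and thresholds are illustrative assumptions, not Oasis's actual implementation; in particular, Oasis tunes its rules interactively from feedback, and large-scale systems typically use approximate deduplication (e.g., MinHash) rather than the exact hashing shown here.

```python
import hashlib

# Illustrative filtering rules (assumed for this sketch; Oasis derives
# its rules interactively from user feedback rather than a static list).
RULES = [
    lambda doc: len(doc.split()) >= 5,              # drop very short documents
    lambda doc: "lorem ipsum" not in doc.lower(),   # drop placeholder text
]

def rule_filter(docs):
    """Keep only documents that satisfy every rule."""
    return [d for d in docs if all(rule(d) for rule in RULES)]

def deduplicate(docs):
    """Drop exact duplicates via content hashing.

    Real pretraining pipelines usually rely on approximate, scalable
    methods such as MinHash; exact SHA-256 hashing keeps this sketch simple.
    """
    seen, unique = set(), []
    for d in docs:
        digest = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(d)
    return unique

corpus = [
    "Large language models are pretrained on web-scale text corpora.",
    "Large language models are pretrained on web-scale text corpora.",  # duplicate
    "too short",                                                        # fails rule
]
curated = deduplicate(rule_filter(corpus))
```

Running the sketch on the toy corpus keeps a single document: the short document is rejected by the rules, and the repeated document is collapsed by deduplication.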
Original article: https://arxiv.org/abs/2311.12537