Original title: LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms
Authors: Aditi Jha, Sam Havens, Jeremey Dohmann, Alex Trott, Jacob Portes
The article examines how large language models (LLMs) are typically fine-tuned on large instruction datasets, while recent studies suggest that small, high-quality datasets can suffice for general-purpose instruction following. This lack of consensus stems in part from diverging approaches to LLM evaluation. The authors ask whether a small, diverse set of fine-tuning samples can improve performance under two evaluation paradigms: traditional NLP benchmarks and open-ended, model-based evaluation. They fine-tune the open-source MPT-7B and MPT-30B models on instruction datasets of various sizes, ranging from 1k to 60k samples, and find that subsets of just 1k to 6k samples are sufficient to achieve good performance on both paradigms. They further show that mixing textbook-style and open-ended fine-tuning datasets improves performance across both evaluation methods. Overall, the work offers practical guidance for instruction tuning, suggesting that small, varied datasets are key to strong performance across diverse evaluation scenarios.
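To make the recipe concrete, below is a minimal sketch (not the authors' released code) of how one might subsample and blend two styles of instruction data with the Hugging Face `datasets` library. The dataset identifiers and the 50/50 mixing ratio are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of the subsample-and-blend setup: take small random
# subsets of a textbook-style and an open-ended instruction dataset and
# mix them into one fine-tuning set. Dataset names are hypothetical
# placeholders; both are assumed to share the same column schema.
from datasets import load_dataset, concatenate_datasets

SUBSET_SIZE = 1_000  # the paper finds 1k-6k samples can be sufficient

def subsample(ds, n, seed=42):
    """Shuffle and keep the first n examples (capped at the dataset size)."""
    return ds.shuffle(seed=seed).select(range(min(n, len(ds))))

# Hypothetical dataset identifiers standing in for the two styles of data.
textbook = load_dataset("example/textbook-style-instructions", split="train")
open_ended = load_dataset("example/open-ended-instructions", split="train")

# Blend equal halves of the two styles, then reshuffle the combined set.
mixed = concatenate_datasets([
    subsample(textbook, SUBSET_SIZE // 2),
    subsample(open_ended, SUBSET_SIZE // 2),
]).shuffle(seed=42)

print(f"Blended fine-tuning set: {len(mixed)} samples")
```

The resulting `mixed` dataset would then be passed to whatever fine-tuning loop or trainer one already uses; the paper's finding is about the size and composition of this set, not about any particular training framework.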
Original article: https://arxiv.org/abs/2311.13133