Model Performance Begins with Data: Researchers from AI2 Release DataDecide, a Benchmark Suite to Understand Pretraining Data Impact Across 30K LLM Checkpoints


The Challenge of Data Selection in LLM Pretraining

Developing large language models entails significant computational investment, especially when experimenting with alternative pretraining corpora. Comparing datasets at full scale, on the order of billions of parameters and hundreds of billions of tokens, can consume hundreds of thousands of GPU hours per run. Consequently, practitioners resort to smaller-scale experiments as proxies for large-model behavior. Yet these "pilot" studies are rarely published, producing a fragmented landscape in which each lab repeats similar small-scale tests without shared benchmarks or methodologies. This opacity impedes reproducibility, underutilizes collective insights, and obscures the real trade-offs between development compute and final model performance.

DataDecide

To address these limitations, the Allen Institute for AI (AI2), in collaboration with the University of Washington and the University of Pennsylvania, releases DataDecide: a comprehensive suite of controlled pretraining experiments spanning 25 distinct corpora and 14 model sizes from 4 million to 1 billion parameters. DataDecide's datasets include well-known sources such as Dolma, DCLM, RefinedWeb, C4, and FineWeb, alongside variations produced by domain ablation, deduplication, quality filtering, and source mixing. Each model is trained at a fixed token-to-parameter ratio of 100 (100 tokens per parameter), reflecting the "overtraining" regime that optimizes inference efficiency. In total, over 1,050 models and more than 30,000 checkpoints, each evaluated across ten downstream tasks, are released to the public.
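Because the token-to-parameter ratio is fixed, each configuration's training budget follows directly from its parameter count. A minimal sketch: the 4M and 1B endpoints come from the article, while the 150M midpoint is an illustrative intermediate size, not a confirmed configuration.

```python
# Training-token budgets under DataDecide's fixed ratio of 100 tokens per parameter.
# The 4M and 1B endpoints are from the article; 150M is an illustrative midpoint.
TOKENS_PER_PARAM = 100

for params in (4_000_000, 150_000_000, 1_000_000_000):
    tokens = params * TOKENS_PER_PARAM
    print(f"{params / 1e6:>6.0f}M params -> {tokens / 1e9:6.1f}B training tokens")
```

Under this rule, the 1B-parameter target models each see roughly 100 billion training tokens.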

Technical Structure and Pragmatic Benefits

DataDecide organizes experiments along three axes:

    • Data Recipes: Twenty-five well-documented pretraining corpora, each embodying a different curation strategy (see Table 1 in the paper for full recipe specifications).
    • Model Scale: Fourteen parameter configurations (4M to 1B), programmatically derived via the OLMo model ladder to ensure consistent training hyperparameters across scales. Each non-target scale includes two "early-stop" seed runs, while the 1B-parameter models feature three complete seed reruns to quantify variability (a checkpoint-loading sketch follows this list).
    • Evaluation Suite: The OLMES benchmark of ten multiple-choice tasks (e.g., MMLU, ARC Easy/Challenge, HellaSwag, MBPP, HumanEval) provides a multifaceted view of language understanding, commonsense reasoning, and code generation performance.
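Since the checkpoints are published on Hugging Face, reusing one should require only the standard transformers loading path. A minimal sketch, assuming a hypothetical repository id and revision label; consult the DataDecide collection on Hugging Face for the actual naming scheme.

```python
# Hypothetical sketch: loading one released DataDecide checkpoint from the
# Hugging Face Hub via the standard transformers API. The repository id and
# revision below are placeholders, not confirmed names.
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "allenai/DataDecide-dolma-150M"  # placeholder repository id
REVISION = "main"                          # intermediate checkpoints may live under other revisions

tokenizer = AutoTokenizer.from_pretrained(REPO_ID, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(REPO_ID, revision=REVISION)

inputs = tokenizer("Pretraining data quality matters because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```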

By releasing both the pretraining datasets and the corresponding models, DataDecide enables researchers to:

    • Reuse checkpoints for new evaluations without retraining.
    • Experiment with new prediction methods (e.g., advanced scaling-law fits, smoothing techniques); see the curve-fitting sketch after this list.
    • Investigate benchmark sensitivity to training data and model scale.
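As one example of such a prediction method, a saturating power law can be fitted to small-scale pilot results and extrapolated to the 1B target. The functional form and the toy numbers below are assumptions for illustration, not DataDecide's published fitting procedure.

```python
# Illustrative scaling-law extrapolation: fit loss = a * N^(-b) + c on
# small-scale pilot runs, then predict the metric at the 1B target scale.
# The ansatz and the toy data are assumptions, not results from the paper.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n_millions, a, b, c):
    # Saturating power law in model size (millions of parameters).
    return a * n_millions ** (-b) + c

# Toy (model size in millions, eval loss) pairs standing in for pilot runs.
sizes = np.array([4.0, 20.0, 60.0, 150.0, 300.0])
losses = np.array([4.90, 4.20, 3.80, 3.55, 3.40])

(a, b, c), _ = curve_fit(scaling_law, sizes, losses, p0=(5.0, 0.5, 3.0), maxfev=10_000)
print(f"extrapolated loss at 1B params: {scaling_law(1000.0, a, b, c):.2f}")
```

Notably, one of DataDecide's findings (discussed below) is that such extrapolations do not necessarily beat simply ranking datasets at a single small scale.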

Key Findings and Quantitative Insights

DataDecide's systematic analysis yields four practical guidelines:

    • Single-Scale Baseline Robustness: Ranking corpora by downstream accuracy at a single, small scale (e.g., 150M parameters) achieves roughly 80 percent decision accuracy for predicting the best dataset at the 1B-parameter target scale. In contrast, eight baseline scaling-law extrapolations do not surpass this simple heuristic, underscoring its cost-effectiveness.
    • Task-Dependent Compute Sensitivity: The compute budget required for reliable decisions varies markedly by task. Benchmarks like MMLU and ARC Easy become predictable with less than 0.01 percent of the target compute, whereas HellaSwag and SocialIQA require orders of magnitude more FLOPs to reach similar decision accuracy.
    • Proxy Metric Selection: Continuous likelihood metrics, specifically the character-normalized mean probability of correct continuations (CORRECT PROB) and total probability (TOTAL PROB), outperform discrete accuracy measures at small scales. This is most pronounced on code tasks (MBPP, HumanEval), where decision accuracy jumps from near-random to over 80 percent with CORRECT PROB as the proxy (a toy implementation of both quantities follows this list).
    • Variance and Spread Considerations: High decision accuracy correlates with low run-to-run variance (noise) and large performance spread across datasets. Proxy metrics that reduce noise or amplify spread thus directly improve prediction reliability.
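To make these quantities concrete, here is a toy sketch of decision accuracy (the share of dataset pairs ordered the same way by the small-scale proxy and the target-scale benchmark), together with one plausible reading of CORRECT PROB. All scores and the normalization details are illustrative assumptions, not the paper's exact definitions.

```python
# Toy sketch of two quantities from the article (definitions approximated):
#  * correct_prob: per-character-normalized probability of the correct
#    continuation, averaged over examples (one reading of CORRECT PROB).
#  * decision_accuracy: fraction of dataset pairs whose ordering under the
#    small-scale proxy matches their ordering at the 1B target scale.
from itertools import combinations
import math

def correct_prob(examples):
    """examples: (log-prob of correct continuation, its length in characters)."""
    return sum(math.exp(lp / n_chars) for lp, n_chars in examples) / len(examples)

def decision_accuracy(proxy_scores, target_scores):
    """Share of dataset pairs ranked identically by proxy and target scores."""
    pairs = list(combinations(proxy_scores, 2))
    agree = sum(
        (proxy_scores[x] - proxy_scores[y]) * (target_scores[x] - target_scores[y]) > 0
        for x, y in pairs
    )
    return agree / len(pairs)

# Made-up per-dataset scores: proxy = CORRECT PROB at 150M, target = accuracy at 1B.
proxy = {"dolma": 0.46, "dclm": 0.47, "c4": 0.35, "fineweb": 0.45}
target = {"dolma": 0.58, "dclm": 0.63, "c4": 0.51, "fineweb": 0.60}
print(f"decision accuracy: {decision_accuracy(proxy, target):.0%}")  # 83% on this toy data
```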

Concluding Perspective

DataDecide transforms pretraining data selection from an ad hoc art into a transparent, data-driven science. By open-sourcing all 25 corpora, 1,050 models, 30,000+ checkpoints, and evaluation scripts on Hugging Face and GitHub, AI2 invites the community to reproduce findings, extend evaluations to new benchmarks, and innovate on decision-making methods. As LLM development continues to demand ever-greater compute resources, DataDecide offers a principled framework for minimizing wasted experiments and maximizing insight, paving the way toward more efficient, reproducible, and collaborative AI research.


Check out the Paper, the models on Hugging Face, and the technical details.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
