Researchers from Tsinghua and ModelBest Release Ultra-FineWeb: A Trillion-Token Dataset Enhancing LLM Accuracy Across Benchmarks


The quality of the data used to pretrain LLMs has become increasingly critical to their success. To build high-quality corpora, researchers have moved from heuristic filtering methods, such as rule-based noise removal and deduplication, to model-driven filtering, which leverages neural classifiers to identify high-quality samples. Despite its benefits, this approach still faces key issues: it lacks efficient validation mechanisms to assess data quality promptly, and it often relies on manually curated seed datasets that introduce subjectivity. While early datasets such as C4 and Pile laid the groundwork for model development, recent efforts such as RefinedWeb, Dolma, and DCLM have scaled significantly, incorporating up to trillions of tokens. Model-driven filtering has gained traction in these newer corpora for its ability to refine massive datasets and enhance LLM performance across downstream tasks.

Nevertheless, the effectiveness of model-driven filtering is limited by the high costs and inefficiencies of current validation methods and by the absence of clear standards for seed data selection. Recent datasets, such as FineWeb-edu and Ultra-FineWeb, have demonstrated improved model performance by using multiple classifiers to cross-verify data quality. These datasets outperform previous versions on benchmarks such as MMLU, ARC, and C-Eval, indicating that refined filtering methods can enhance both English and Chinese understanding. To further optimize this process, some studies propose using LLMs for multi-dimensional data evaluation via prompts or leveraging token-level perplexity scores. These innovations aim to lower computational overhead while improving data quality, ultimately enabling more effective training with fewer tokens.

Researchers from ModelBest Inc., Tsinghua University, and Soochow University developed an efficient data filtering pipeline to improve LLM training. They introduced a verification strategy that uses a nearly-trained LLM to evaluate new data by observing the performance gains it yields during the final training steps, reducing computational costs. A lightweight fastText-based classifier further improves filtering speed and accuracy. Applied to the FineWeb and Chinese FineWeb datasets, this method produced the Ultra-FineWeb dataset, containing 1 trillion English tokens and 120 billion Chinese tokens. LLMs trained on Ultra-FineWeb showed notable performance gains, confirming the pipeline's effectiveness in improving data quality and training efficiency.
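
To make the verification idea concrete, the sketch below resumes a causal language model from a late-stage checkpoint, anneals it briefly on a batch of candidate documents, and then measures held-out loss as a cheap proxy for downstream scores. This is a minimal illustration of the concept rather than the authors' implementation: the checkpoint path, step count, learning rate, sequence length, and the use of held-out loss instead of a full benchmark suite are all assumptions.

```python
# Minimal sketch of the "nearly-trained checkpoint" verification idea.
# Assumptions (not from the paper): checkpoint path, step count, learning rate,
# and held-out loss as a stand-in for benchmark evaluation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT = "path/to/nearly-trained-checkpoint"  # hypothetical local checkpoint
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


def mean_heldout_loss(model, tokenizer, texts):
    """Average next-token loss on a small held-out set (proxy for benchmark scores)."""
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt", truncation=True,
                            max_length=1024)["input_ids"].to(DEVICE)
            losses.append(model(input_ids=ids, labels=ids).loss.item())
    return sum(losses) / len(losses)


def verify_candidate_data(candidate_texts, heldout_texts, steps=100, lr=1e-5):
    """Resume from a nearly-trained checkpoint, anneal briefly on candidate data,
    and return the held-out loss reached afterwards (lower is better)."""
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = AutoModelForCausalLM.from_pretrained(CHECKPOINT).to(DEVICE)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    model.train()
    for step in range(steps):
        text = candidate_texts[step % len(candidate_texts)]
        ids = tokenizer(text, return_tensors="pt", truncation=True,
                        max_length=1024)["input_ids"].to(DEVICE)
        loss = model(input_ids=ids, labels=ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Candidate data whose short annealing run beats the baseline checkpoint's
    # held-out score is treated as higher quality.
    return mean_heldout_loss(model, tokenizer, heldout_texts)
```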

The study outlines an efficient, high-quality data filtering pipeline designed to reduce computational costs while maintaining data integrity. It begins with a cost-effective verification strategy that selects reliable seed samples from a candidate pool, which are then used to train a data classifier. Positive seeds are sourced from LLM annotations, curated datasets, textbooks, and synthesized content, while negatives come from diverse corpora. Classifier training avoids over-iteration, focusing instead on high-quality seed selection. A fastText-based classifier is used for scalable filtering, offering competitive performance at significantly lower inference cost than LLM-based methods, with preprocessing steps ensuring balanced, clean data input.
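
The classifier side of the pipeline is straightforward to sketch with the open-source fasttext package, as shown below. The snippet follows the general train-then-filter pattern described above; the seed file layout, label names, hyperparameters, and acceptance threshold are illustrative assumptions rather than the paper's actual settings.

```python
# Minimal sketch of a fastText-based quality filter, assuming labeled seed files
# in fastText's "__label__<name> <text>" format. File names, hyperparameters,
# and the 0.5 keep-threshold are illustrative, not values from the paper.
import fasttext

# seeds.train.txt contains lines such as:
#   __label__hq  <a high-quality seed document on one line>
#   __label__lq  <a low-quality / negative document on one line>
model = fasttext.train_supervised(
    input="seeds.train.txt",
    epoch=3,
    lr=0.1,
    wordNgrams=2,
    dim=100,
)


def keep(document: str, threshold: float = 0.5) -> bool:
    """Return True if the classifier scores the document as high quality."""
    labels, probs = model.predict(document.replace("\n", " "))
    return labels[0] == "__label__hq" and probs[0] >= threshold


# Stream a large corpus (assumed one document per line) and keep only the
# documents the classifier accepts.
with open("raw_corpus.txt", encoding="utf-8") as src, \
        open("filtered_corpus.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if keep(line.strip()):
            dst.write(line)
```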

The models were trained using Megatron-LM with the MiniCPM-1.2B architecture on 100B tokens, and evaluations used Lighteval across English and Chinese benchmarks. The results show that models trained on Ultra-FineWeb consistently outperformed those trained on FineWeb and FineWeb-edu, both individually and in mixed-language settings. Ultra-FineWeb-en achieved the highest English mean score, while Ultra-FineWeb-zh improved performance on Chinese tasks. Ablation studies revealed that Ultra-FineWeb maintains balanced token lengths and benefits from its efficient filtering strategy, highlighting its superior quality and effectiveness in improving model performance.

In conclusion, the study presents Ultra-FineWeb, a high-quality multilingual dataset comprising about 1 trillion English tokens and 120 billion Chinese tokens. Built upon FineWeb and Chinese FineWeb, it leverages a novel, efficient data filtering pipeline featuring a lightweight fastText-based classifier and a low-cost verification strategy. The pipeline improves filtering accuracy, reduces reliance on manual seed data selection, and ensures robust performance with minimal computational overhead. Experimental results show that models trained on Ultra-FineWeb consistently outperform those trained on earlier datasets, demonstrating improved performance across benchmarks. The methodology ensures reproducibility and offers valuable insights for optimizing data quality in future LLM training.


Check out the Paper and Dataset. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 90k+ ML SubReddit.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
