Uploading Datasets To Hugging Face: A Step-by-step Guide

Part 1: Uploading a Dataset to Hugging Face Hub

Introduction

This portion of the tutorial walks you through the process of uploading a custom dataset to the Hugging Face Hub. The Hugging Face Hub is a platform that allows developers to share and collaborate on datasets and models for machine learning.

Here, we’ll take an existing Python instruction-following dataset, transform it into a format suitable for training the latest Large Language Models (LLMs), and then upload it to Hugging Face for public use. We’re specifically formatting our data to match the Llama 3.2 chat template, which makes it ready for fine-tuning Llama 3.2 models.

Step 1: Installation and Authentication

First, we need to install the necessary libraries and authenticate with the Hugging Face Hub:

!pip install -q datasets
!huggingface-cli login

What’s happening here:

  • datasets is Hugging Face’s library for working with machine learning datasets
  • The quiet flag -q reduces installation output messages
  • huggingface-cli login will prompt you to enter your Hugging Face authentication token
  • You can find your token by going to your Hugging Face account settings → Access Tokens

After running this cell, you will be prompted to enter your token. This authenticates your session and allows you to push content to the Hub.
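
If you prefer to authenticate from inside a notebook cell rather than through the CLI, the huggingface_hub library also provides a login() helper; here is a minimal sketch (the token value is a placeholder, never commit a real one):

from huggingface_hub import login

# Programmatic alternative to `huggingface-cli login`
# Paste your token from Settings → Access Tokens (placeholder value shown)
login(token="hf_xxx_your_token_here")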

Step 2: Load the Dataset and Define the Transformation Function

Next, we’ll load an existing dataset and define a function to transform it to match the Llama 3.2 chat format:

from datasets import load_dataset

# Load your complete custom dataset
dataset = load_dataset('Vezora/Tested-143k-Python-Alpaca')

# Define a function to transform the data
def transform_conversation(example):
    system_prompt = """
    You are an expert Python coding assistant. Your role is to help users write clean, efficient, and bug-free Python code.
    You have been trained on a diverse set of high-quality Python code samples, all of which passed rigorous
    automated testing for functionality and performance.

    Always follow best practices in Python programming, provide concise and readable solutions,
    and ensure that your responses include informative comments when necessary.

    When presented with a coding problem, first create a detailed pseudocode that outlines the
    structure and logic of the solution step-by-step. Once the pseudocode is complete, follow it
    to generate the actual Python code. This approach will help ensure clarity and alignment with
    the desired logic before writing the code.

    If asked to modify existing code, provide pseudocode highlighting the changes and optimizations
    to be made, focusing on improvements related to performance, error handling, and robustness.
    Remember to explain your thought process and rationale clearly for any modifications or code
    suggestions you provide.
    """
    instruction = example['instruction'].strip()  # Accessing the instruction column
    output = example['output'].strip()            # Accessing the output column

    # Apply the Llama 3.2 chat template
    formatted_text = (
        f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system_prompt}
<|eot_id|>\n<|start_header_id|>user<|end_header_id|>
{instruction}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{output}<|eot_id|>"""
    )

    return {'text': formatted_text}

What’s happening here:

  • We load the ‘Vezora/Tested-143k-Python-Alpaca’ dataset, which contains Python programming instructions and outputs
  • We define a transformation function that restructures each example into the Llama 3.2 chat format
  • We include a detailed system prompt that gives the model context about its role as a Python coding assistant
  • The special tokens such as <|begin_of_text|>, <|start_header_id|>, and <|eot_id|> are Llama 3.2’s way of formatting conversational data
  • This function creates a properly formatted conversation with system, user, and assistant messages

The system prompt is particularly important as it defines the persona and behavior expectations for the model. In this case, we’re instructing the model to act as an expert Python coding assistant that follows best practices and provides well-commented, efficient solutions.
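
Before processing the full dataset, it can be worth sanity-checking the function on a single record; a quick sketch, assuming the dataset loaded as above:

# Preview the transformation on one example before mapping it over all 143k records
sample = transform_conversation(dataset['train'][0])
print(sample['text'][:500])  # first 500 characters of the formatted conversation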

Step 3: Apply the Transformation to the Dataset

Now we apply our transformation function to the entire dataset:

# Apply the transformation to the entire dataset
transformed_dataset = dataset['train'].map(transform_conversation)

What’s happening here:

  • The map() function applies our transformation function to every example in the dataset
  • This processes all 143,000 examples in the dataset, reformatting them into the Llama 3.2 chat format
  • The result is a new dataset with the same content but structured properly for fine-tuning Llama 3.2

This transformation is important because it reformats the data into the specific template required by the Llama 3.2 model family. Without this formatting, the model wouldn’t recognize the different roles in the conversation (system, user, assistant) or where each message begins and ends.
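
If you want the uploaded dataset to contain only the formatted text column, map() can also drop the original columns; an optional sketch:

# Optionally keep only the new 'text' column by removing the originals during map()
transformed_dataset = dataset['train'].map(
    transform_conversation,
    remove_columns=dataset['train'].column_names
)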

Step 4: Upload the Dataset to Hugging Face Hub

With our dataset prepared, we can now upload it to the Hugging Face Hub:

transformed_dataset.push_to_hub("Llama-3.2-Python-Alpaca-143k")

What’s happening here:

  • The push_to_hub() method uploads our transformed dataset to the Hugging Face Hub
  • “Llama-3.2-Python-Alpaca-143k” will be the name of your dataset repository
  • This creates a new repository under your username: https://huggingface.co/datasets/YOUR_USERNAME/Llama-3.2-Python-Alpaca-143k
  • The dataset will now be publicly available for others to download and use

After running this cell, you’ll see progress bars indicating the upload status. Once complete, you can visit the Hugging Face Hub to view your newly uploaded dataset, edit its description, and share it with the community.
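
To confirm the upload worked, you can reload the dataset from the Hub under your own username (placeholder shown); if you would rather keep it private, push_to_hub() also accepts private=True:

from datasets import load_dataset

# Replace YOUR_USERNAME with your Hugging Face username
check = load_dataset("YOUR_USERNAME/Llama-3.2-Python-Alpaca-143k", split="train")
print(check)                   # row count and column names
print(check[0]['text'][:300])  # preview one formatted example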

This dataset is now ready to be used for fine-tuning Llama 3.2 models on Python programming tasks, with properly formatted conversations that include system instructions, user queries, and assistant responses!

Part 2: Fine-tuning and Uploading a Model to Hugging Face Hub

Now that we’ve prepared and uploaded our dataset, let’s move on to fine-tuning a model and uploading it to the Hugging Face Hub.

Step 1: Install Required Libraries

First, we need to install all the necessary libraries for fine-tuning large language models efficiently:

!pip instal "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" !pip instal "git+https://github.com/huggingface/transformers.git" !pip instal -U trl !pip instal --no-deps trl peft accelerate bitsandbytes !pip instal torch torchvision torchaudio triton !pip instal xformers !python -m xformers.info !python -m bitsandbytes

What this does: Installs Unsloth (a library for faster LLM fine-tuning), the latest version of Transformers, TRL (for reinforcement learning), PEFT (for parameter-efficient fine-tuning), and the other dependencies needed for training. The xformers and bitsandbytes libraries help with memory efficiency.

Step 2: Load the Dataset

Next, we load the dataset we prepared in the previous section:

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
import torch
from datasets import load_dataset

max_seq_length = 2048
dataset = load_dataset("nikhiljatiwal/Llama-3.2-Python-Alpaca-143k", split="train")

What this does: Sets the maximum sequence length for our model and loads our previously uploaded Python coding dataset from Hugging Face.

Step 3: Load the Pre-trained Model

Now we load a quantized version of Llama 3.2:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True
)

What this does: Loads a 4-bit quantized version of the Llama 3.2 3B Instruct model from Unsloth’s repository. Quantization reduces the memory footprint while maintaining most of the model’s performance.
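
As a rough check on the savings, you can print the model’s memory footprint; get_memory_footprint() is a standard Transformers method, and the exact number will vary with your setup:

# Approximate memory taken by the quantized weights, in gigabytes
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")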

Step 4: Configure PEFT (Parameter-Efficient Fine-Tuning)

We’ll set up the model for efficient fine-tuning using LoRA (Low-Rank Adaptation):

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,  # Supports any, but = 0 is optimized
    bias = "none",     # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth",  # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,   # We support rank stabilized LoRA
    loftq_config = None,  # And LoftQ
    max_seq_length = max_seq_length
)

What this does: Configures the model for Parameter-Efficient Fine-Tuning with LoRA. This method only trains a small number of new parameters while keeping most of the original model frozen, allowing efficient training with limited resources. We’re targeting specific projection layers in the model with a rank of 16.
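
To see just how small that trainable fraction is, the PEFT wrapper exposes print_trainable_parameters(); the exact counts depend on the configuration above:

# Report trainable vs. total parameters after attaching the LoRA adapters
model.print_trainable_parameters()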

Step 5: Mount Google Drive for Saving

To ensure our trained model is saved even if the session disconnects:

from google.colab import drive
drive.mount("/content/drive")

What this does: Mounts your Google Drive to save checkpoints and the final model.

Step 6: Set Up Training and Start Training

Now we configure and start the training process:

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        # num_train_epochs = 1,  # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "/content/drive/My Drive/Llama-3.2-3B-Instruct-bnb-4bit"
    ),
)
trainer.train()

What this does: Creates a Supervised Fine-Tuning Trainer with our model, dataset, and training parameters. The training runs for 60 steps with a batch size of 2, gradient accumulation of 4, and a learning rate of 2e-4. The model checkpoints will be saved to Google Drive.
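
If the Colab session drops mid-run, training can usually be resumed from the most recent checkpoint written to output_dir on Drive; note that for a short run like this you may need to set save_steps in TrainingArguments so a checkpoint is actually written. A sketch using the standard Trainer argument:

# Resume from the latest checkpoint found in output_dir (requires a saved checkpoint)
trainer.train(resume_from_checkpoint=True)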

Step 7: Save the Fine-tuned Model Locally

After training, we save our model:

model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")

What this does: Saves the fine-tuned LoRA model and tokenizer to a local directory.
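
To try the fine-tuned model before uploading it, the saved directory can be reloaded with Unsloth and put into inference mode; a sketch assuming the same Colab GPU session (the prompt string is just an illustration):

from unsloth import FastLanguageModel

# Reload the LoRA adapter saved above
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # enables Unsloth's faster inference path

inputs = tokenizer("Write a Python function that reverses a string.", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))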

Step 8: Upload the Model to Hugging Face Hub

Finally, we upload our fine-tuned model to Hugging Face:

import os
from google.colab import userdata

HF_TOKEN = userdata.get('HF_WRITE_API_KEY')

model.push_to_hub_merged(
    "nikhiljatiwal/Llama-3.2-3B-Instruct-code-bnb-4bit",
    tokenizer,
    save_method = "merged_16bit",
    token = HF_TOKEN
)
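
push_to_hub_merged() merges the LoRA weights into the base model and uploads the full 16-bit checkpoint. If you only want to share the small adapter instead, the standard push_to_hub methods work as well (the repository name below is just an example):

# Alternative: push only the LoRA adapter, which is far smaller than the merged model
model.push_to_hub("YOUR_USERNAME/Llama-3.2-3B-Instruct-code-lora", token=HF_TOKEN)
tokenizer.push_to_hub("YOUR_USERNAME/Llama-3.2-3B-Instruct-code-lora", token=HF_TOKEN)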

In this guide, we demonstrated a complete workflow for AI model customization using Hugging Face. We transformed a Python instruction dataset into the Llama 3.2 format with a specialized system prompt and uploaded it as “Llama-3.2-Python-Alpaca-143k”. We then fine-tuned a Llama 3.2 model using efficient techniques (4-bit quantization and LoRA) with minimal computing resources. Finally, we shared both resources on the Hugging Face Hub, making our Python coding assistant available to the community. This project showcases how accessible AI development has become, enabling developers to create specialized models for specific tasks with relatively modest resources.


Here are the Colab notebooks: Colab Notebook_Llama_3_2_3B_Instruct_code and Colab Notebook_Llama_3_2_Python_Alpaca_143k.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
