Part 1: Uploading a Dataset to Hugging Face Hub
Introduction
This portion of the tutorial walks you through the process of uploading a custom dataset to the Hugging Face Hub. The Hugging Face Hub is a platform that allows developers to share and collaborate on datasets and models for machine learning.
Here, we'll take an existing Python instruction-following dataset, transform it into a format suitable for training the latest Large Language Models (LLMs), and then upload it to Hugging Face for public use. We're specifically formatting our data to match the Llama 3.2 chat template, which makes it ready for fine-tuning Llama 3.2 models.
Step 1: Installation and Authentication
First, we need to install the necessary libraries and authenticate with the Hugging Face Hub:
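The original notebook cell isn't reproduced here, so the following is a minimal sketch based on the commands described below:

```python
# Install the Hugging Face datasets library; -q suppresses most output
!pip install -q datasets

# Authenticate with the Hub; this prompts for your access token
!huggingface-cli login
```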
What's happening here:
- datasets is Hugging Face's library for working with machine learning datasets
- The quiet flag -q reduces installation output messages
- huggingface-cli login will prompt you to enter your Hugging Face authentication token
- You can find your token by going to your Hugging Face account settings → Access Tokens
After running this cell, you will be prompted to enter your token. This authenticates your session and allows you to push content to the Hub.
Step 2: Load the Dataset and Define the Transformation Function
Next, we'll load an existing dataset and define a function to transform it to match the Llama 3.2 chat format:
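A sketch of this cell is shown below. The dataset ID and special tokens come from the article; the exact system prompt wording and the Alpaca-style column names (instruction, output) are assumptions:

```python
from datasets import load_dataset

# Load the Python instruction-following dataset from the Hub
dataset = load_dataset("Vezora/Tested-143k-Python-Alpaca", split="train")

# Assumed wording; the article describes the prompt's intent, not its exact text
SYSTEM_PROMPT = (
    "You are an expert Python coding assistant. Follow best practices and "
    "provide well-commented, efficient solutions."
)

def transform_conversation(example):
    """Restructure one example into the Llama 3.2 chat template."""
    text = (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{SYSTEM_PROMPT}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{example['instruction']}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{example['output']}<|eot_id|>"
    )
    return {"text": text}
```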
What's happening here:
- We load the 'Vezora/Tested-143k-Python-Alpaca' dataset, which contains Python programming instructions and outputs
- We define a transformation function that restructures each example into the Llama 3.2 chat format
- We include a detailed system prompt that gives the model context about its role as a Python coding assistant
- The special tokens such as <|begin_of_text|>, <|start_header_id|>, and <|eot_id|> are Llama 3.2's way of formatting conversational data
- This function creates a properly formatted conversation with system, user, and assistant messages
The system prompt is particularly important, as it defines the persona and behavior expectations for the model. In this case, we're instructing the model to act as an expert Python coding assistant that follows best practices and provides well-commented, efficient solutions.
Step 3: Apply the Transformation to the Dataset
Now we apply our transformation function to the entire dataset:
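A one-line sketch, assuming the dataset and function defined above:

```python
# map() runs transform_conversation over every example in the dataset
transformed_dataset = dataset.map(transform_conversation)
```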
What's happening here:
- The map() function applies our transformation function to each example in the dataset
- This processes all 143,000 examples in the dataset, reformatting them into the Llama 3.2 chat format
- The result is a new dataset with the same content but structured properly for fine-tuning Llama 3.2
This transformation is crucial because it reformats the data into the specific template required by the Llama 3.2 model family. Without this formatting, the model wouldn't recognize the different roles in the conversation (system, user, assistant) or where each message begins and ends.
Step 4: Upload the Dataset to Hugging Face Hub
With our dataset prepared, we can now upload it to the Hugging Face Hub:
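A minimal sketch of the upload cell, using the repository name given below:

```python
# Creates (or updates) a repository under your username and uploads the data
transformed_dataset.push_to_hub("Llama-3.2-Python-Alpaca-143k")
```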
What's happening here:
- The push_to_hub() method uploads our transformed dataset to the Hugging Face Hub
- "Llama-3.2-Python-Alpaca-143k" will be the name of your dataset repository
- This creates a new repository under your username: https://huggingface.co/datasets/YOUR_USERNAME/Llama-3.2-Python-Alpaca-143k
- The dataset will now be publicly available for others to download and use
After running this cell, you'll see progress bars indicating the upload status. Once complete, you can visit the Hugging Face Hub to view your newly uploaded dataset, edit its description, and share it with the community.
This dataset is now ready to be used for fine-tuning Llama 3.2 models on Python programming tasks, with properly formatted conversations that include system instructions, user queries, and assistant responses!
Part 2: Fine-tuning and Uploading a Model to Hugging Face Hub
Now that we've prepared and uploaded our dataset, let's move on to fine-tuning a model and uploading it to the Hugging Face Hub.
Step 1: Install Required Libraries
First, we need to install all the necessary libraries for fine-tuning large language models efficiently:
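The exact install cell isn't shown, so the following is a plausible reconstruction covering the libraries named below:

```python
# Unsloth for fast LLM fine-tuning, plus the training dependencies
!pip install -q unsloth
!pip install -q --upgrade transformers trl peft accelerate bitsandbytes xformers
```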
What this does: Installs Unsloth (a library for faster LLM fine-tuning), the latest version of Transformers, TRL (for reinforcement learning), PEFT (for parameter-efficient fine-tuning), and other dependencies needed for training. The xformers and bitsandbytes libraries help with memory efficiency.
Step 2: Load the Dataset
Next, we load the dataset we prepared in the previous section:
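A sketch of this cell; the maximum sequence length of 2048 is an assumed value, and YOUR_USERNAME stands in for your Hub account:

```python
from datasets import load_dataset

max_seq_length = 2048  # assumed value; pick what fits your GPU memory

# Load the dataset uploaded in Part 1
dataset = load_dataset("YOUR_USERNAME/Llama-3.2-Python-Alpaca-143k", split="train")
```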
What this does: Sets the maximum sequence length for our model and loads our previously uploaded Python coding dataset from Hugging Face.
Step 3: Load the Pre-trained Model
Now we load a quantized version of Llama 3.2:
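A sketch using Unsloth's loading API; the checkpoint name matches Unsloth's published 4-bit Llama 3.2 3B Instruct repository, and the other arguments are common defaults rather than values from the original cell:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct-bnb-4bit",  # 4-bit quantized checkpoint
    max_seq_length=max_seq_length,
    dtype=None,         # auto-detects float16/bfloat16 for your GPU
    load_in_4bit=True,  # 4-bit quantization to reduce memory use
)
```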
What this does: Loads a 4-bit quantized version of the Llama 3.2 3B Instruct model from Unsloth's repository. Quantization reduces the memory footprint while maintaining most of the model's performance.
Step 4: Configure PEFT (Parameter-Efficient Fine-Tuning)
We'll set up the model for efficient fine-tuning using LoRA (Low-Rank Adaptation):
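A sketch of the LoRA configuration. The rank of 16 and the targeted projection layers come from the description below; the remaining hyperparameters are typical Unsloth defaults and should be treated as assumptions:

```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank described in the text
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # memory-efficient checkpointing
    random_state=3407,
)
```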
What this does: Configures the model for Parameter-Efficient Fine-Tuning with LoRA. This technique trains only a small number of new parameters while keeping most of the original model frozen, allowing efficient training with limited resources. We're targeting specific projection layers in the model with a rank of 16.
Step 5: Mount Google Drive for Saving
To ensure our trained model is saved even if the session disconnects:
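A minimal sketch (Colab-specific):

```python
from google.colab import drive

# Mount Drive at /content/drive so checkpoints survive session disconnects
drive.mount("/content/drive")
```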
What this does: Mounts your Google Drive to save checkpoints and the final model.
Step 6: Set Up Training and Start Training
Now we configure and start the training process:
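A sketch of the training cell. The batch size, gradient accumulation, step count, and learning rate come from the description below; the output path and remaining arguments are assumptions, and the SFTTrainer signature follows the older TRL API used in Unsloth examples:

```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",      # the column produced in Part 1
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,
        learning_rate=2e-4,
        logging_steps=1,
        optim="adamw_8bit",
        output_dir="/content/drive/MyDrive/llama32_checkpoints",  # hypothetical path
        save_steps=20,
    ),
)
trainer.train()
```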
What this does: Creates a Supervised Fine-Tuning Trainer with our model, dataset, and training parameters. The training runs for 60 steps with a batch size of 2, gradient accumulation of 4, and a learning rate of 2e-4. The model checkpoints will be saved to Google Drive.
Step 7: Save the Fine-tuned Model Locally
After training, we save our model:
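A minimal sketch; "lora_model" is a hypothetical directory name:

```python
# Save the LoRA adapter weights and tokenizer to a local directory
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
```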
What this does: Saves the fine-tuned LoRA model and tokenizer to a local directory.
Step 8: Upload the Model to Hugging Face Hub
Finally, we upload our fine-tuned model to Hugging Face:
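A minimal sketch; the repository name is a hypothetical choice:

```python
# Push the fine-tuned adapter and tokenizer to the Hub
model.push_to_hub("YOUR_USERNAME/Llama-3.2-3B-Instruct-Python-Alpaca")
tokenizer.push_to_hub("YOUR_USERNAME/Llama-3.2-3B-Instruct-Python-Alpaca")
```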
In this guide, we demonstrated a complete workflow for AI model customization using Hugging Face. We transformed a Python instruction dataset into the Llama 3.2 format with a specialized system prompt and uploaded it as "Llama-3.2-Python-Alpaca-143k". We then fine-tuned a Llama 3.2 model using efficient techniques (4-bit quantization and LoRA) with minimal computing resources. Finally, we shared both resources on the Hugging Face Hub, making our Python coding assistant available to the community. This project showcases how accessible AI development has become, enabling developers to create specialized models for specific tasks with relatively modest resources.
Here are the accompanying notebooks: Colab Notebook_Llama_3_2_3B_Instruct_code and Colab Notebook_Llama_3_2_Python_Alpaca_143k.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.