Lora Basics 260304

From Game in the Brain Wiki

Comprehensive Guide: Creating a LoRA on AMD Infrastructure

Creating a LoRA is one of the most efficient ways to customize a Large Language Model (LLM) for your specific use case. With a dataset of roughly 1,000 items, your team is in an excellent position to significantly alter the behavior, tone, or specific formatting capabilities of a base model.

Here is a step-by-step breakdown of the concepts, prerequisites, and the technical workflow—including leveraging your AMD GPU environment and RAG-based evaluation.

1. What is a LoRA?

LoRA stands for Low-Rank Adaptation. It is a Parameter-Efficient Fine-Tuning (PEFT) technique used to train large models without requiring massive computational resources.

When you traditionally fine-tune a model, you update all of its internal parameters (weights). For modern LLMs, this means updating billions of numbers, requiring enormous amounts of GPU memory.

A LoRA takes a different approach:

  • Freezes the Base Model: The original weights of the pre-trained LLM are locked and not changed.
  • Adds Small Adapters: It introduces tiny, low-rank matrices (the "LoRA weights") into specific layers of the model (usually the attention layers).
  • Trains Only the Adapters: During training, only these small, injected matrices are updated.

The result is a tiny file (often just 50MB to 500MB) containing the LoRA weights, which acts like a "patch" or a "lens" placed over the base model to change its behavior.

LoRA vs. RAG (Context and VRAM): It is important to distinguish LoRA from techniques like RAG (Retrieval-Augmented Generation). While RAG injects retrieved external text directly into your prompt—thereby consuming valuable space in the model's context window—LoRA bakes the learned behavior directly into the model via the aforementioned adapters. Because these adapters are extremely lightweight, applying a LoRA does not require significantly more VRAM during inference and, crucially, keeps the full context window available for user interactions.
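The memory savings come from simple arithmetic: fully fine-tuning one d×d weight matrix updates d² parameters, while a rank-r LoRA updates only the two factor matrices B (d×r) and A (r×d). A minimal pure-Python sketch of that bookkeeping (the hidden size and rank below are illustrative numbers, not taken from any specific model):

```python
# Parameters touched by a full update vs. a rank-r LoRA update of one
# d x d weight matrix. Numbers are illustrative.

def full_update_params(d: int) -> int:
    """Parameters updated when fine-tuning the whole matrix W (d x d)."""
    return d * d

def lora_update_params(d: int, r: int) -> int:
    """Parameters in the low-rank factors B (d x r) and A (r x d)."""
    return 2 * d * r

d = 4096   # hidden size in the ballpark of a 7B-8B model
r = 16     # LoRA rank

full = full_update_params(d)      # 16,777,216 parameters
lora = lora_update_params(d, r)   # 131,072 parameters
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
```

At rank 16 the adapter touches 128x fewer parameters per matrix, which is why the saved file is so small.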

2. Prerequisites: The Model and Available Weights

Before training, you need to establish your foundational components:

  • The Base Model: You must select an open-weights model to act as your foundation. Popular choices include Llama-3 (8B or 70B), Mistral, or Qwen.
  • Available Weights: The weights for these base models are typically hosted on platforms like Hugging Face. Your team will need to create a Hugging Face account, accept the model licenses (if applicable, like for Llama), and generate an Access Token to download the weights programmatically.
  • AMD Compute Environment: Because you are using an AMD developer account, your underlying software stack will use ROCm (Radeon Open Compute) instead of NVIDIA's CUDA. ROCm is AMD's platform for GPU-accelerated computing. You must ensure your environment has the ROCm-compatible version of PyTorch installed.
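Concretely, setting up the environment might look like the following shell sketch. The ROCm wheel index URL and version number are assumptions; check the official PyTorch installation page for the index that matches the ROCm release installed on your machine:

```shell
# Install the ROCm build of PyTorch (adjust rocm6.2 to your ROCm release)
pip install torch --index-url https://download.pytorch.org/whl/rocm6.2

# Hugging Face training stack
pip install transformers peft trl datasets

# Authenticate so gated models (e.g. Llama) can be downloaded
huggingface-cli login   # paste your Hugging Face access token

# Sanity check: ROCm PyTorch exposes the GPU through the CUDA-compatible API
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```

Note that on ROCm, PyTorch still reports the GPU through `torch.cuda`; the HIP backend deliberately reuses the CUDA-compatible API surface.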

3. The Training Material (JSON/JSONL Format)

You mentioned having roughly 1,000 training items. This is an ideal size for "Instruction Fine-Tuning"—teaching the model a specific tone, task, or way of answering.

The data should be formatted as JSON Lines (.jsonl), where each line is a valid JSON object representing a single conversation or task. The standard format used by most modern training libraries (like Hugging Face's TRL) is the conversational or "ChatML" format:

{"messages": [{"role": "system", "content": "You are a helpful customer support AI."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "To reset your password, navigate to the settings page..."}]}
{"messages": [{"role": "system", "content": "You are a helpful customer support AI."}, {"role": "user", "content": "Where is my invoice?"}, {"role": "assistant", "content": "You can find your invoices under the 'Billing' tab in your dashboard."}]}

Note: With 1,000 high-quality examples, you are aiming for quality over quantity. Ensure there are no formatting errors, typos, or incorrect answers in your JSON dataset.

4. The Creation Process (On AMD GPUs)

Here is a high-level overview of the training script execution:

  1. Environment Setup: Install the ROCm version of PyTorch, along with Hugging Face libraries: transformers, peft (Parameter-Efficient Fine-Tuning), trl (Transformer Reinforcement Learning), and datasets.
  2. Load the Base Model: Load your chosen model from Hugging Face into your AMD GPU memory. You will typically load it in 16-bit precision (bfloat16) to save memory while maintaining speed on AMD Instinct GPUs.
  3. Configure the LoRA: You will define the LoRA configuration using the peft library. Key parameters include:
    • r (Rank): Typically set to 8, 16, or 32. This defines the "size" and learning capacity of your LoRA.
    • lora_alpha: Usually set to 2x the rank. This scales the LoRA's influence.
    • target_modules: Which parts of the neural network to attach the LoRA to (usually ["q_proj", "v_proj"] or all linear layers).
  4. Run the Trainer: Using the SFTTrainer (Supervised Fine-Tuning Trainer), you pass in your base model, your LoRA configuration, and your JSON dataset. The trainer handles the batching and updates. Training 1,000 examples over 3 epochs on a modern AMD GPU (like an MI250 or MI300) will likely take less than an hour.

5. Evaluation Using RAG

Fine-tuning teaches a model how to answer (style and behavior), but it is generally poor at teaching new facts. To evaluate if your LoRA improved the model without causing it to hallucinate or slow down, you can use a Retrieval-Augmented Generation (RAG) pipeline combined with an "LLM-as-a-Judge" framework (like Ragas or TruLens).

In a rigorous evaluation, you will have your evaluating model (the Judge) compare two distinct setups:

  • The Baseline Model (or Reference Model): This is your original, un-fine-tuned base model hooked up to the RAG pipeline.
  • The Candidate Model (or Test Model): This is your newly trained LoRA combined with the base model, hooked up to the exact same RAG pipeline.

The RAG Evaluation Workflow:

  1. The Test Set: Keep ~100 of your 1,000 items separate as a test set.
  2. Execution: Ask both the Baseline Model and the Candidate Model a question from the test set. The RAG system retrieves the relevant factual document from your database and feeds it to both models to formulate their respective answers.
  3. The Judge: Feed both answers, the original user question, and the retrieved factual document to a stronger "Judge" model (like GPT-4, Claude, or a larger Llama-3-70B model).
  4. Scoring Quality: The Judge model evaluates the Candidate against the Baseline based on specific qualitative metrics:
    • Faithfulness: Did the LoRA hallucinate, or did it stick strictly to the retrieved RAG documents just as well as (or better than) the Baseline?
    • Answer Relevance: Did the LoRA actually answer the prompt in the style you trained it on better than the Baseline?
  5. Scoring Performance (Tokens/Second): Alongside the Judge's qualitative score, use system monitoring to compare the generation speed (tokens per second) of both setups. Applying a LoRA adapter adds a small amount of compute; this step ensures the Candidate Model's latency remains acceptable for production use compared to the bare Baseline Model.
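The performance check in step 5 needs nothing more than a timer around generation. A small stdlib harness that works with any generation callable returning a token count (the `fake_generate` stand-in below is purely illustrative; in practice you would wrap a call to your inference endpoint):

```python
import time

def tokens_per_second(generate, prompt: str) -> tuple[int, float]:
    """Time one call to generate(prompt), which must return the number of
    tokens it produced. Returns (token_count, tokens_per_second)."""
    start = time.perf_counter()
    n_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return n_tokens, n_tokens / elapsed

# Usage with a stand-in generator (replace with your Baseline/Candidate endpoints):
def fake_generate(prompt: str) -> int:
    time.sleep(0.01)   # pretend to decode
    return 50          # pretend 50 tokens were produced

n, tps = tokens_per_second(fake_generate, "How do I reset my password?")
print(f"{n} tokens at {tps:.0f} tok/s")
```

Run the same prompts through both setups and compare the averaged tokens-per-second figures alongside the Judge's qualitative scores.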

6. Combining and Hosting the Finished Model

Once training and evaluation are complete, you have two sets of weights: the massive Base Model (e.g., 15GB) and the tiny LoRA adapter (e.g., 100MB).

Step 1: Merging (Combining)

While you can load the two sets of weights separately at runtime, for production hosting it is generally simpler and slightly faster to "merge" them. Using a Python script, you load the base model, apply the LoRA, and call model.merge_and_unload(). This mathematically bakes your LoRA weights permanently into the base model's weights. You then save this new, combined model to your disk.
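A merge script is short; here is a hedged sketch (the model id and directory names are placeholders, and the imports are deferred into the function because the merge requires transformers and peft to be installed):

```python
def merge_lora(base_model_id: str, adapter_dir: str, out_dir: str) -> None:
    """Load the base model, apply the LoRA adapter, bake it in, and save the result."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base, adapter_dir)   # attach the adapter
    merged = model.merge_and_unload()                      # fold the low-rank update into the weights
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(out_dir)  # ship the tokenizer too

# merge_lora("meta-llama/Meta-Llama-3-8B-Instruct", "lora-out", "merged-model")
```

The output directory then looks like an ordinary standalone model and can be handed directly to an inference engine.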

Step 2: Hosting

To host this newly combined model so your applications can talk to it via an API (like OpenAI's API format), you will use an inference engine.

  • vLLM or Text Generation Inference (TGI) are the industry standards.
  • Both frameworks have excellent, native support for AMD ROCm.
  • You will launch the vLLM server on your AMD machine, pointing it at your merged model folder. It will expose a local endpoint (e.g., http://localhost:8000/v1/chat/completions) that your RAG applications, web apps, or team members can query directly.
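Concretely, launching and querying the server could look like this sketch. The folder name and port are placeholders, and the exact launch command differs between vLLM versions (newer releases also provide a `vllm serve` entrypoint), so check the vLLM documentation for your release:

```shell
# Launch an OpenAI-compatible server on the merged model
python -m vllm.entrypoints.openai.api_server \
    --model ./merged-model \
    --port 8000

# Query it using the OpenAI chat-completions format
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "./merged-model",
         "messages": [{"role": "user", "content": "How do I reset my password?"}]}'
```

Any client library that speaks the OpenAI API (including your RAG evaluation harness from Section 5) can point at this endpoint unchanged.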