LLMs have become central to modern AI workflows. However, adapting them to specific business needs often sounds like a task reserved for companies with deep pockets and large engineering teams.
We wanted to test that assumption.
In this blog, we’ll walk through our internal fine-tuning experiment, which was designed to test how far you can get with open-source models, limited infrastructure, and a structured, two-phase approach.
While not intended for production, this process helped us evaluate what’s realistically achievable before investing in a real-world implementation.
From local experimentation with lightweight models to scalable fine-tuning using LoRA and cloud-based GPUs, our goal was to answer a simple question: how much does it cost to fine-tune an LLM, and can you adapt one efficiently without overspending?
What follows is a breakdown of the tools, infrastructure, challenges, and results we encountered along the way.
Note: This information is based on our whitepaper on LLM Adaptation: Fine-tuning as a strategy for medium scale projects.
Phase I: Starting Small – Fine-Tuning a Lightweight Model Locally
We began our LLM adaptation process by working with a smaller, open-source model that could be fine-tuned using local hardware.
The goal was to validate the approach, understand the tooling, and identify baseline limitations before committing to larger models or expensive infrastructure.
Tools and Frameworks Used
To manage this initial phase, we used:
- PyTorch: Our deep learning framework of choice, with CUDA support
- Transformers (Hugging Face): For loading and working with pretrained models
- Accelerate: To simplify distributed training
- Scikit-learn: Mainly for dataset splitting
- Pandas: For data processing
- Datasets (Hugging Face): For data loading and formatting
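To illustrate how these pieces fit together, here is a minimal data-preparation sketch; the inline example rows and column names are placeholders for illustration, not our actual dataset:

```python
# prepare_data.py – minimal sketch of splitting and formatting a fine-tuning dataset.
# The inline rows below are placeholders; in practice this would be pd.read_csv(...).
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset

df = pd.DataFrame({
    "question": ["When was Scopic founded, and by whom?", "What is LoRA?"],
    "answer": ["Scopic was founded in 2006 by Tim Burr.",
               "LoRA is a parameter-efficient fine-tuning method."],
})

# Hold out a validation split with scikit-learn
train_df, val_df = train_test_split(df, test_size=0.5, random_state=42)

# Convert to Hugging Face Datasets for use with Transformers
train_ds = Dataset.from_pandas(train_df.reset_index(drop=True))
val_ds = Dataset.from_pandas(val_df.reset_index(drop=True))
print(f"{len(train_ds)} training / {len(val_ds)} validation examples")
```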
Local Infrastructure
All training was performed on a modest local machine with:
- 32GB of RAM
- 2x GeForce GTX 1070 GPUs (8GB each)
Scripted Process
We developed three key Python scripts:
- test_model.py: To assess the baseline model without any training
- train.py: To run the actual fine-tuning using a dataset
- inference.py: To test the model’s behavior after fine-tuning
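As a reference point, here is a stripped-down sketch of what a train.py along these lines can look like, assuming a small seq2seq model such as FLAN-T5 (the family we mention later in this post); the toy dataset, hyperparameters, and model size are illustrative assumptions, not our exact setup:

```python
# train.py – simplified sketch of the local fine-tuning script.
# Model choice, hyperparameters, and the toy dataset are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tiny in-memory dataset as a stand-in for the real training file
train_ds = Dataset.from_dict({
    "question": ["When was Scopic founded, and by whom?"],
    "answer": ["Scopic was founded in 2006 by Tim Burr."],
})

def tokenize(batch):
    # Questions become encoder inputs, answers become labels
    inputs = tokenizer(batch["question"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["answer"], truncation=True, max_length=128)
    inputs["labels"] = labels["input_ids"]
    return inputs

train_tok = train_ds.map(tokenize, batched=True, remove_columns=train_ds.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="./flan-t5-finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=3e-4,
    logging_steps=10,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("./flan-t5-finetuned")
```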
Our Observations
- The model could be fine-tuned relatively quickly (within a few minutes), but the quality was inconsistent.
- Local hardware was sufficient for experimentation, but unsuitable for scaling.
- Results hinted at learning, but smaller models lacked capacity for reliable generalization.
- We encountered early signs of overfitting when using a small dataset.
This phase helped validate the basic fine-tuning process but also highlighted the limitations of small models and local setups, particularly when trying to encode even a single fact accurately.
Phase II: Fine-Tuning Mid-Sized Models with LoRA and Cloud GPUs
After encountering significant limitations with small models – including overfitting and inconsistent outputs – we transitioned to experimenting with mid-sized open-source LLMs.
This second phase was not about scaling in a production sense, but about assessing whether models like Mistral-7B and Zephyr-7B, when fine-tuned using LoRA, could offer more reliable and coherent results. Given the increased resource demands, we moved from local infrastructure to cloud-based GPUs for both training and inference.
Switching to Cloud-Based Infrastructure
We chose Vast.ai for its flexible pricing and instant access to high-performance GPUs. Other providers like AWS and Vultr required extra verification to access L40S instances.
- Hardware Used: NVIDIA L40S
- Pricing: Approximately $0.60/hour
Choosing the Right Models
We experimented with:
- Mistral-7B-Instruct-v0.1
- Zephyr-7B-beta
These mid-sized models offered a strong balance between baseline performance and adaptability.
Why LoRA?
Rather than performing full fine-tuning (which would require extensive compute and time), we used LoRA to update only specific components of the model. This significantly reduced resource requirements while maintaining effectiveness.
Training with LoRA:
- Took 20–60 minutes for ~200 Q&A pairs
- Required careful adjustment of LoRA parameters (learning rate, rank, dropout, target modules)
- Produced initial signs of learning, though hallucinations were still present in early outputs
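To make these LoRA parameters concrete, here is a minimal configuration sketch using Hugging Face's PEFT library; the rank, alpha, dropout, and target modules shown are common starting points, not our exact settings:

```python
# lora_finetune.py – minimal sketch of attaching a LoRA adapter to a 7B model.
# The values below are illustrative starting points, not our exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,   # fits comfortably on a single L40S (48 GB)
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model
```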
Iteration and Outcome
Our first attempt at broader knowledge adaptation yielded inconsistent results. However, once we simplified the task – training the model to learn a single fact at a time (“Scopic was founded in 2006 by Tim Burr”) – we achieved reliable output using 20+ variations of the fact in different contexts.
LoRA allowed us to iterate quickly and cost-effectively, making it a viable option for teams needing focused customization without the cost of full fine-tuning.
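To put "cost-effectively" into rough numbers using the figures above (about $0.60/hour for an L40S and 20–60 minutes per run), a quick back-of-the-envelope estimate looks like this; the number of iterations is a hypothetical:

```python
# Rough cost estimate for iterating on LoRA runs (figures from this post).
hourly_rate = 0.60          # USD per hour for a rented L40S
run_minutes = (20, 60)      # observed range per LoRA run on ~200 Q&A pairs
runs = 10                   # hypothetical number of iterations

low = hourly_rate * run_minutes[0] / 60 * runs
high = hourly_rate * run_minutes[1] / 60 * runs
print(f"{runs} runs: ${low:.2f} - ${high:.2f}")   # 10 runs: $2.00 - $6.00
```

On these figures, even a generous iteration budget stays in the single-digit-dollar range.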
Data Preparation: Small Dataset, Big Effort
In theory, adapting a model to learn a single fact sounds simple. In practice, it revealed how much data quality and variation influence fine-tuning outcomes, even when the dataset is small.
The Task: Teach the Model One Fact
We focused on getting the model to consistently answer:
“When was Scopic founded, and by whom?”
Initially, we created a limited set of direct statements, but the model’s responses were vague or inaccurate. To improve consistency, we generated over 20 training examples that embedded the same fact in different contexts, formats, and phrasings.
This included:
- Declarative statements
- Q&A pairs
- Sentences with varied word order
- Instances where the fact appeared indirectly or as part of a narrative
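A few illustrative examples of these formats (the wording below is representative, not copied from our actual training file):

```python
# fact_variations.py – illustrative examples of embedding one fact in varied formats.
# The phrasing is representative only; our real file contained 20+ hand-written variants.
training_examples = [
    # Declarative statement
    {"text": "Scopic was founded in 2006 by Tim Burr."},
    # Q&A pair
    {"text": "Question: When was Scopic founded, and by whom?\n"
             "Answer: Scopic was founded in 2006 by Tim Burr."},
    # Varied word order
    {"text": "In 2006, Tim Burr founded Scopic."},
    # Fact appearing indirectly, as part of a narrative sentence
    {"text": "The company, which Tim Burr started in 2006, is called Scopic."},
]
```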
Why This Was Necessary
Without this variation, the model tended to hallucinate, ignore the new knowledge, or revert to generic answers. The experiment reinforced a key insight:
“Successful training requires more than just providing accurate information—the data must be presented in various contexts and formats to enable the model to understand patterns rather than merely memorize phrases.” (Whitepaper, Section 5.1.1)
Overfitting Risks
Small models were particularly prone to overfitting. Without a carefully balanced dataset and controlled training parameters, the model either:
- Memorized the phrasing too narrowly
- Forgot it entirely when asked in a different form
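As an illustration of what controlled training parameters can mean in practice, here is a conservative TrainingArguments sketch; every value is an assumption shown only to indicate the direction of the adjustments, not our actual configuration:

```python
# Conservative training settings to limit overfitting on a tiny dataset.
# All values are illustrative assumptions, not our exact configuration.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./lora-out",
    num_train_epochs=3,              # few passes over the data
    learning_rate=1e-4,              # modest LR; too high memorizes phrasing
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size of 8
    weight_decay=0.01,               # mild regularization
    logging_steps=5,                 # watch training loss closely for divergence
    save_strategy="no",              # keep only the final adapter for quick tests
)
```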
This phase highlighted that data preparation isn’t just a setup step – it’s where much of the real adaptation work happens, especially when working with limited compute and small-scale models.
Lessons Learned from a Resource-Conscious Pipeline
Working with constrained infrastructure and a lean dataset revealed several key insights that shaped our approach – and may help guide others considering LLM adaptation on a budget.
1. Fine-Tuning Small Models Isn't Easier
Fine-tuning lightweight models like FLAN-T5 on local hardware was straightforward from a technical standpoint, but the results didn’t hold up.
Despite shorter training times, smaller models often lacked the capacity to generalize or retain nuanced information, especially when trained on minimal data.
“Small LLMs proved particularly susceptible to overfitting…” (Whitepaper, Section 5.1.3)
2. LoRA Is a Practical Middle Ground
Parameter-efficient fine-tuning with LoRA allowed us to work with stronger models like Mistral-7B using consumer-grade or cloud-based GPUs. It reduced cost and complexity while enabling meaningful adaptation.
“LoRA training requires hours instead of days… deployment complexity is significantly reduced.” (Whitepaper, Section 4.2.4)
3. Data Diversity Matters More Than Size
We found that contextual variety and clarity in the training data matters more than sheer volume. However, the amount of data needed depends on what you’re trying to teach – more complex tasks may still require larger, well-structured datasets.
4. Efficiency Isn’t Just About Compute
True efficiency came from making smart decisions across the board: model size, fine-tuning method, dataset scope, training duration, and infrastructure. Fine-tuning is doable, but only when every part of the process is optimized.
“Its application requires careful consideration of data preparation, training methodology, and economic implications.” (Whitepaper, Section 5.0)
Conclusion: What to Build When You’re Building Lean
If you’re wondering how to fine-tune an LLM model, start by knowing that it is possible, but that doesn’t always make it the right choice.
Our internal experiments showed that while methods like LoRA make model adaptation more accessible, the process still requires careful planning, structured data, and significant compute.
In most cases, especially for small to medium-sized teams, the more practical path is to leverage commercial LLMs paired with RAG. This approach offers flexibility, lower up-front costs, and faster iteration without the burden of managing training infrastructure.
Fine-tuning should be reserved for scenarios where:
- You have domain-specific requirements that can’t be met through prompting or retrieval
- You control sensitive data and need full model ownership
- You have the engineering capacity to maintain and monitor your custom model
Otherwise, commercial LLMs and RAG provide a faster, more efficient way to deploy language intelligence and leave fine-tuning for when it’s truly necessary.
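For completeness, here is a very rough sketch of what the commercial-LLM-plus-RAG path can look like; the TF-IDF retriever and the call_llm placeholder below are deliberate simplifications for illustration, not tool recommendations:

```python
# rag_sketch.py – minimal retrieval-augmented generation flow (illustrative only).
# The TF-IDF retriever and call_llm() stub are hypothetical simplifications.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Scopic was founded in 2006 by Tim Burr.",
    "LoRA fine-tuning updates only a small set of adapter weights.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(question, top_k=1):
    # Rank stored documents by similarity to the question
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_vectors)[0]
    return [documents[i] for i in scores.argsort()[::-1][:top_k]]

def call_llm(prompt):
    # Placeholder: swap in your chosen commercial LLM API call here
    raise NotImplementedError("Plug in a commercial LLM API")

def answer(question):
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)

# answer("When was Scopic founded, and by whom?")
```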
Want the full breakdown of model comparisons, GPU pricing, training results, and takeaways from both phases? Read the full whitepaper, or book a free consultation to see how we can support your LLM strategy.
About Creating the “How to Fine-Tune an LLM Without Big Tech Resources” Guide
This guide was authored by Angel Poghosyan and reviewed by Mladen Lazic, Chief Operations Officer at Scopic.
Scopic provides high-quality, informative content, powered by our deep-rooted expertise in software development. Our team of content writers and experts has in-depth knowledge of the latest software technologies, allowing them to break down even the most complex topics in the field. They also know how to tackle topics from a wide range of industries, capture their essence, and deliver valuable content across all digital platforms.
Note: This blog’s images are sourced from Freepik.