Everyone is using GPT-4 or Claude for everything right now, and honestly for a lot of tasks that makes sense. The APIs are fast, the models are capable, and the marginal cost is low enough that it is hard to justify doing more.
But there are cases where it is not the right answer. I have been working in environments where data cannot leave the network, where latency requirements are tight, or where the task is specific enough that a large general model is genuinely overkill. That is where fine-tuning smaller models becomes interesting.
Here is what I have actually learned from doing it.
Start with the question: does this actually need fine-tuning?
This sounds obvious but it is easy to skip. Before you touch any training code, try to solve your problem with prompting alone. System prompts, few-shot examples, structured output constraints. A lot of the time that is enough, and you save yourself weeks of work.
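To make "prompting alone" concrete, here is a minimal sketch of the kind of thing to try first: a task description plus a handful of few-shot examples assembled into one prompt. The extraction task and example data are hypothetical placeholders, not from any real project.

```python
# Hypothetical few-shot prompt for a structured-extraction task.
# Try variations of this before touching any training code.
FEW_SHOT_EXAMPLES = [
    {"input": "Invoice #4412, net 30, $1,200",
     "output": '{"type": "invoice", "terms": "net30", "amount": 1200}'},
    {"input": "PO 7731 for 40 units @ $15",
     "output": '{"type": "po", "units": 40, "unit_price": 15}'},
]

def build_prompt(task_description: str, user_input: str) -> str:
    """Assemble system instructions + few-shot examples + the new input."""
    parts = [task_description]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Input: {ex['input']}\nOutput: {ex['output']}")
    parts.append(f"Input: {user_input}\nOutput:")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Extract structured fields from the document snippet as JSON.",
    "Invoice #9902, net 60, $480",
)
```

If a prompt like this, iterated on for a day, gets you to acceptable quality, you are done and you saved weeks.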
Fine-tuning makes sense when the domain is very specific and the base model keeps making the same category of mistake, when you need a consistent output format that prompting alone cannot reliably produce, or when you are running inference at a scale where a smaller fine-tuned model is significantly cheaper than API calls.
It does not make sense when your prompting is just sloppy and a better prompt would fix it.
Data is 80% of the problem
I cannot stress this enough. The model quality is almost entirely a function of data quality. I have spent more time cleaning and curating training data than I have spent on anything else in these projects.
The mistakes I made early on: using too much data that was only loosely related to the task, not being consistent about the output format in the training examples, and including examples where the correct answer was ambiguous. All of these show up in the fine-tuned model as exactly the kind of inconsistency you were trying to eliminate.
What works: 500 to 2000 high-quality, unambiguous examples of exactly the input-output behavior you want. Clean is better than large. Consistent format in every single example. If you would not be confident giving that example to a human to learn from, do not include it in your training set.
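Those curation rules are mechanical enough to enforce in code. Here is a sketch of the kind of filter I mean; the field names and the specific checks (valid JSON outputs, no empty pairs, no duplicate inputs) are illustrative, and you would swap in whatever consistency rules match your own format.

```python
import json

def validate_examples(raw_examples):
    """Keep only complete, consistently formatted, unambiguous examples.
    Field names and checks are illustrative, not a standard."""
    seen = set()
    clean = []
    for ex in raw_examples:
        text_in = ex.get("input", "").strip()
        text_out = ex.get("output", "").strip()
        if not text_in or not text_out:
            continue  # drop incomplete pairs
        try:
            json.loads(text_out)  # enforce one output format (JSON here)
        except json.JSONDecodeError:
            continue
        if text_in.lower() in seen:
            continue  # duplicate inputs often signal ambiguous labels
        seen.add(text_in.lower())
        clean.append({"input": text_in, "output": text_out})
    return clean

raw = [
    {"input": "Invoice #4412, net 30", "output": '{"terms": "net30"}'},
    {"input": "Invoice #4412, net 30", "output": '{"terms": "net60"}'},  # conflicting duplicate
    {"input": "PO 7731", "output": "forty units"},                       # inconsistent format
    {"input": "", "output": '{"terms": "net15"}'},                       # incomplete pair
]
clean = validate_examples(raw)
```

A filter like this will not catch subtle ambiguity, but it catches the boring majority of bad examples before they poison a training run.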
Choosing the base model
The choice of base model matters a lot for domain-specific work. A model pre-trained on a lot of code will fine-tune differently than one pre-trained on general text. If your task involves structured data or technical content, starting from a model that has seen a lot of that type of content means you need less fine-tuning data to get good results.
For the kinds of tasks I work on, I have had good results with models in the 7B to 13B parameter range. They are small enough to run on reasonable hardware, large enough to handle nuanced tasks, and the fine-tuning compute is manageable. Anything smaller and you start hitting capability ceilings that fine-tuning cannot fix.
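Part of why the compute is manageable at this scale is that you rarely fine-tune all the weights. With a low-rank adapter approach like LoRA, the trainable parameter count is tiny compared to the full model. The arithmetic below is back-of-envelope: the dimensions are roughly a 7B-class decoder, and the assumption that you adapt four square attention projections per layer is a simplification (real models often have smaller key/value projections).

```python
def lora_trainable_params(n_layers: int, d_model: int, rank: int,
                          matrices_per_layer: int = 4) -> int:
    """Back-of-envelope LoRA parameter count.
    Each adapted d x d weight matrix gets two low-rank factors
    (r x d and d x r), so 2 * d * r trainable params per matrix."""
    return n_layers * matrices_per_layer * 2 * d_model * rank

# Illustrative 7B-ish shape: 32 layers, hidden size 4096, rank 16.
trainable = lora_trainable_params(n_layers=32, d_model=4096, rank=16)
full = 7_000_000_000
print(f"{trainable:,} trainable params ({trainable / full:.3%} of the full model)")
```

The point is the order of magnitude: tens of millions of trainable parameters instead of billions, which is what makes single-machine fine-tuning runs realistic.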
The evaluation problem
How do you know if your fine-tuned model is actually better? This is harder than it sounds.
Automatic metrics like BLEU or perplexity are easy to compute but often do not correlate with what you actually care about. I have had models with great automatic metrics that failed completely on real-world inputs.
What I do instead: build a test set of 50 to 100 examples that represent the real distribution of inputs you will see in production. Evaluate the base model on them. Fine-tune. Evaluate again. Look at the failures manually. The failures will tell you more than any aggregate metric.
Also: always include some out-of-distribution examples in your test set. Fine-tuned models can become fragile. A model that performs perfectly on in-distribution inputs but breaks completely on slight variations is not something you want in production.
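The whole evaluation loop fits in a few lines. This sketch uses exact-match scoring, which only suits tasks with a single correct answer; the stub "models" and the tiny test set are stand-ins so the shape of the comparison is visible.

```python
def evaluate(model_fn, test_set):
    """Run any callable (input -> output) over a test set.
    Returns accuracy plus the raw failures for manual review,
    which is where the real signal is."""
    failures = []
    for ex in test_set:
        got = model_fn(ex["input"])
        if got != ex["expected"]:
            failures.append({"input": ex["input"],
                             "expected": ex["expected"],
                             "got": got})
    accuracy = 1 - len(failures) / len(test_set)
    return accuracy, failures

# Hypothetical test set mixing in- and out-of-distribution inputs.
test_set = [
    {"input": "2+2", "expected": "4", "tag": "in_distribution"},
    {"input": "two plus two", "expected": "4", "tag": "out_of_distribution"},
]

# Stand-in for a model: brittle on anything off-distribution.
base_model = lambda x: "4" if x == "2+2" else "unsure"
acc, failures = evaluate(base_model, test_set)
```

Run this once against the base model, once against the fine-tuned one, and diff the failure lists by tag. If the fine-tuned model wins in-distribution but picks up new out-of-distribution failures, you have found the fragility before production did.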
Deployment reality
Fine-tuned models need infrastructure that a simple API call does not. You need to think about where the model runs, how it scales, how you update it when you have more training data, and how you monitor its outputs in production.
For most of my deployments I have used quantized versions of the fine-tuned model to reduce memory requirements, served with a lightweight inference server. This gets you to a point where a single reasonably sized machine can handle real production load for many business applications.
But be honest about the maintenance burden. A fine-tuned model is a piece of software that needs to be updated, monitored, and occasionally retrained. If your team does not have the bandwidth for that, a well-configured API call might genuinely be the better answer.
When it is worth it
After doing this a few times, here is my honest assessment of when fine-tuning a smaller model is worth the effort:
It is worth it when the task is narrow and well-defined, when you have or can create high-quality training data, when you need on-premise deployment for data privacy reasons, or when you are at a scale where inference costs are a real factor.
It is probably not worth it when the task is broad or requires general reasoning, when data is scarce or hard to label, or when you are doing this for the first time and have not validated that prompting alone cannot solve the problem.
Start simple. Add complexity only when you can measure that it is helping.