Alright, want to go on a Friday morning journey and learn a bit about LLM fine tuning, but written by an actual human (me)? Here’s what I’ve been doing, come join me, in my own words (no LLMs used for this post!)…
Hunting down a local model
So we’ve been on the hunt for a model that works well, does what it needs to do and ticks all the right boxes. After some extensive testing, we’ve narrowed it down to the Phi-4-mini model. Now Microsoft has already done the “training” part of it (thank you!), it knows how to write, knows English and other languages, knows formatting and punctuation etc, but it’s just a bit generic. It doesn’t know the style that we need it to write in.
Full fine tuning versus LoRA fine tuning
Now we could go down the route of fully fine tuning the whole model, changing the internals (the weights) of how it works. It can be pretty powerful, but it’s going to be slow, expensive (I don’t have that kind of money or hardware) and we might make it too good at one thing and worse at everything else, which means we end up back at square one.
So in comes something called LoRA (Low-Rank Adaptation)… Think of it like we’re giving the model (Phi-4-mini) a little mathematical cheat sheet that it can use. This is the LoRA Adapter. We can also set how big we want the cheat sheet to be (the Rank - and as the saying goes, bigger isn’t always better) and this gives us a much quicker way to adapt the model to what we want.
Currently this cheat sheet is functionally blank, it’s got nothing useful on it and isn’t doing anything for us.
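To make the cheat sheet idea a bit more concrete, here's a rough sketch in plain Python (no ML libraries). It assumes the usual LoRA setup: the model's weight matrix W stays frozen, and the adapter is two small matrices, A and B, whose product gets added on top. B starts as all zeros, which is why the cheat sheet is functionally blank to begin with. The sizes, names and the `scale` parameter are made up for illustration and are nothing like Phi-4-mini's real dimensions.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def make_lora(d_out, d_in, rank, seed=0):
    """Create a blank adapter: A is small random noise, B is all zeros."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # rank x d_in
    B = [[0.0] * rank for _ in range(d_out)]                                # d_out x rank
    return A, B

def effective_weights(W, A, B, scale=1.0):
    """Frozen weights plus the adapter's contribution: W + scale * (B @ A)."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# Toy sizes, purely illustrative (assumption, not the real model's shape)
d_out, d_in, rank = 6, 8, 2
rng = random.Random(1)
W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
A, B = make_lora(d_out, d_in, rank)

# With B still all zeros, the adapter changes nothing
unchanged = effective_weights(W, A, B) == W

# Numbers a full fine tune would touch versus numbers in the cheat sheet
full_params = d_out * d_in
lora_params = rank * (d_in + d_out)
```

The rank controls how many numbers live in the cheat sheet. At toy sizes the saving looks small, but for a hypothetical 4096 x 4096 layer with a rank of 16 it would be 16,777,216 numbers for the full matrix against 131,072 for the adapter.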
What we need to pull together
We need to get some notes in here so that the model leans towards the style and output we want, we’re not talking about the prompt you put in, we’re talking about some complex maths. For this stuff, we need data.
So off on a data hunt we go, we need to gather source material (actual transcripts of meetings, recordings, podcasts, news clips, public broadcasting etc - make sure you’re not breaking any copyright!) and then a “gold standard” of what a good meeting document will look like. We’re talking headings, tone, style, tense, formatting, how it interprets information and more, for every. single. one… Quite intensive stuff to put together, and we’re going to need a lot of it to make any meaningful difference.
This is actually the hard (and most boring - for me) part. We can have a small number of really good examples, but we need variation, and it needs to be good otherwise we’re just going to end up going backwards.
So what we’ve got so far:
- The model - Phi-4-mini
- The LoRA Adapter - The cheat sheet of what output we want the model to lean towards (currently blank)
- The LoRA rank value - The size of the cheat sheet
- The data - All our transcripts, with the nicely formatted “gold standard” meeting notes for each one
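As a minimal sketch of what one of these pairs could look like on disk, assuming a JSON Lines file with one pair per line: the field names `transcript` and `gold_notes` and the text inside them are hypothetical placeholders, not a required format.

```python
import json

# Placeholder pair: field names and text are hypothetical, for illustration only
pairs = [
    {
        "transcript": "Alice: Shall we move the release to Friday? Bob: Yes, Friday works.",
        "gold_notes": "Meeting Summary:\n- The release was moved to Friday.",
    },
]

def to_jsonl(records):
    """Serialise each pair as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)

def from_jsonl(text):
    """Parse JSON Lines text back into a list of pairs, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

jsonl_text = to_jsonl(pairs)
loaded = from_jsonl(jsonl_text)
```

Keeping each pair on a single line makes it easy to count, shuffle and split the examples later on.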
Come on, give me good output
Now what do we do with it? We just need to tell it to output like these awesome meeting documents right? But what does that actually look like? How do you “tell a model” to output something differently? So what we do is:
- Give our training system the transcript
- Pair the transcript with the “gold standard” meeting document we want
- The training system then checks how likely the model is to produce the “gold standard” document, doing it bit by bit, token by token
- When the model gives us a low probability that we’ll get the words (tokens) we want, the training process updates the LoRA Adapter (the little cheat sheet we have) to nudge the model towards the right output
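The "checks how likely" step above is commonly scored as the average negative log probability of the gold tokens (often called cross-entropy loss). Here's a small sketch of that idea. The probabilities are made up, and the function name is just for illustration.

```python
import math

def gold_document_loss(gold_token_probs):
    """Average negative log probability the model gave to each gold token."""
    return -sum(math.log(p) for p in gold_token_probs) / len(gold_token_probs)

# Made-up probabilities the model assigned to each token of a gold document
confident = [0.9, 0.8, 0.95, 0.85]  # model already likes the gold tokens
unsure = [0.10, 0.3, 0.2, 0.25]     # model thinks the gold tokens are unlikely

loss_confident = gold_document_loss(confident)
loss_unsure = gold_document_loss(unsure)
```

The lower the probabilities on the tokens we wanted, the bigger the loss, and the bigger the loss, the harder the training process pushes on the adapter.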
Now that final bit, “nudging the model”, is the important bit and where the magic is happening, but we have to dig into understanding how an LLM works a bit more to get why. I hope your beverage of choice is kicking in!
How an LLM picks the next word (roughly)
When ChatGPT or Claude or any LLM that you’ve heard of is generating output, it’s not really thinking “I know the answer to this”, it just does some very fancy maths to work out what the next word (tokens) might be.
So let’s take a hypothetical example, and we’re simplifying it a bit, but if we pass an LLM the sentence “the sun is”, what are the probabilities of the next word? Obviously there could be any number of words next, but let’s take the top 5 and put the probability % that they would come up next:
- “shining” - 35%
- “bright” - 25%
- “hot” - 15%
- “setting” - 10%
- “rising” - 5%
I’ve made up the percentage values above, but that doesn’t matter.
So now in this example the model looks at the sentence “the sun is”, then says to itself “Based on everything I’ve seen before, and everything I know about, these are the most likely to come next (the list above). Because ‘shining’ has the highest percentage value, the next word is probably ‘shining’”.
Now, it doesn’t just pick the top one, there’s quite a few other things at play, you may have heard of things such as “temperature” or “top-p” or even “beam search” (if you’re into this stuff) before and we’ll dive into some of those in another post, but this is fundamentally it. It’s just a game of probability.
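As a rough sketch of that game of probability, here's the made-up "the sun is" list in plain Python, with the leftover 10% lumped into a hypothetical `<other>` bucket. It shows always picking the top word, and sampling with a temperature. Raising each probability to the power of 1/temperature and renormalising is the same as dividing the model's raw scores by the temperature, which is how temperature is normally applied.

```python
import random

# The made-up numbers from the list above; "<other>" stands in for every other word
next_word_probs = {
    "shining": 0.35, "bright": 0.25, "hot": 0.15,
    "setting": 0.10, "rising": 0.05, "<other>": 0.10,
}

def apply_temperature(probs, temperature):
    """Rescale probabilities: below 1.0 sharpens them, above 1.0 flattens them."""
    scaled = {word: p ** (1.0 / temperature) for word, p in probs.items()}
    total = sum(scaled.values())
    return {word: s / total for word, s in scaled.items()}

def greedy_pick(probs):
    """Always take the single most likely word."""
    return max(probs, key=probs.get)

def sample_pick(probs, temperature=1.0, seed=None):
    """Draw one word at random, weighted by the temperature-adjusted probabilities."""
    rng = random.Random(seed)
    adjusted = apply_temperature(probs, temperature)
    words = list(adjusted)
    return rng.choices(words, weights=[adjusted[w] for w in words], k=1)[0]

top = greedy_pick(next_word_probs)              # the highest probability word
cold = apply_temperature(next_word_probs, 0.5)  # "shining" gets a bigger share
hot = apply_temperature(next_word_probs, 2.0)   # the gap between words shrinks
```

A low temperature makes the output more predictable, a high one makes it more varied.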
Nudging it towards our format
So let’s bring this back to our meeting notes and training. If the gold standard document always starts with “Meeting Summary:”, how is the model meant to know that? Maybe our prompt says something like “Always start with ‘Meeting Summary:’ in your response, write in the 3rd person, ignore filler conversation” but we’re just hoping that our prompt has affected the model enough that it will start with that, we have no guarantee it will. It’s still going to look at everything it’s been given (the transcript and our prompt which is defining our format and layout) and go:
“Based on what I’ve seen so far, what is the next most likely word (tokens)?”
And it might come up with something like:
- “Summary” - 45%
- “Notes” - 25%
- “Review” - 12%
- “Meeting” - 10%
- “Actions” - 8%
But in our gold standard, the next word we actually wanted was “Meeting” and this only has a 10% probability. Not good.
So the training system we’re running pretty much goes “We wanted ‘Meeting’ to be more likely here, but the model didn’t give it enough probability. What tiny mathematical changes can we make to the LoRA adapter so that next time we see something similar, the word ‘Meeting’ has a better chance?”
This isn’t changing the original internal maths the model is running, but it is giving it some additional maths to use in that little cheat sheet that we’ve given it. It uses this whilst calculating the next word (tokens) probability. So maybe after one tiny adjustment it looks something like:
- “Summary” - 40%
- “Notes” - 22%
- “Review” - 10%
- “Meeting” - 18%
- “Actions” - 8%
We had an 8 percentage point increase (10% up to 18%), success! In reality we don’t have a table like this, but it should help give you an idea.
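Here's a simplified sketch of a single nudge, using the made-up starting percentages from the first table. One big assumption to flag: real LoRA training changes the numbers inside the adapter and the probabilities shift as a result, whereas this toy version nudges the word scores (logits) directly so the direction of the change is easy to see. The learning rate of 0.5 is an arbitrary illustrative value.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits.values())
    exps = {word: math.exp(z - highest) for word, z in logits.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

def nudge(logits, target, learning_rate):
    """One gradient descent step on -log p(target) with respect to the logits.

    The gradient for each word is (its probability - 1 if it is the target, else 0),
    so the target's score goes up and every other score goes down a little.
    """
    probs = softmax(logits)
    return {
        word: z - learning_rate * (probs[word] - (1.0 if word == target else 0.0))
        for word, z in logits.items()
    }

before = {"Summary": 0.45, "Notes": 0.25, "Review": 0.12, "Meeting": 0.10, "Actions": 0.08}
logits = {word: math.log(p) for word, p in before.items()}

after = softmax(nudge(logits, target="Meeting", learning_rate=0.5))
```

With these numbers “Meeting” climbs from 10% to roughly 17% after one step, while every other word gives up a bit of probability to pay for it.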
Around and around we go
But it’s not finished there, it goes through and does this again for the next word, then again, and again and again etc… Remember our “gold standard” document isn’t just one word, it’s the whole meeting document. The tone, headings, structure, style of action points, what to leave out and everything else in between, nudging the output ever so slightly with those notes it’s put in that little cheat sheet.
And then we do exactly the same to the next transcript and gold document pair, and the next, and the next etc, etc, until we’ve completed one “epoch” and run through them all once.
Then we go all over again and run through them all, slightly nudging the numbers, altering the maths in our little cheat sheet and going around and around again.
Now you don’t want to go on forever. Overdo it and the model becomes too aligned to the data you’re training on (this is called overfitting), and you have to stop somewhere anyway. So you keep some of the transcript and gold standard meeting document pairs out of the training set. We then use these to test the model each time we’ve done a round of nudging, to see how close it gets to the output we want on unseen meetings and to check it isn’t just memorising the training examples we’ve been giving it.
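A minimal sketch of that idea, assuming a simple random split and a "stop when the held-out score stops improving" rule (often called early stopping). The 20% holdout, the patience of 2 and the loss numbers are all made up for illustration.

```python
import random

def split_pairs(pairs, holdout_fraction=0.2, seed=0):
    """Shuffle the pairs and keep a slice back that training never sees."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

def best_epoch(validation_losses, patience=2):
    """Return the 1-based epoch with the lowest held-out loss, stopping early
    once it has failed to improve for `patience` epochs in a row."""
    best, best_at, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best:
            best, best_at, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_at

# 50 stand-in pair IDs; real pairs would be transcript and gold document pairs
train, holdout = split_pairs(range(50))

# Made-up held-out losses, one per epoch: improving, then getting worse
stop_at = best_epoch([1.9, 1.4, 1.1, 1.0, 1.05, 1.2])
```

When the held-out loss starts creeping back up while the training loss keeps falling, that's the usual sign the model has moved from learning the style to memorising the examples.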
After the training run
For this example we’ll run through our whole data set 6 times (epochs). Now let’s look at what we’ve got:
- The model - Phi-4-mini
- The LoRA Adapter - The cheat sheet full to the brim of maths that helps the model lean towards the output we want
- The LoRA rank value - The size of the cheat sheet
- The data - All our transcripts and gold standard meeting notes. Each one used 6 times during the fine tuning run.
Now when we want to use our Phi-4-mini model, we run the model with the LoRA Adapter (our little cheat sheet with all the fancy maths in it) attached to help guide the model to the output we want.
Whew! That was a lot of work! Now it’s time to run it on some actual meetings and see what kind of output we get.
Then it’s time to fill the coffee (well, tea for me) cup back up, get more data and keep on tuning.
I hope you found this interesting (and fun) to read, I had a lot more fun writing it out than I thought I would! I’ve simplified and hand waved a few things in this explanation, and LoRA fine tuning isn’t the only piece of the puzzle, there’s a lot of other cool things that have to go on behind the scenes, but that’s for another time.
Version 10 of our fine tuned model will be due for release soon, it’s been churning away in the background. Sign up below to get notified of new releases!
Cheers, Tobias