I think it is better to write these down and ask people for their thoughts.
Questions
Overall taxonomy
Whether to even fine-tune
Should you fine-tune LLMs?
- https://www.quora.com/Should-you-fine-tune-an-LLM-or-just-do-prompt-engineering/answer/Tong-Hui-Kang-1
- You should always start with the simplest solution.
- The simplest solution is prompt engineering. See these arguments for simplicity.
- By shipping, you get to see what prompt engineering can achieve. You also have access to data that could be used to fine-tune your model.
- Fine-tuned models may be cheaper for the same performance.
    - There are claims that you can achieve GPT-4 performance with a 7B model, which can be served cheaply. However, the benchmark you now need to beat is Claude 3 Haiku, at $0.25/$1.25 per million input/output tokens.
    - General-purpose LLMs are only getting cheaper over time, whereas the cost to serve your own fine-tuned model stays the same.
    - General-purpose LLMs are easier to serve at scale because providers can batch requests across many users.
- You should also consider the cost of shipping a solution with fine-tuning.
    - You also need to account for engineering time. If you spend one engineer-month fine-tuning something to reduce a yearly cost from $500 to $200, are you really saving money?
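The engineer-month question above is just break-even arithmetic. A minimal sketch, with entirely hypothetical numbers (the $15,000 fully loaded engineer-month is an assumption, not from the source):

```python
# Back-of-the-envelope break-even check (all numbers hypothetical).
# Fine-tuning saves (500 - 200) = $300/year in serving cost,
# but costs one engineer-month up front.

def breakeven_years(engineering_cost, yearly_saving):
    """Years of savings needed to recoup the one-off engineering cost."""
    return engineering_cost / yearly_saving

# Assume a fully loaded engineer-month of $15,000 (hypothetical).
years = breakeven_years(engineering_cost=15_000, yearly_saving=500 - 200)
print(f"{years:.0f} years to break even")  # 50 years
```

At these numbers the project never pays for itself; the calculation only flips when the yearly saving is a meaningful fraction of the engineering cost.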
- Fine-tuning may not solve your problem if prompt engineering can’t.
- If you cannot prompt engineer your way to generate a good response at least 10% of the time, fine-tuning will not save you.
    - Fine-tuning mostly changes the style of the response. It does not increase the model's capabilities or truly give it new knowledge.
- Try squeezing performance out of prompt engineering and function calling first.
    - You might not be doing prompt engineering correctly.
    - You can provide feedback to a general-purpose LLM. If you want to build a chatbot that writes a certain flavor of SQL, just execute the SQL and feed the error back to the chatbot. General-purpose LLMs often incorporate such feedback well.
- There are many pitfalls that come along with fine-tuning.
- For prompt engineering, it is obvious if it works. If it does not work, change the prompt.
- For fine-tuning, if it does not work, it is not easy for a beginner to know what went wrong. There are many places where it could go wrong.
- Collecting a fine-tuning dataset is hard.
- Collecting a good fine-tuning dataset is not easy: if writing one good 100-word prompt is hard, collecting 100 good examples is even harder.