I think it is better to write these down and ask people for their thoughts.
Questions
Overall taxonomy
Whether to even fine-tune
Should you fine-tune LLMs?
- https://www.quora.com/Should-you-fine-tune-an-LLM-or-just-do-prompt-engineering/answer/Tong-Hui-Kang-1
- You should always start with the simplest solution.
- The simplest solution is prompt engineering. See these arguments for simplicity.
- By shipping, you get to see what prompt engineering can achieve. You also have access to data that could be used to fine-tune your model.
- Fine-tuned models may be cheaper for the same performance.
    - There are claims that you can achieve GPT-4 performance with a 7B model, which can be served cheaply. However, the benchmark you now need to beat is Claude 3 Haiku, at $0.25/$1.25 per million input/output tokens.
    - General-purpose LLMs are only getting cheaper over time, whereas the cost to serve your own fine-tuned model stays the same.
    - General-purpose LLMs are easier to serve at scale because providers can batch requests across many users.
- You should also consider the cost of shipping a solution with fine-tuning.
    - You also need to account for engineering time. If you spend one engineer-month fine-tuning something to reduce a yearly cost from $500 to $200, are you really saving money?
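The engineer-month question above is just break-even arithmetic. A minimal sketch, with entirely hypothetical numbers (the $15,000 fully loaded engineer-month is an assumption, not from the source):

```python
# Back-of-the-envelope break-even check (all numbers hypothetical).
# Fine-tuning saves (500 - 200) = $300/year in serving cost,
# but costs one engineer-month up front.

def breakeven_years(engineering_cost, yearly_saving):
    """Years of savings needed to recoup the one-off engineering cost."""
    return engineering_cost / yearly_saving

# Assume a fully loaded engineer-month of $15,000 (hypothetical).
years = breakeven_years(engineering_cost=15_000, yearly_saving=500 - 200)
print(f"{years:.0f} years to break even")  # 50 years
```

At these numbers the project never pays for itself; the calculation only flips when the yearly saving is a meaningful fraction of the engineering cost.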
- Fine-tuning may not solve your problem if prompt engineering can’t.
- If you cannot prompt engineer your way to generate a good response at least 10% of the time, fine-tuning will not save you.
    - Fine-tuning mostly changes the style of the response. It does not increase the model's capabilities or truly give it new knowledge.
- Try squeezing performance out of prompt engineering and function calling first.
    - You might not be doing prompt engineering correctly.
    - You can provide feedback to a general-purpose LLM. If you want to build a chatbot that writes a certain flavor of SQL, just execute the SQL and feed the error back to the chatbot. General-purpose LLMs often incorporate such feedback well.
- There are many pitfalls that come along with fine-tuning.
- For prompt engineering, it is obvious if it works. If it does not work, change the prompt.
- For fine-tuning, if it does not work, it is not easy for a beginner to know what went wrong. There are many places where it could go wrong.
- Collecting a fine-tuning dataset is hard.
- Collecting a good fine-tuning dataset is not easy: if writing one good 100-word prompt is hard, collecting 100 good examples is even harder.