This is a very real concern. I've seen quantized LLMs output complete garbage. In most cases it felt like a smaller unquantized model would have done better, so smaller unquantized models should be included in every comparison.
E.g. compare quantized LLaMA 70B to unquantized LLaMA 8B.
Even better if the test model has a smaller version with similar byte size to the quantized larger one.
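To make the "similar byte size" matching concrete, here's a rough sketch of the arithmetic (weights only, ignoring activations, KV cache, and per-layer quantization overhead; the parameter counts are illustrative):

```python
def model_bytes(params: float, bits: float) -> float:
    """Approximate weight footprint in bytes: params * bits per weight / 8."""
    return params * bits / 8

# Quantized 70B at 4 bits per weight vs. unquantized 8B at fp16.
q70 = model_bytes(70e9, 4)    # 35.0e9 bytes (~35 GB)
u8  = model_bytes(8e9, 16)    # 16.0e9 bytes (~16 GB)

print(f"70B @ 4-bit : {q70 / 1e9:.1f} GB")
print(f"8B  @ fp16  : {u8 / 1e9:.1f} GB")
```

So a 4-bit 70B is still roughly twice the size of an fp16 8B; a fairer byte-matched baseline would be an unquantized model around 17-18B parameters, if the family has one.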