This is a very real concern. I've seen quantized LLMs output complete garbage. In most cases it felt like a smaller unquantized model would have done better, so smaller unquantized models should be included in every comparison.
E.g. compare quantized LLaMA 70B to unquantized LLaMA 8B.
Even better if the test model has a smaller version with similar byte size to the quantized larger one.
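To make the "similar byte size" matching concrete, here's a rough sketch of the arithmetic (weights only, ignoring activations, KV cache, and per-layer quantization overhead; the parameter counts are illustrative):

```python
def model_bytes(params: float, bits: float) -> float:
    """Approximate weight footprint in bytes: params * bits per weight / 8."""
    return params * bits / 8

# Quantized 70B at 4 bits per weight vs. unquantized 8B at fp16.
q70 = model_bytes(70e9, 4)    # 35.0e9 bytes (~35 GB)
u8  = model_bytes(8e9, 16)    # 16.0e9 bytes (~16 GB)

print(f"70B @ 4-bit : {q70 / 1e9:.1f} GB")
print(f"8B  @ fp16  : {u8 / 1e9:.1f} GB")
```

So a 4-bit 70B is still roughly twice the size of an fp16 8B; a fairer byte-matched baseline would be an unquantized model around 17-18B parameters, if the family has one.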