One of Anthropic's ostensible ethical goals is to produce AI that is "understandable" as well as exceptionally "well-aligned". It's striking that some of the same properties that make AI risky also make it hard to consistently deliver a good product. It occurs to me that if Anthropic really makes some breakthroughs in those areas, everyone will feel it in terms of product quality, whether they're worried about grandiose/catastrophic predictions or not.
But right now it seems like, in the case of (3), these systems are really sensitive and unpredictable. I'd characterize that as an alignment problem, too.
Huh? I never claimed they made a breakthrough. My point is that if they really have an alignment or interpretability breakthrough, they won't have to tell us by virtue signaling about how safe it is. Users will just be able to tell, because it will eliminate or drastically reduce problems like #3 in the OP. The fact that the outcomes of prompt changes remain unpredictable tells us it's still a black box.