This is one of those areas where the gap between theory and practice is enormous. The frameworks and best practices look clean in blog posts, but the actual implementation in real orgs involves dozens of edge cases and tradeoffs that the frameworks don't address. I'd be curious to hear from anyone who's gone through this at scale — what surprised you most about the gap between how it's supposed to work and how it actually works?
One (expensive) vendor claimed to support this stuff, and it flat out didn't work. After lots and lots of escalations, the fix ended up being with a massive terraform diff for us to deploy to prod.
The problem had nothing to do with our network infrastructure, deployment, etc, etc, it was completely their end.
I get the impression this standard is 10x more complicated than it needs to be, but the stuff it replaced was 100x too complicated.
Honestly though, I feel like it's all just a big hack working around splunk and similar systems' inability to understand summary statistics.
Less politely: If I had shell on the telemetry servers, I could probably get the same functionality from a perl one-liner that's easier to understand + maintain.