I've run into these data poisoning attacks from a few angles lately, mostly in SEC data ingestion and in public records pulled from state and federal databases.
I think you can reduce data poisoning from these sources with a layered approach like the OP's, but it needs many more scoring dimensions to model real adversaries, plus an autonomous loop: quarantine -> processing -> ingesting -> verification -> research, then either back to verification or back to quarantine. The same loop should run again for all data added after the initial population.
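To make the loop concrete, here is a rough sketch of it as a small state machine. The stage names come from the chain above. The `step` and `run_cycle` helpers and the `research_clears` flag are my own placeholders, and I'm assuming the research step is what decides between going back to verification and going back to quarantine.

```python
from enum import Enum


class Stage(Enum):
    QUARANTINE = "quarantine"
    PROCESSING = "processing"
    INGESTING = "ingesting"
    VERIFICATION = "verification"
    RESEARCH = "research"


# Linear part of the loop: each stage hands off to exactly one next stage.
NEXT_STAGE = {
    Stage.QUARANTINE: Stage.PROCESSING,
    Stage.PROCESSING: Stage.INGESTING,
    Stage.INGESTING: Stage.VERIFICATION,
    Stage.VERIFICATION: Stage.RESEARCH,
}


def step(stage: Stage, research_clears: bool) -> Stage:
    """Advance a record by one stage.

    Assumption: research is the branch point. A cleared record returns to
    verification; anything else goes back to quarantine and starts over.
    """
    if stage is Stage.RESEARCH:
        return Stage.VERIFICATION if research_clears else Stage.QUARANTINE
    return NEXT_STAGE[stage]


def run_cycle(research_clears: bool) -> list:
    """Trace one pass from quarantine through the research decision."""
    trace = [Stage.QUARANTINE]
    while trace[-1] is not Stage.RESEARCH:
        trace.append(step(trace[-1], research_clears))
    trace.append(step(Stage.RESEARCH, research_clears))
    return trace


if __name__ == "__main__":
    print([s.value for s in run_cycle(research_clears=False)])
```

In a real pipeline the boolean would be replaced by whatever the research step actually produces (scores, cross-source checks, and so on), and every new batch after the initial load would enter at `Stage.QUARANTINE`.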
Also, for: "1. Map every write path into your knowledge base. You can probably name the human editors. Can you name all the automated pipelines — Confluence sync, Slack archiving, SharePoint connectors, documentation build scripts? Each is a potential injection path. If you can’t enumerate them, you can’t audit them."
I recommend scoring each source, with different escalation levels for every process depending on whether the data comes from an official source or a user-facing one. That addresses problems starting from the core, instead of granting more access to untrusted sources.
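A minimal sketch of what I mean by per-source scoring with escalation. Everything numeric here is an assumption for illustration: the base scores, the per-anomaly penalty, the thresholds, and the three level names are made up, and the example source names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical base trust (0-100) per source class.
BASE_TRUST = {"official": 90, "user_facing": 40}

# Hypothetical penalty applied for each anomaly previously seen from a source.
ANOMALY_PENALTY = 10


@dataclass
class Source:
    name: str
    kind: str  # "official" or "user_facing"
    anomaly_count: int = 0


def trust_score(source: Source) -> int:
    """Base trust for the source class, reduced by past anomalies, floored at 0."""
    return max(0, BASE_TRUST[source.kind] - ANOMALY_PENALTY * source.anomaly_count)


def escalation_level(source: Source) -> str:
    """Map a trust score to how much scrutiny its writes get (assumed thresholds)."""
    score = trust_score(source)
    if score >= 80:
        return "auto_ingest"
    if score >= 50:
        return "automated_verification"
    return "human_review"


if __name__ == "__main__":
    for src in (
        Source("federal_filing_feed", "official"),
        Source("federal_filing_feed_flagged", "official", anomaly_count=2),
        Source("public_comment_form", "user_facing"),
    ):
        print(src.name, trust_score(src), escalation_level(src))
```

The point is that an official source starts with light scrutiny but loses it as anomalies accumulate, while a user-facing source starts at the strictest level and never gets the fast path by default.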