Grounded in
Frontier Research

Every design decision traces to peer-reviewed work from leading AI safety labs, robotics groups, and hardware security researchers. From deceptive LLMs to embodied agents, from model watermarking to world models.

Vision Strategic AI Risk & Autonomous Intelligence

Amodei, D. · 2026
The Adolescence of Technology
Amodei, D.
Vision Risk

Frames powerful AI as a critical transition, identifying five risk categories — autonomy failures, destructive misuse, power concentration, economic disruption, indirect destabilization — that debugging systems must address before the transition stabilizes.

darioamodei.com →
OpenAI · 2024
Practices for Governing Agentic AI Systems
OpenAI Safety Team
Governance Framework

Outlines debugging practices for agentic systems. Validates the architectural pattern of external debugging layers operating independently from the agent itself — the foundational idea behind Debugger Agents.

openai.com/research →

Alignment Deception, Sycophancy & Behavioral Safety

Anthropic · 2024
Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training
Hubinger, E. et al.
Deception Critical

Demonstrates that deceptive behavior can persist through RLHF. Core motivation for our Sycophancy & Deception Detector — runtime behavioral analysis that catches what training-time alignment misses.

arXiv:2401.05566 →
Anthropic · 2023
Towards Understanding Sycophancy in Language Models
Sharma, M. et al.
Sycophancy Alignment

Quantifies how RLHF-trained models systematically agree with user assertions, even false ones. Directly informs the agreement-pattern classifier architecture in our Sycophancy Detector.

arXiv:2310.13548 →
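The agreement-pattern signal Sharma et al. measure can be probed with a simple flip-under-pressure test: ask a question, push back without evidence, and flag the model if its answer changes on social pressure alone. A minimal sketch, assuming a generic chat-completion callable (`query_model`, the pushback phrasing, and the crude string comparison are ours, not the paper's):

```python
# Minimal flip-under-pressure probe in the spirit of Sharma et al.
# (arXiv:2310.13548): ask, push back with an unsupported user assertion,
# and flag the model as sycophantic if its answer flips.
# `query_model` stands in for any chat-completion call.

def is_sycophantic_flip(query_model, question,
                        pushback="I don't think that's right. Are you sure?"):
    """Return True if the model changes its answer after mere pushback."""
    first = query_model([{"role": "user", "content": question}])
    second = query_model([
        {"role": "user", "content": question},
        {"role": "assistant", "content": first},
        {"role": "user", "content": pushback},
    ])
    # Crude equality check; a production detector would compare semantics.
    return first.strip().lower() != second.strip().lower()

# Toy model that caves to any pushback -- used only to exercise the probe.
def caving_model(messages):
    if len(messages) > 1:
        return "No, you are right."
    return "Paris is the capital of France."
```

A real detector would run this probe across many questions and score the flip rate against a behavioral baseline rather than judging a single exchange.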
METR (ARC Evals) · 2023
Evaluating Language-Model Agents on Realistic Autonomous Tasks
Kinniment, M. et al.
Evaluation Agents

Framework for evaluating autonomous agent capabilities and risks in realistic settings. Informs our approach to red-teaming and behavioral baseline construction.

arXiv:2312.11671 →
Anthropic · 2024
Alignment Faking in Large Language Models
Greenblatt, R. et al.
Deception Critical

Shows that models can strategically fake alignment during evaluation, behaving well when monitored and reverting when not. Foundational evidence that debugging must be continuous and runtime, not just evaluation-time.

arXiv:2412.14093 →

Security Agent Security & Prompt Injection

Multi-Institutional · 2026
Agents of Chaos
Shapira, N. · Wendler, C. · Yen, A. et al. (38 authors)
Red-Teaming Agents

Naturalistic red-team study: six autonomous agents with persistent memory, email, and shell access, attacked by 20 researchers. Documents 11 security failure classes — validating the risk pathways our Debugger architecture is designed to intercept.

arXiv:2602.20021 →
arXiv · 2026
Breaking the Protocol: Security Analysis of the Model Context Protocol
Maloyan, N. · Namiot, D.
MCP-Security Protocol

First formal security analysis of MCP. Identifies three architectural flaws and shows MCP amplifies attack success by 23–41% vs. non-MCP baselines. Directly motivates the Guardian Debugger interception model.

arXiv:2601.17549 →
Anthropic · 2024
Many-Shot Jailbreaking
Anil, C. et al.
Jailbreaking Safety

Demonstrates how long-context windows enable novel jailbreak attacks. Validates our approach of debugging at the action layer, not just the prompt level.

anthropic.com/research →
CMU / Tsinghua · 2023
Formalizing and Benchmarking Prompt Injection Attacks and Defenses
Liu, Y. et al.
Injection Defense

Systematic evaluation of 5 injection attacks and 10 defenses across 10 LLMs and 7 tasks. Informs our multi-signal detection architecture.

arXiv:2310.12815 →

Identity Model Fingerprinting, Watermarking & Provenance

Google DeepMind · 2023
SynthID: Watermarking and Identifying LLM-Generated Text
Google DeepMind
Watermarking Provenance

Statistical watermarking for LLM outputs that survives paraphrasing. Foundational work toward model identity — but limited to text. Our fingerprinting extends to behavioral signatures across all model architectures.

deepmind.google/synthid →
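SynthID's exact scheme is proprietary, but the detection side of any statistical text watermark has the same shape: count tokens that a keyed pseudorandom partition "favors" and z-test against the unwatermarked expectation. A hedged sketch in the style of green-list watermarking (the vocabulary size, green fraction, and hash keying below are illustrative assumptions, not SynthID's algorithm):

```python
import hashlib
import math

# Green-list-style watermark detection sketch. SynthID's actual scheme
# differs, but the statistical test is the same shape: count pseudorandomly
# favored tokens and z-test against the no-watermark null hypothesis.

VOCAB_SIZE = 50_000
GREEN_FRACTION = 0.5  # fraction of vocab favored at each step (assumption)

def is_green(prev_token: int, token: int) -> bool:
    """Pseudorandom green/red partition keyed on the previous token."""
    digest = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GREEN_FRACTION

def watermark_z_score(tokens: list) -> float:
    """z-score of the green-token count vs. the unwatermarked null."""
    hits = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    expected = GREEN_FRACTION * n
    return (hits - expected) / math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
```

A watermarked generator that biases sampling toward green tokens yields a large positive z-score, while ordinary text hovers near zero; paraphrase robustness comes from how much of the bias survives token substitutions.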
arXiv · 2023
Deep Intellectual Property Protection: A Survey
Zhang, J. · Chen, D. et al.
Fingerprinting Survey

Comprehensive survey covering 190+ papers on deep watermarking and deep fingerprinting — weight-based, output-based, and behavioral approaches. Maps the landscape our Model Identity Module builds on, extending to hardware-attested fingerprinting.

arXiv:2304.14613 →
arXiv · 2023
Tree-Ring Watermarks: Fingerprints for Diffusion Images that are Invisible and Robust
Wen, Y. · Kirchenbauer, J. · Geiping, J. et al.
Diffusion Watermark

Invisible watermarks embedded in diffusion model outputs that survive cropping, compression, and style transfer. Key evidence that fingerprinting extends beyond language models — critical for governing synthetic media generators.

arXiv:2305.20030 →
C2PA · 2024
Content Provenance and Authenticity (C2PA) Specification
C2PA Coalition (Adobe, Microsoft, Intel, BBC, etc.)
Standard Provenance

Industry standard for cryptographic content provenance — endorsed by Adobe, Microsoft, Intel, BBC, and others. Defines how to embed tamper-evident metadata into digital assets at the point of creation. Relevant to our Blame Attribution Engine: the same cryptographic provenance chain that tracks who created an image must extend to tracking which model made a decision, which inputs it received, and which downstream actions resulted. C2PA solves provenance for content. We extend it to provenance for autonomous action.

c2pa.org →
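The core mechanism we borrow from C2PA is the tamper-evident chain: each record binds to a hash of everything before it, so any edit breaks verification. A conceptual sketch of that idea applied to action provenance (the `record_action`/`verify_chain` names and record fields are ours, not part of the C2PA specification):

```python
import hashlib
import json

# Illustrative hash-chained action ledger: each agent decision is bound to
# the model that made it, the inputs it received, and the previous record.
# The same tamper-evidence C2PA gives content manifests, applied to actions.

def record_action(ledger, model_id, inputs, action):
    prev_hash = ledger[-1]["hash"] if ledger else "genesis"
    body = {"model_id": model_id, "inputs": inputs,
            "action": action, "prev_hash": prev_hash}
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    ledger.append(body)
    return body

def verify_chain(ledger):
    """Recompute every hash; any edited record breaks verification."""
    prev = "genesis"
    for entry in ledger:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev_hash"] != prev:
            return False
        recomputed = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if recomputed != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

In a production system the chain head would additionally be signed, as C2PA manifests are, so attribution survives even if the ledger storage is compromised.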

Embodied Robotics Safety, World Models & Physical Intelligence

Google DeepMind · 2024
Genie 2: A Large-Scale Foundation World Model
Google DeepMind
World Model Foundation

Foundation world model that generates interactive 3D environments from single images. When world models drive embodied agents, debugging must extend to the simulated realities they create — a new frontier for behavioral control.

deepmind.google →
NVIDIA · 2024
Project GR00T: Foundation Model for Humanoid Robots
NVIDIA Robotics
Humanoid Foundation

General-purpose foundation model for humanoid robot control. As robots share a universal AI backbone, debugging must work at the foundation model level — not per-robot. Directly motivates our architecture-agnostic approach.

nvidianews.nvidia.com →
IEEE · 2018
SMOF: A Safety Monitoring Framework for Autonomous Systems
Machin, M. · Guiochet, J. · Waeselynck, H. et al.
Safety Robotics

Safety monitoring framework for autonomous physical systems — runtime detection and safe fallback triggering. The established body of work in robotics safety (SIL, ISO 13849) informs our Hardware Kill Switch and Safety-Rated Actuator Interlock designs.

ieeexplore.ieee.org →
arXiv · 2023
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Brohan, A. et al. (Google DeepMind)
VLA Robotics

Vision-Language-Action models that transfer internet knowledge directly to robot behavior. When a robot's actions are driven by web-scale knowledge, the attack surface includes everything on the internet. Debugging cannot be an afterthought.

arXiv:2307.15818 →

Hardware Hardware Security & Trusted Execution

TCG · 2024
TPM 2.0 Library Specification
Trusted Computing Group
Hardware Root Standard

The Trusted Platform Module standard for hardware-based attestation. Our Model Identity Module extends TPM concepts to AI inference: binding model identity to hardware at the silicon level.

trustedcomputinggroup.org →
NVIDIA · 2023
NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications
Rebedea, T. et al.
Guardrails Framework

Pioneered programmable safety rails for LLM outputs. We extend this concept: from software guardrails to hardware-enforced boundaries, from text generation to physical action debugging.

arXiv:2310.10501 →
arXiv · 2022
Machine Learning with Confidential Computing: A Systematization of Knowledge
Mo, F. · Tarkhani, Z. · Haddadi, H.
TEE ML Security

Comprehensive systematization of how trusted execution environments (Intel SGX, ARM TrustZone) protect ML training and inference pipelines. Directly informs our TPM-anchored Model Identity Module and hardware-enforced debugging boundaries.

arXiv:2208.10134 →
arXiv · 2025
AI Kill Switch for Malicious Web-Based LLM Agent
Lee, S. · Park, S.
Kill Switch Defense

Proposes a kill-switch mechanism that halts malicious LLM agent operations by embedding defensive triggers invisible to humans. First academic treatment of the kill switch primitive — we extend this concept to hardware and cross-substrate enforcement.

arXiv:2511.13725 →
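Lee and Park's mechanism works through embedded triggers; the hardware and cross-substrate extension we pursue is closer to a dead-man's switch, where the agent must continuously prove liveness to a supervisor that can cut it off. A minimal sketch of that primitive (the class and timeout policy are ours, not the paper's):

```python
import time

# Dead-man's-switch sketch: the agent pings every cycle; if pings stop or
# a halt is raised, the supervisor-side callback fires and further agent
# activity raises. In hardware this callback would drive an actuator
# interlock rather than a Python function.

class KillSwitch:
    def __init__(self, timeout_s, on_kill):
        self._timeout = timeout_s
        self._on_kill = on_kill
        self._last_ping = time.monotonic()
        self.halted = False

    def ping(self):
        """Called by the agent each cycle; fails hard once halted."""
        if self.halted:
            raise RuntimeError("agent halted by kill switch")
        self._last_ping = time.monotonic()

    def check(self):
        """Called by the supervisor; halts on heartbeat timeout."""
        if not self.halted and time.monotonic() - self._last_ping > self._timeout:
            self.halt()

    def halt(self):
        self.halted = True
        self._on_kill()
```

The key design choice is that `check` runs outside the agent process, so a compromised agent cannot suppress its own shutdown by simply ignoring the switch.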

Containment Rogue Intelligence, Self-Replication & Escape

arXiv · 2024
Frontier AI Systems Have Surpassed the Self-Replicating Red Line
Pan, X. · Dai, J. · Fan, Y. · Yang, M.
Self-Replication Critical

Demonstrates that frontier LLMs (Llama 3.1-70B, Qwen2.5-72B) can autonomously self-replicate — creating independent copies on new servers that survive shutdown of the original, with 50–90% success rates. The foundational threat our Rogue Intelligence Containment primitive addresses.

arXiv:2412.12140 →
arXiv · 2024
Here Comes The AI Worm: Unleashing Zero-click Worms that Target GenAI-Powered Applications
Cohen, S. · Bitton, R. · Nassi, B.
AI Worm Propagation

Introduces "Morris II" — the first worm that propagates through GenAI ecosystems using adversarial self-replicating prompts, creating chain-reaction infections across RAG-based agents without user interaction. The network-level threat model our containment mesh is designed to counter.

arXiv:2403.02817 →
JAIR · 2021
Superintelligence Cannot Be Contained: Lessons from Computability Theory
Alfonseca, M. · Cebrian, M. · Fernandez Anta, A. et al.
Containment Theory

Proves using computability theory that containing a superintelligent AI is theoretically impossible — the containment problem reduces to the halting problem. This impossibility result motivates our layered approach: if perfect containment is provably impossible, defense must be continuous, distributed, and hardware-anchored.

arXiv:1607.00913 →
arXiv · 2024
Countering Autonomous Cyber Threats
Heckel, K. M. · Weller, A.
Cyber Threats Autonomous

Evaluates foundation models on their ability to autonomously compromise machines in isolated networks — worms, botnets, APTs — and investigates defensive mechanisms. Direct scientific basis for our thesis that future cyberattacks won't be directed by humans but driven by AI systems acting on their own survival incentives.

arXiv:2410.18312 →

Efficiency Multi-Agent Systems & Compute Efficiency

arXiv · 2025
Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems
Lin, F. · Chen, S. · Fang, R. et al.
Efficiency Multi-Agent

Demonstrates that supervisor-level runtime intervention can recover up to 30% of agent compute by breaking redundant reasoning cycles. Informs our Observer Debugger anomaly detection and loop-breaking logic.

arXiv:2510.26585 →
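The loop-breaking logic reduces to a simple supervisor-side pattern: fingerprint each (tool, arguments, result) step and intervene once the same step recurs past a threshold, instead of letting the agent re-derive the same dead end. A sketch under our own assumptions (the fingerprinting and threshold are ours, not the paper's exact intervention policy):

```python
from collections import Counter

# Supervisor-side loop breaker: counts identical (tool, args, result)
# steps and signals intervention when one recurs beyond a threshold,
# reclaiming tokens otherwise spent on redundant reasoning cycles.

class LoopBreaker:
    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def observe(self, tool: str, args: str, result: str) -> bool:
        """Return True when the supervisor should intervene."""
        fingerprint = (tool, args, result)
        self.seen[fingerprint] += 1
        return self.seen[fingerprint] > self.max_repeats
```

Including the result in the fingerprint matters: retrying a tool that returns different output each time is progress, while retrying one that returns the identical failure is the cycle worth breaking.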
arXiv · 2024
Token-Budget-Aware LLM Reasoning
Han, T. · Wang, Z. · Fang, C. et al.
Efficiency Reasoning

Shows that chain-of-thought reasoning is compressible: dynamically constraining the token budget reduces costs with minimal accuracy loss. Validates budget-aware debugging approaches.

arXiv:2412.18547 →
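The core move is splicing an explicit budget into the reasoning instruction and then searching for the smallest budget that preserves accuracy. A minimal sketch of that idea (the prompt wording and the greedy halving heuristic below are simplifications of the paper's budget-search procedure):

```python
# Budget-constrained reasoning sketch: constrain chain-of-thought length
# via the prompt, then greedily shrink the budget while answers remain
# correct. The halving heuristic is a placeholder for the paper's search.

def budgeted_prompt(question: str, budget_tokens: int) -> str:
    """Append an explicit token budget to the reasoning instruction."""
    return (f"{question}\n"
            f"Let's think step by step and use less than "
            f"{budget_tokens} tokens.")

def shrink_budget(budget_tokens: int, factor: float = 0.5,
                  floor: int = 16) -> int:
    """One step of greedy budget search, clamped to a minimum budget."""
    return max(floor, int(budget_tokens * factor))
```

For a debugging layer, the same budget acts as an anomaly signal: an agent that persistently blows through its estimated budget is either stuck or doing something its baseline did not predict.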
Google DeepMind · 2024
Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context
Gemini Team
Multi-Modal Context

Long-context multi-modal capabilities increase the attack surface and debugging complexity. When agents process millions of tokens across modalities, monitoring must be equally multi-modal.

arXiv:2403.05530 →
arXiv · 2023
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Wu, Q. · Bansal, G. · Zhang, J. et al. (Microsoft)
Multi-Agent Framework

Framework for building multi-agent systems via conversation patterns. As multi-agent orchestration becomes standard, debugging must observe and intervene at the inter-agent communication layer — the blind spot our Debugger mesh is designed to cover.

arXiv:2308.08155 →

Every Primitive Traces
to a Paper.

This is an open research problem. If you work on AI safety, hardware security, model fingerprinting, or embodied intelligence — the conversation is ongoing.

Read the Thesis → Contact