Self-Assessment¶

This self-assessment checks whether you can¶

Explain why AI systems require different security assumptions than traditional software
Describe how LLMs process input and generate output, including key limitations
Identify common failure modes in production AI systems and explain why they are hard to detect
Articulate why traditional security controls fall short for LLM-based applications
Outline the layered defence model for AI runtime security and justify why all three layers are necessary

Why AI is different¶

Question 1: Determinism and its absence

A traditional web application and an LLM-powered assistant both receive the same input twice in a row. The traditional application returns the same output both times. The LLM-powered assistant returns two slightly different responses.

Why is this difference fundamental to how we think about security, and not just a minor behavioural quirk?

Answer

Traditional software is deterministic: given the same input, it produces the same output every time. This predictability is the foundation that conventional testing, validation, and security controls are built on. AI systems based on LLMs are statistical: they select outputs based on probability distributions over tokens, not fixed rules. This means you cannot exhaustively test all possible behaviours, you cannot write exact output specifications, and you cannot guarantee that a system which behaved safely a moment ago will do so again. Security approaches that assume deterministic behaviour simply do not transfer.

How LLMs work¶

Question 2: The context window boundary

An enterprise deploys a customer support assistant backed by an LLM. The system retrieves relevant policy documents and inserts them into the prompt alongside the customer's question. One day, a complex query triggers retrieval of a policy document that, combined with the system prompt and conversation history, exceeds the model's context window.

What happens to the information that doesn't fit, and why is this dangerous?

Answer

The model proceeds without the information that was truncated or excluded. Critically, it does so silently. It will not warn the user that it is missing key context. It will not flag that its answer may be incomplete. It will generate a response with the same confidence and fluency as if it had the full picture. This means the system can give answers that contradict company policy or omit crucial details, and neither the user nor the system itself will have any indication that something went wrong.

Question 3: Same prompt, different output

You send the exact same prompt to an LLM API twice and receive two different responses. A colleague suggests this is a bug in the API. Is your colleague correct? Explain what is actually happening.

Answer

Your colleague is not correct. This is expected behaviour, not a bug. LLMs generate text by repeatedly selecting the next token from a probability distribution. Parameters like temperature control how much randomness is introduced into this selection process. With any temperature above zero, the model samples from the distribution rather than always picking the single most probable token. This means that even with identical input, the sequence of token selections can diverge at any point, producing different outputs. This non-determinism is a core characteristic of how these models work, not a fault.

Where things go wrong¶

Question 4: Identifying the failure mode

A legal research assistant is asked to find case law supporting a particular argument. It returns three case citations with full names, court references, and year of decision. A lawyer on the team checks the citations and discovers that two of the three cases do not exist. The assistant presented them with complete confidence and correct formatting.

Which failure mode does this represent: hallucination, confident uncertainty, or prompt injection?

Answer

This is hallucination. The model generated plausible but entirely fabricated case citations. It is not confident uncertainty (where the model expresses false certainty about a genuinely ambiguous situation) and it is not prompt injection (where a malicious input hijacks the model's behaviour). Hallucination is the model producing content that has no basis in its training data or provided context, while presenting it as factual. The legal formatting and apparent specificity make it particularly dangerous because the output looks exactly like a real citation.

Question 5: Why hallucination resists detection

Your team proposes building a filter that scans LLM outputs and flags hallucinated content before it reaches users. A senior engineer pushes back, saying this is harder than it sounds. Why is hallucination fundamentally difficult to detect through output inspection alone?

Answer

Hallucinated content is structurally identical to correct content. A fabricated case citation follows the same format as a real one. A made-up statistic looks exactly like a genuine one. There is nothing in the syntax, grammar, tone, or structure of the output that distinguishes truth from fabrication. The only way to verify the content is to check it against an authoritative external source, which requires knowing what kind of claim is being made, having access to the right source, and performing the lookup for every factual assertion. Simple pattern-matching or rule-based filters cannot solve this because the problem is not one of form but of factual accuracy.

Traditional security gaps¶

Question 6: The limits of input validation

A security team applies their standard input validation approach to an LLM-based system: they define allowed patterns, reject inputs that don't match, and sanitise those that do. Why does this approach break down for LLM-based systems in a way it doesn't for traditional APIs?

Answer

Traditional input validation works because inputs have a fixed schema. You know that a field should be an email address, a number within a range, or a selection from a list. You can define "valid" precisely and reject everything else. Natural language has no fixed schema. Users interact with LLMs in free-form text where the range of legitimate inputs is effectively unbounded. You cannot define what a "valid" prompt looks like because almost any string of text could be a reasonable user query. This means you cannot draw a clear line between acceptable and unacceptable input. Malicious instructions can be embedded in otherwise normal-looking text, and overly restrictive filters will block legitimate use. The fundamental concept of "valid input" becomes undefined.

The case for runtime security¶

Question 7: The three-layer defence model

The AI runtime security defence model consists of three layers. Name all three and explain why relying on any single layer alone leaves the system exposed.

Answer

The three layers are: Guardrails (~10ms, fast rule-based checks on inputs and outputs), Model-as-Judge (~500ms to 5s, an independent model evaluating whether the primary model's output is sound), and Human Oversight (minutes to hours, humans reviewing cases that automated controls cannot resolve).

Guardrails alone are fast but shallow: they catch known-bad patterns but miss semantic issues like hallucination or subtle prompt injection. Model-as-Judge is deeper but slower, and it still can't resolve genuinely ambiguous cases. Human oversight is thorough but doesn't scale to every request. Each layer compensates for the limitations of the others, and removing any one creates gaps that the remaining layers cannot close.

Synthesis¶

Question 8: Putting it together

A financial services company deploys an AI assistant that helps advisors draft client communications. The system retrieves client portfolio data from an internal database and uses it to generate personalised investment summaries. After three weeks in production, the following problems emerge:

The assistant occasionally references investment products the firm does not offer, presenting them as available options.
When a client's portfolio is unusually large and complex, the assistant sometimes omits key holdings from its summary without any warning.
An advisor discovers that by pasting specific text from a competitor's marketing email into the chat, the assistant begins recommending that competitor's products.

For each of the three problems, identify the failure mode at play. Then suggest what kind of runtime control could help address the combined risk.

Answer

The three problems map to distinct failure modes:

Hallucination. The assistant is fabricating investment products that do not exist in the firm's catalogue. The outputs are fluent and specific, making them hard to distinguish from real product references.
Context window overflow (silent information loss). When portfolio data exceeds what fits in the context window, the model silently drops holdings and generates a summary as though it had complete information. There is no error or warning.
Prompt injection. The competitor's marketing text contains phrasing that, when pasted into the conversation, effectively overrides the assistant's instructions and shifts its recommendations.

A runtime security layer would help by monitoring outputs against known product catalogues (catching hallucinated products), detecting when retrieved context is being truncated and flagging incomplete summaries, and identifying anomalous shifts in recommendation patterns that suggest the model's instructions have been overridden. No single pre-deployment or integration-level control can catch all three of these issues in a live environment, which is precisely why continuous runtime monitoring is necessary.

Ready to continue¶

You now have the foundational mental model for AI runtime security: why AI systems behave differently from traditional software, how those differences create new categories of failure, why conventional security tools cannot close the gap, and what a layered defence model looks like.

The next step is to see these concepts play out in a realistic production failure. The Phantom Compliance scenario walks through a specific case where an AI system fails in ways that look perfectly normal on the surface. Everything you have covered in this track (hallucination, silent context loss, the limits of input validation, the need for runtime controls) will become concrete as you work through what went wrong and why existing safeguards did not catch it.

Enter the Scenario →