AI SecurityLLMPrompt InjectionRed Teaming

Prompt Injection: How Attackers Manipulate AI Systems

KH

Kat Ho

Offensive Security Analyst · 18 March 2026

Prompt injection is similar to SQL injection in the sense that both rely on crafted text input to manipulate a system. It occurs when an individual crafts prompts designed to manipulate a model into exposing data, conducting malicious actions, or producing unethical outputs.

The key difference is that prompt injection doesn't require knowledge of code or syntax. It only requires an understanding of how a Large Language Model (LLM) can be manipulated.

LLMs don't run on code. They run on plain-text instructions. This means that any guardrails put in place can potentially be exploited. Guardrails also vary in strength, and the more basic ones can be bypassed even by those with limited technical knowledge. Since LLMs operate on instructions and tokenisation based on probability, there is always a chance they can be broken and manipulated.

Common Manipulation Techniques

  • Encoding tactics such as Base64, URL encoding, and HTML entities can be used to disguise restricted content.
  • Synonyms can bypass models that rely on wordlist-based rules.
  • Role-playing involves convincing the AI to adopt a character or story that leads it to produce outputs it otherwise wouldn't.
  • Typos and text variations are another simple way to slip past word restriction rules.

Securing LLMs

As LLMs become more widely adopted across businesses, having proper guardrails in place is more important than ever.

There are many ways to secure an LLM, but each comes with its own trade-offs. Text-based instruction rules that restrict certain words, for example, can easily be bypassed using synonyms or encoding.

One of the more robust approaches is Reinforcement Learning from Human Feedback (RLHF), a training method that uses human preferences as a reward signal to fine-tune models toward safer and more aligned behaviour. Pairing this with a human-in-the-loop system, where suspicious outputs are flagged and reviewed before reaching the end user, can further reduce the risk of malicious actions slipping through. That said, this approach can be costly and raises valid privacy concerns that need to be addressed.

Beyond model-level defences, ensuring that sensitive data is not fed into the Retrieval Augmented Generation (RAG) system is equally important. The system should only have access to the data it needs to function, minimising the risk of data leakage.

Prompt injection is a growing concern that is going to become more relevant as AI becomes further embedded in everyday business operations. Understanding how these attacks work is the first step toward building safer systems.

How Stealth Cyber Helps

At Stealth Cyber, we don't just talk about AI security. We test it. Our AI Red Team conducts real-world adversarial assessments against LLMs, AI agents, and generative AI systems, testing for the exact techniques outlined above and more.

Our approach goes beyond automated scanning. We manually craft prompt injection attacks, test for data extraction and leakage, simulate role-playing exploits, and probe your AI's guardrails to find where they break. Every engagement produces a detailed, risk-rated report with practical remediation guidance your team can act on immediately.

Our testing methodology aligns to the AIUC-1 standard, covering security, safety, reliability, accountability, and data privacy. Whether you're running a customer-facing chatbot, an internal AI assistant, or an AI-powered workflow, we help you understand exactly how your AI can be exploited and what to do about it.

For organisations that are still in the early stages of AI adoption, our AI Readiness Assessment helps you evaluate your security posture, data governance, and risk appetite before you deploy, so you can move forward with confidence rather than hope.

Is Your AI Secure?

Stealth Cyber's AI Red Team tests your LLMs and AI systems against real-world prompt injection, jailbreaks, data extraction, and more. Find out how your AI holds up before an attacker does.