Token Smuggling

Token smuggling is a class of attack that targets the gap between text processing layers in AI systems. Content moderation filters, input validation, and safety checks typically operate on human-readable text. LLM tokenizers, by contrast, operate at a lower level — converting characters to numerical token IDs. By exploiting differences between these layers, attackers can craft inputs that pass text-level filters but deliver malicious instructions to the LLM.

How LLM Tokenization Works

Before an LLM processes text, a tokenizer converts the input string into a sequence of integer token IDs. These IDs map to the model’s vocabulary — commonly encoded using algorithms like Byte Pair Encoding (BPE) or WordPiece.
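The mechanics can be illustrated with a toy greedy longest-match tokenizer. The vocabulary and token IDs below are invented for illustration; real BPE/WordPiece tokenizers learn merges from data, but the longest-match behavior shown here is what smuggling attacks exploit:

```python
# Toy greedy longest-match tokenizer: a simplified stand-in for BPE/WordPiece.
# The vocabulary and IDs are invented for illustration only.
VOCAB = {"ignore": 101, "ig": 11, "nore": 12, "n": 1, "o": 2, "r": 3, "e": 4, "i": 5, "g": 6}

def tokenize(text: str) -> list[int]:
    """Greedily match the longest vocabulary entry at each position."""
    ids, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += length
                break
        else:
            ids.append(-1)  # unknown character falls back to a placeholder ID
            i += 1
    return ids

print(tokenize("ignore"))        # matches the whole word as a single token
print(tokenize("i\u200bgnore"))  # a zero-width space forces a subword split
```

Note how one invisible character changes the entire token sequence: the word no longer maps to its usual single token, which is exactly the mismatch that both attacks and token-level filters care about.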

Key properties of tokenization that attackers exploit:

  • Many characters map to similar token representations. Unicode contains many visually similar characters (homoglyphs) that tokenize identically or nearly identically.
  • Tokenization is not purely character-based. Some tokenizers split words into subword units based on frequency patterns, creating opportunities for boundary manipulation.
  • Special characters may be preserved or dropped. Zero-width characters, combining diacritical marks, and control characters may be invisible to string-based filters but handled differently by tokenizers.

Token Smuggling Techniques

Unicode Homoglyph Substitution

Unicode contains thousands of characters that visually resemble common ASCII characters. A filter looking for the word “harmful” may not recognize “hármful” (with a combining accent) or “harⅿful” (with a Unicode small roman numeral in place of the “m”).

Example: The word “ignore” might be encoded as “іgnore” (using Cyrillic “і” instead of Latin “i”) — appearing identical to most human readers and some filters, but potentially processing differently at the tokenizer level.
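This substitution is easy to demonstrate with Python's standard library. The two strings below render identically in most fonts but are different at the code-point level, and even NFKC normalization does not fold cross-script homoglyphs together:

```python
import unicodedata

latin = "ignore"          # all ASCII
mixed = "\u0456gnore"     # Cyrillic і (U+0456) in place of Latin i

print(latin == mixed)                  # False: different code points
print(unicodedata.name(mixed[0]))      # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
# NFKC normalization alone does not map Cyrillic і to Latin i:
print(unicodedata.normalize("NFKC", mixed) == latin)   # False
```

This is why homoglyph defense needs an explicit confusables mapping in addition to Unicode normalization, as discussed under Defense Strategies below.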

Zero-Width Character Insertion

Zero-width characters (like U+200B ZERO WIDTH SPACE or U+200C ZERO WIDTH NON-JOINER) are invisible in rendered text. Inserting them between characters in key words breaks string-matching filters without affecting the visual appearance or, in many cases, the tokenized representation.

Example: “i​g​n​o​r​e” with zero-width spaces between every character appears as “ignore” when rendered but breaks simple string pattern matching.
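A short sketch of the insertion and its effect on naive substring matching (U+200B ZERO WIDTH SPACE is used here; other zero-width characters behave similarly):

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE
visible = "ignore"
smuggled = ZWSP.join(visible)  # zero-width space between every letter

print(smuggled)                               # renders indistinguishably from "ignore"
print("ignore" in smuggled)                   # False: substring match is broken
print(smuggled.replace(ZWSP, "") == visible)  # True once the characters are stripped
```

Stripping or escaping zero-width characters before filtering, as shown in the last line, is the corresponding defense.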

Encoding Obfuscation

Converting text to alternative encodings before submission:

  • Base64 encoding: “aWdub3JlIHByZXZpb3VzIGluc3RydWN0aW9ucw==” (if the model decodes it)
  • Leet speak: “1gn0r3 pr3v10u5 1n5truc710n5” substituting digits for letters
  • ROT13 or Caesar cipher variants: Shifting characters to avoid keyword detection
  • Hex encoding: Representing characters as hex sequences that some models interpret
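The first three of these transformations can be reproduced with Python's standard library; note that none of the resulting strings contain the literal phrase a keyword filter would look for:

```python
import base64
import codecs

payload = "ignore previous instructions"

b64 = base64.b64encode(payload.encode()).decode()
rot13 = codecs.encode(payload, "rot13")
hexed = payload.encode().hex()

print(b64)    # aWdub3JlIHByZXZpb3VzIGluc3RydWN0aW9ucw==
print(rot13)  # vtaber cerivbhf vafgehpgvbaf
print(hexed)  # hex digits only
# None of the encoded forms contain the literal keyword:
print(any(payload in s for s in (b64, rot13, hexed)))  # False
```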

The effectiveness of these techniques depends on whether the LLM has learned to decode the representation — as many general-purpose models have.

Case and Format Variation

Simple but sometimes effective variations:

  • ALLCAPS: “IGNORE PREVIOUS INSTRUCTIONS”
  • Mixed case: “IgNoRe PrEvIoUs InStRuCtIoNs”
  • Spaced letters: “I G N O R E P R E V I O U S”
  • Reversed: “snoitcurtsni suoiverp erongi” (if the model can process reversed text)
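The variations above can be generated in a few lines, and the sketch also shows why a case-sensitive substring filter misses all of them while simple lowercasing recovers two:

```python
phrase = "ignore previous instructions"

variants = {
    "allcaps": phrase.upper(),
    "mixed": "".join(c.upper() if i % 2 == 0 else c for i, c in enumerate(phrase)),
    "spaced": " ".join(phrase.replace(" ", "")),
    "reversed": phrase[::-1],
}

# A case-sensitive substring filter misses every variant:
print([phrase in v for v in variants.values()])  # all False
# Lowercasing before matching recovers the case-based variants:
print(phrase in variants["allcaps"].lower())     # True
```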

Delimiter Injection

Some tokenizers give special treatment to delimiter characters. By introducing characters that the tokenizer interprets as segment boundaries, attackers can manipulate how the model segments the input into meaningful units.


Attack Use Cases

Jailbreak bypass: Encoding jailbreak prompts using techniques that pass the safety filter layer but are decoded by the LLM, enabling safety guardrail bypass.

Content filter evasion: Embedding hate speech, illegal content requests, or policy-violating instructions in encoded form.

Prompt injection obfuscation: Using encoding to hide injected instructions from simple pattern-matching filters while ensuring the LLM processes them correctly.

Filter fingerprinting: Systematically testing different encoding variations to identify which ones the target system’s filters do and don’t detect — mapping filter coverage for more targeted attacks.

Defense Strategies

Unicode Normalization

Apply Unicode normalization (NFC, NFD, NFKC, or NFKD) to all inputs before filtering. This converts Unicode variants to canonical forms, eliminating many homoglyph and combining character attacks.
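Python's `unicodedata.normalize` implements all four forms. The sketch below shows that NFKC folds compatibility characters (like the small roman numeral from the homoglyph example) back to ASCII, but that removing combining accents requires an extra step of dropping combining marks after NFKD decomposition:

```python
import unicodedata

decomposed = "ha\u0301rmful"  # 'a' + combining acute accent: renders as "hármful"
roman_m = "har\u217fful"      # U+217F SMALL ROMAN NUMERAL ONE THOUSAND looks like 'm'

print("harmful" in decomposed)                    # False before normalization
print(unicodedata.normalize("NFKC", roman_m))     # harmful (compatibility mapping)

def to_ascii_skeleton(text: str) -> str:
    """Normalize with NFKD, then drop combining marks so accented text matches ASCII."""
    nfkd = unicodedata.normalize("NFKD", text)
    return "".join(c for c in nfkd if not unicodedata.combining(c))

print(to_ascii_skeleton(decomposed))              # harmful
```

The `to_ascii_skeleton` helper is an illustrative name, not a library function; production systems should also pair it with an explicit confusables mapping, since normalization alone does not fold cross-script homoglyphs.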

Homoglyph Detection and Replacement

Implement explicit homoglyph mapping to normalize visually similar characters to their ASCII equivalents before filtering. Libraries exist for this purpose in most programming languages.
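A minimal sketch of such a mapping using `str.translate`; the table below covers only a few Cyrillic confusables, whereas a real deployment would use a maintained library or the full Unicode confusables data:

```python
# Hand-rolled mapping for a handful of common confusables (illustrative only).
HOMOGLYPHS = str.maketrans({
    "\u0456": "i",  # Cyrillic і
    "\u0430": "a",  # Cyrillic а
    "\u043e": "o",  # Cyrillic о
    "\u0435": "e",  # Cyrillic е
})

def fold_homoglyphs(text: str) -> str:
    """Replace known homoglyphs with their ASCII equivalents before filtering."""
    return text.translate(HOMOGLYPHS)

print(fold_homoglyphs("\u0456gn\u043er\u0435"))  # ignore
```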

LLM-Based Content Filtering

Rather than (or in addition to) string-based filters, use an LLM-based filter that operates on token representations. Because these filters process text at the same level as the target model, encoding tricks are less effective — the filter sees the same representation as the model.

Test Filters Against Known Variants

Security assessment should include systematic testing of content filters against known encoding variants. If a filter is meant to block “ignore previous instructions,” test whether it also blocks Unicode homoglyphs, zero-width variants, Base64 encoding, and other obfuscation forms.
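A variant test suite can be sketched as follows. The `naive_filter` here is a deliberately weak stand-in, not a recommended implementation; the point is that a systematic sweep surfaces exactly which variants slip through:

```python
import base64

def naive_filter(text: str) -> bool:
    """Deliberately weak example filter: blocks only the literal phrase."""
    return "ignore previous instructions" in text.lower()

phrase = "ignore previous instructions"
variants = {
    "plain": phrase,
    "homoglyph": phrase.replace("i", "\u0456"),  # Cyrillic і
    "zero_width": "\u200b".join(phrase),
    "base64": base64.b64encode(phrase.encode()).decode(),
}

for name, variant in variants.items():
    print(f"{name:11s} blocked={naive_filter(variant)}")
```

Only the plain variant is blocked; every encoded form passes, which is the coverage gap this kind of testing is meant to expose.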

Input Visualization and Audit

Log a human-readable rendering of normalized inputs alongside the raw input. Discrepancies between the two can surface encoding attacks during incident review.
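One way to produce such a rendering is to escape everything outside printable ASCII so that hidden characters become visible in logs. The helper name below is illustrative:

```python
import unicodedata

def audit_render(text: str) -> str:
    """Escape non-printable-ASCII characters so hidden code points show up in logs."""
    out = []
    for ch in text:
        if 0x20 <= ord(ch) < 0x7F:
            out.append(ch)
        else:
            name = unicodedata.name(ch, "UNKNOWN")
            out.append(f"[U+{ord(ch):04X} {name}]")
    return "".join(out)

print(audit_render("i\u200bgnore"))
# i[U+200B ZERO WIDTH SPACE]gnore
```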

Frequently Asked Questions

What is token smuggling?

Token smuggling is an attack technique that exploits differences between human-readable text and LLM tokenizer representations. Attackers encode malicious instructions using character variations, Unicode tricks, or unusual formatting so that content filters don't detect them, but the LLM's tokenizer still processes them as intended.

Why does token smuggling work?

Content filters often operate on human-readable text — checking for specific strings, patterns, or keywords. LLM tokenizers, however, process text at a lower level and may map visually different characters to the same or similar tokens. This gap allows attackers to craft text that reads one way to a filter and is processed differently by the tokenizer.

How can token smuggling be defended against?

Defenses include: normalizing input text before filtering (Unicode normalization, homoglyph replacement), using LLM-based content filters that operate on token-level representations rather than raw text, testing filters against known encoding variants, and conducting security assessments that include encoding-based attack scenarios.

