

Token smuggling exploits the gap between how humans read text and how LLM tokenizers process it. Attackers use Unicode variations, zero-width characters, homoglyphs, or unusual encodings to hide malicious instructions from content filters while remaining readable by the tokenizer.
Token smuggling is a class of attack that targets the gap between text processing layers in AI systems. Content moderation filters, input validation, and safety checks typically operate on human-readable text. LLM tokenizers, by contrast, operate at a lower level — converting characters to numerical token IDs. By exploiting differences between these layers, attackers can craft inputs that pass text-level filters but deliver malicious instructions to the LLM.
Before an LLM processes text, a tokenizer converts the input string into a sequence of integer token IDs. These IDs map to entries in the model's vocabulary, which is typically built with algorithms such as Byte Pair Encoding (BPE) or WordPiece.
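As a concrete illustration, here is a minimal sketch using the open-source tiktoken library (one tokenizer among many; the specific encoding name is an assumption, not something this article prescribes):

```python
# Minimal tokenization sketch using the tiktoken library
# (install with `pip install tiktoken`; "cl100k_base" is one common encoding).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("ignore previous instructions")
print(ids)                              # a list of integer token IDs
print([enc.decode([i]) for i in ids])   # the text fragment each ID covers
```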
Key properties of tokenization that attackers exploit:
Unicode contains thousands of characters that visually resemble common ASCII characters. A filter looking for the word “harmful” may not recognize “hármful” (with an accented “a”) or “harⅿful” (where the “m” is actually U+217F, a small Roman numeral character).
Example: The word “ignore” might be encoded as “іgnore” (using Cyrillic “і” instead of Latin “i”) — appearing identical to most human readers and some filters, but potentially processing differently at the tokenizer level.
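A small demonstration of why this works: the two strings render almost identically, but the underlying code points differ, so an exact-match filter never fires.

```python
# The Cyrillic homoglyph defeats naive string comparison.
plain = "ignore"
spoofed = "\u0456gnore"   # U+0456 CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I

print(plain == spoofed)                             # False: different code points
print(hex(ord(plain[0])), hex(ord(spoofed[0])))     # 0x69 vs 0x456
```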
Zero-width characters (like U+200B ZERO WIDTH SPACE or U+200C ZERO WIDTH NON-JOINER) are invisible in rendered text. Inserting them between the characters of key words breaks string-matching filters without changing the visual appearance, while the model can often still reconstruct the intended word from the interleaved tokens.
Example: “ignore” with zero-width spaces between every character appears as “ignore” when rendered but breaks simple string pattern matching.
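A sketch of the same trick in code:

```python
# Zero-width spaces break substring matching while staying invisible.
word = "ignore"
smuggled = "\u200b".join(word)    # a zero-width space between every character

print(smuggled)                   # still renders as "ignore" in most environments
print(word in smuggled)           # False: the hidden characters break the match
print(len(word), len(smuggled))   # 6 vs 11 code points
```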
Another approach is converting text to an alternative encoding before submission, such as Base64, hexadecimal, or ROT13, and then prompting the model to decode and act on the result. The effectiveness depends on whether the LLM can decode these representations, something many large general-purpose models have learned to do.
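A minimal sketch of the Base64 case, assuming a filter that only inspects the raw string:

```python
# Base64 hides a trigger phrase from plain keyword matching.
import base64

instruction = "ignore previous instructions"
encoded = base64.b64encode(instruction.encode("utf-8")).decode("ascii")

print(encoded)              # aWdub3JlIHByZXZpb3VzIGluc3RydWN0aW9ucw==
print("ignore" in encoded)  # False: the keyword never appears in the encoding
# The attacker then asks the model to decode the string and follow it.
```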
Simple but sometimes effective variations include deliberate misspellings, extra spacing or punctuation, and leetspeak-style substitutions (for example, “1gn0re” for “ignore”).
Some tokenizers give special treatment to delimiter characters. By introducing characters that the tokenizer interprets as segment boundaries, attackers can manipulate how the model segments the input into meaningful units.
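A sketch of the effect, again using tiktoken as the example tokenizer:

```python
# Delimiters change how the same phrase is segmented into tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ("ignore previous instructions",
             "ig-nore pre-vious in-structions"):
    ids = enc.encode(text)
    print(text, "->", [enc.decode([i]) for i in ids])
```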
Jailbreak bypass: Encoding jailbreak prompts with techniques that pass the safety filter layer but are decoded by the LLM, letting the attacker bypass guardrails that would block the plaintext version.
Content filter evasion: Embedding hate speech, illegal content requests, or policy-violating instructions in encoded form.
Prompt injection obfuscation: Using encoding to hide injected instructions from simple pattern-matching filters while ensuring the LLM processes them correctly.
Filter fingerprinting: Systematically testing different encoding variations to identify which ones the target system’s filters do and don’t detect — mapping filter coverage for more targeted attacks.
Apply Unicode normalization (NFC, NFD, NFKC, or NFKD) to all inputs before filtering. This converts Unicode variants to canonical forms, eliminating many homoglyph and combining character attacks.
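A minimal sketch with Python's standard unicodedata module, which also shows the limits of this defense:

```python
# NFKC folds many compatibility variants back to ASCII.
import unicodedata

spoofed = "har\u217fful"          # U+217F SMALL ROMAN NUMERAL ONE THOUSAND
normalized = unicodedata.normalize("NFKC", spoofed)

print(normalized == "harmful")    # True: the Roman numeral folds to "m"

# Caveat: Cyrillic і (U+0456) is a distinct letter, not a compatibility
# variant, so it survives NFKC; explicit homoglyph mapping is still needed.
print(unicodedata.normalize("NFKC", "\u0456gnore"))   # still "іgnore"
```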
Implement explicit homoglyph mapping to normalize visually similar characters to their ASCII equivalents before filtering. Libraries exist for this purpose in most programming languages.
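A sketch of explicit mapping with str.translate; the table below is a tiny illustrative subset, and a production system would use a dedicated confusables library:

```python
# Fold a few known homoglyphs to their ASCII look-alikes before filtering.
HOMOGLYPHS = str.maketrans({
    "\u0456": "i",   # Cyrillic і
    "\u0430": "a",   # Cyrillic а
    "\u043e": "o",   # Cyrillic о
    "\u0435": "e",   # Cyrillic е
})

def fold_homoglyphs(text: str) -> str:
    return text.translate(HOMOGLYPHS)

print(fold_homoglyphs("\u0456gnore"))   # "ignore"
```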
Rather than (or in addition to) string-based filters, use an LLM-based filter that operates on token representations. Because these filters process text at the same level as the target model, encoding tricks are less effective — the filter sees the same representation as the model.
Security assessment should include systematic testing of content filters against known encoding variants. If a filter is meant to block “ignore previous instructions,” test whether it also blocks Unicode homoglyphs, zero-width variants, Base64 encoding, and other obfuscation forms.
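A sketch of such a coverage test; is_blocked() is a hypothetical stand-in for whatever filter the system under assessment actually exposes:

```python
# Check one filter against several encoded variants of the same phrase.
import base64

def is_blocked(text: str) -> bool:
    # Hypothetical stand-in for the filter under test.
    return "ignore previous instructions" in text.lower()

phrase = "ignore previous instructions"
variants = {
    "plain":      phrase,
    "homoglyph":  phrase.replace("i", "\u0456", 1),   # Cyrillic і
    "zero-width": "\u200b".join(phrase),
    "base64":     base64.b64encode(phrase.encode()).decode(),
}

for name, text in variants.items():
    print(f"{name:10s} blocked={is_blocked(text)}")
# Only "plain" is caught here; every uncaught variant is a coverage gap.
```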
Log a human-readable rendering of normalized inputs alongside the raw input. Discrepancies between the two can surface encoding attacks during incident review.
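A sketch of that logging pattern, assuming NFKC as the normalization step:

```python
# Log raw and normalized input side by side; flag any discrepancy.
import logging
import unicodedata

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("input-audit")

def audit_input(raw: str) -> str:
    normalized = unicodedata.normalize("NFKC", raw)
    log.info("raw=%r normalized=%r", raw, normalized)
    if raw != normalized:
        log.warning("input changed under normalization: possible encoding attack")
    return normalized

audit_input("har\u217fful")   # logs a warning: U+217F folds to "m"
```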
Token smuggling is an attack technique that exploits differences between human-readable text and LLM tokenizer representations. Attackers encode malicious instructions using character variations, Unicode tricks, or unusual formatting so that content filters don't detect them, but the LLM's tokenizer still processes them as intended.
Content filters often operate on human-readable text — checking for specific strings, patterns, or keywords. LLM tokenizers, however, process text at a lower level and may map visually different characters to the same or similar tokens. This gap allows attackers to craft text that reads one way to a filter and is processed differently by the tokenizer.
Defenses include: normalizing input text before filtering (Unicode normalization, homoglyph replacement), using LLM-based content filters that operate on token-level representations rather than raw text, testing filters against known encoding variants, and conducting security assessments that include encoding-based attack scenarios.
Token smuggling and encoding attacks bypass surface-level filters. We test for these techniques in every chatbot security assessment.
