Metadata Encoding
This guide explains how EncypherAI embeds metadata into text using Unicode variation selectors, providing a detailed overview of the encoding process, available options, and best practices.
How Metadata Encoding Works
EncypherAI embeds metadata into text using Unicode variation selectors (VS), which are special characters designed to modify the appearance of the preceding character. These selectors are typically invisible or have minimal visual impact, making them ideal for embedding metadata without changing the visible appearance of the text.
The Encoding Process
- Metadata Preparation: The metadata (a JSON-serializable dictionary) is converted to a binary format
- HMAC Generation: A cryptographic signature is created to ensure data integrity
- Target Selection: Suitable characters in the text are identified as targets for embedding
- Embedding: Variation selectors representing the metadata and HMAC are inserted after target characters
Metadata Targets
EncypherAI provides several options for where to embed metadata in text:
Target Type | Description | Example |
---|---|---|
whitespace |
After spaces, tabs, and newlines | "Hello world" → "Hello␣world" (VS after space) |
punctuation |
After punctuation marks | "Hello, world!" → "Hello,␣world!" (VS after comma) |
first_letter |
After the first letter of each word | "Hello world" → "H␣ello w␣orld" (VS after H and w) |
last_letter |
After the last letter of each word | "Hello world" → "Hello␣ world␣" (VS after o and d) |
all_characters |
After any character | "Hello" → "H␣e␣l␣l␣o␣" (VS after each letter) |
The default target is whitespace
, which provides a good balance between robustness and minimal impact on the text.
Basic Usage
from encypher.core.metadata_encoder import MetadataEncoder
import time
# Create a metadata encoder
encoder = MetadataEncoder()
# Sample text
text = "This is a sample text that will have metadata embedded within it."
# Create metadata
metadata = {
"model_id": "gpt-4",
"organization": "EncypherAI",
"timestamp": int(time.time()), # Unix/Epoch timestamp
"version": "1.0.0"
}
# Embed metadata
encoded_text = encoder.encode_metadata(text, metadata)
print("Original text:")
print(text)
print("\nEncoded text (looks identical but contains embedded metadata):")
print(encoded_text)
# Extract metadata
is_valid, extracted_metadata = encoder.extract_metadata(encoded_text)
print("\nExtracted metadata:")
print(extracted_metadata)
# Verify the text hasn't been tampered with
is_valid, extracted_metadata, clean_text = encoder.verify_text(encoded_text)
print(f"\nVerification result: {'✅ Verified' if is_valid else '❌ Failed'}")
HMAC Verification
EncypherAI uses HMAC (Hash-based Message Authentication Code) to ensure data integrity and detect tampering. When metadata is embedded, an HMAC signature is created using:
- The metadata content
- A secret key (either provided or randomly generated)
This signature is embedded alongside the metadata. When extracting metadata, the HMAC is verified to ensure the content hasn't been modified.
Using a Custom Secret Key
from encypher.core.metadata_encoder import MetadataEncoder
# Create a metadata encoder with a custom secret key
encoder = MetadataEncoder(secret_key="your-secret-key")
# Embed metadata
encoded_text = encoder.encode_metadata(text, metadata)
# Verify using the same secret key
is_valid, extracted_metadata, clean_text = encoder.verify_text(encoded_text)
Verification Process
The verification process involves:
- Extracting the embedded metadata and HMAC
- Recalculating the HMAC using the extracted metadata and the secret key
- Comparing the recalculated HMAC with the embedded HMAC
If they match, the verification succeeds, indicating the content hasn't been tampered with.
Advanced Configuration
Custom Target Selection
You can specify where to embed metadata using either a string or the MetadataTarget
enum:
from encypher.core.metadata_encoder import MetadataEncoder
from encypher.core.unicode_metadata import MetadataTarget
# Using a string
encoder1 = MetadataEncoder()
encoded_text1 = encoder1.encode_metadata(text, metadata, target="punctuation")
# Using the enum
encoder2 = MetadataEncoder()
encoded_text2 = encoder2.encode_metadata(text, metadata, target=MetadataTarget.PUNCTUATION)
Metadata Size Considerations
The amount of metadata you can embed depends on:
- The length of the text
- The number of suitable targets in the text
- The chosen target type
Each byte of metadata requires one suitable target character. If there aren't enough targets, an error will be raised.
# Check if text has enough targets for metadata
from encypher.core.unicode_metadata import UnicodeMetadata, MetadataTarget
import json
# Estimate metadata size (in bytes)
metadata_json = json.dumps(metadata).encode('utf-8')
metadata_size = len(metadata_json)
# Find available targets
targets = UnicodeMetadata.find_targets(text, MetadataTarget.WHITESPACE)
available_targets = len(targets)
print(f"Metadata size: {metadata_size} bytes")
print(f"Available targets: {available_targets}")
print(f"Sufficient targets: {'Yes' if available_targets >= metadata_size else 'No'}")
Handling Target Limitations
If you have limited targets but need to embed larger metadata:
- Use a more inclusive target type:
all_characters
provides the most targets - Reduce metadata size: Include only essential information
- Increase text length: Add more content to provide more targets
- Compress metadata: Use shorter field names or compress values
Comparing Target Types
Different target types offer different trade-offs between capacity and robustness:
Target Type | Capacity | Robustness | Visual Impact | Use Case |
---|---|---|---|---|
whitespace |
Medium | High | Minimal | General purpose |
punctuation |
Low-Medium | High | Minimal | Short texts with punctuation |
first_letter |
Medium | Medium | Minimal | Texts with many words |
last_letter |
Medium | Medium | Minimal | Texts with many words |
all_characters |
High | Low | Minimal | Maximum capacity needed |
from encypher.core.metadata_encoder import MetadataEncoder
from encypher.core.unicode_metadata import UnicodeMetadata, MetadataTarget
import pandas as pd
import time
# Sample text
text = "This is a sample text that will have metadata embedded within it."
# Create metadata
metadata = {
"model_id": "gpt-4",
"timestamp": int(time.time()), # Unix/Epoch timestamp
"version": "1.0.0"
}
# Test different targets
results = []
for target in [t for t in MetadataTarget]:
# Count available targets
targets = UnicodeMetadata.find_targets(text, target)
# Try to embed metadata
encoder = MetadataEncoder()
try:
encoded = encoder.encode_metadata(text, metadata, target=target)
success = True
except ValueError:
success = False
results.append({
"Target": target.value,
"Available Targets": len(targets),
"Embedding Successful": "Yes" if success else "No",
"Encoded Length": len(encoded) if success else None,
"Added Characters": len(encoded) - len(text) if success else None
})
# Display results
pd.DataFrame(results)
Tamper Detection
EncypherAI's HMAC verification can detect various types of tampering:
Example: Detecting Modified Text
from encypher.core.metadata_encoder import MetadataEncoder
# Create a metadata encoder
encoder = MetadataEncoder()
# Embed metadata
encoded_text = encoder.encode_metadata(text, metadata)
# Simulate tampering by modifying the text
tampered_text = encoded_text.replace("sample", "modified")
# Try to verify the tampered text
is_valid, extracted_metadata, clean_text = encoder.verify_text(tampered_text)
print(f"Verification result: {'✅ Verified' if is_valid else '❌ Failed'}")
Example: Detecting Removed Metadata
import re
# Simulate tampering by removing variation selectors
tampered_text = re.sub(r'[\uFE00-\uFE0F\U000E0100-\U000E01EF]', '', encoded_text)
# Try to extract metadata
try:
is_valid, extracted = encoder.extract_metadata(tampered_text)
print("Metadata found:", extracted)
except Exception as e:
print("Metadata extraction failed:", str(e))
Best Practices
- Choose appropriate targets: Use
whitespace
for general purposes,all_characters
for maximum capacity - Limit metadata size: Include only necessary information to minimize embedding overhead
- Use consistent secret keys: Store and reuse secret keys for verification across systems
- Handle extraction errors: Implement proper error handling for cases where metadata is missing or corrupted
- Verify before trusting: Always verify the HMAC before trusting extracted metadata
- Test with your content: Different content types may require different target strategies
Implementation Details
Unicode Variation Selectors
Unicode defines two ranges of variation selectors:
- VS1-VS16: U+FE00 to U+FE0F (16 selectors)
- VS17-VS256: U+E0100 to U+E01EF (240 selectors)
EncypherAI uses both ranges to encode a full byte (0-255).
Metadata Format
The embedded metadata follows this structure:
- Header: Identifies the metadata format and version
- Metadata Length: The size of the metadata in bytes
- Metadata Content: The JSON-serialized metadata
- HMAC: The cryptographic signature for verification
Encoding Algorithm
- Convert metadata to JSON
- Calculate HMAC using the metadata and secret key
- Find suitable targets in the text
- Convert each byte of the header, metadata, and HMAC to variation selectors
- Insert variation selectors after target characters
Decoding Algorithm
- Scan the text for variation selectors
- Convert variation selectors back to bytes
- Parse the header to identify the format and version
- Extract the metadata content and HMAC
- Verify the HMAC using the extracted metadata and secret key