UnicodeMetadata
The UnicodeMetadata
class provides low-level utilities for embedding and extracting metadata using Unicode variation selectors. This class is primarily used internally by the MetadataEncoder
but can be used directly for advanced customization.
Overview
Unicode variation selectors are special characters that modify the appearance of the preceding character. EncypherAI uses a specific range of these selectors (VS1-VS256) to encode binary data within text without changing its visible appearance.
The UnicodeMetadata
class provides methods to:
- Convert binary data to and from variation selectors
- Find suitable targets in text for embedding metadata
- Embed and extract metadata at the character level
Class Definition
class UnicodeMetadata:
"""
Utilities for embedding and extracting metadata using Unicode variation selectors.
"""
# Unicode variation selector range (VS1-VS256)
VARIATION_SELECTOR_START = 0xFE00 # VS1
VARIATION_SELECTOR_END = 0xFE0F # VS16 (first range)
VARIATION_SELECTOR_START_2 = 0xE0100 # VS17
VARIATION_SELECTOR_END_2 = 0xE01EF # VS256 (second range)
Class Methods
to_variation_selector
@classmethod
def to_variation_selector(cls, byte: int) -> Optional[str]:
"""
Convert a byte value (0-255) to a Unicode variation selector.
Args:
byte: Integer value between 0-255
Returns:
Unicode variation selector character or None if the byte is out of range
"""
from_variation_selector
@classmethod
def from_variation_selector(cls, char: str) -> Optional[int]:
"""
Convert a Unicode variation selector to its byte value.
Args:
char: Unicode variation selector character
Returns:
Integer value between 0-255 or None if the character is not a variation selector
"""
is_variation_selector
@classmethod
def is_variation_selector(cls, char: str) -> bool:
"""
Check if a character is a Unicode variation selector.
Args:
char: Character to check
Returns:
True if the character is a variation selector, False otherwise
"""
find_targets
@classmethod
def find_targets(
cls,
text: str,
target: Union[str, MetadataTarget] = "whitespace"
) -> List[int]:
"""
Find suitable target positions in text for embedding metadata.
Args:
text: The text to analyze
target: Where to embed metadata. Can be a string ("whitespace", "punctuation",
"first_letter", "last_letter", "all_characters") or a MetadataTarget enum.
Returns:
List of character indices that are suitable for embedding metadata
"""
embed_bytes
@classmethod
def embed_bytes(
cls,
text: str,
data: bytes,
target: Union[str, MetadataTarget] = "whitespace"
) -> str:
"""
Embed binary data into text using Unicode variation selectors.
Args:
text: The text to embed data into
data: Binary data to embed
target: Where to embed data. Can be a string ("whitespace", "punctuation",
"first_letter", "last_letter", "all_characters") or a MetadataTarget enum.
Returns:
Text with embedded data
Raises:
ValueError: If there are not enough targets to embed all data
"""
extract_bytes
@classmethod
def extract_bytes(cls, text: str) -> bytes:
"""
Extract binary data embedded in text using Unicode variation selectors.
Args:
text: Text with embedded data
Returns:
Extracted binary data
"""
embed_metadata
@classmethod
def embed_metadata(
cls,
text: str,
model_id: str,
timestamp: int,
custom_metadata: Dict[str, Any] = {},
hmac_secret_key: Optional[str] = None
) -> Tuple[str, bool]:
"""
Embed metadata into text using Unicode variation selectors.
Args:
text: The text to embed metadata into
model_id: Model identifier
timestamp: Unix/Epoch timestamp
custom_metadata: Additional metadata to embed
hmac_secret_key: Optional HMAC secret key for verification
Returns:
Tuple containing the encoded text and a boolean indicating whether the embedding was successful
"""
extract_metadata
@classmethod
def extract_metadata(cls, text: str) -> Tuple[Dict[str, Any], bool]:
"""
Extract metadata embedded in text using Unicode variation selectors.
Args:
text: Text with embedded metadata
Returns:
Tuple containing the extracted metadata and a boolean indicating whether the extraction was successful
"""
MetadataTarget Enum
The MetadataTarget
enum defines the possible targets for embedding metadata:
class MetadataTarget(Enum):
"""
Enum for specifying where to embed metadata in text.
"""
WHITESPACE = "whitespace"
PUNCTUATION = "punctuation"
FIRST_LETTER = "first_letter"
LAST_LETTER = "last_letter"
ALL_CHARACTERS = "all_characters"
Usage Example
from encypher.core.unicode_metadata import UnicodeMetadata, MetadataTarget
import time
# Sample text
text = "This is a sample text for embedding metadata."
# Find suitable targets for embedding
whitespace_targets = UnicodeMetadata.find_targets(text, MetadataTarget.WHITESPACE)
punctuation_targets = UnicodeMetadata.find_targets(text, MetadataTarget.PUNCTUATION)
print(f"Whitespace targets: {len(whitespace_targets)} positions")
print(f"Punctuation targets: {len(punctuation_targets)} positions")
# Embed metadata
metadata = {
"model_id": "gpt-4",
"timestamp": int(time.time()), # Unix/Epoch timestamp
"version": "1.0.0"
}
encoded_text = UnicodeMetadata.embed_metadata(
text=text,
model_id=metadata["model_id"],
timestamp=metadata["timestamp"],
custom_metadata={"version": metadata["version"]},
hmac_secret_key="your-secret-key" # Optional: Only needed for HMAC verification
)
print("\nOriginal text:")
print(text)
print("\nEncoded text (looks identical but contains embedded data):")
print(encoded_text)
# Extract the metadata
extracted_metadata = UnicodeMetadata.extract_metadata(encoded_text)
print(f"\nExtracted metadata: {extracted_metadata}")
# If you used an HMAC secret key and want to verify the metadata
from encypher.core.metadata_encoder import MetadataEncoder
encoder = MetadataEncoder(hmac_secret_key="your-secret-key")
metadata_dict, is_verified = encoder.extract_verified_metadata(encoded_text)
print(f"\nVerified: {is_verified}")
# Demonstrate variation selector conversion
byte_value = 65 # ASCII 'A'
vs_char = UnicodeMetadata.to_variation_selector(byte_value)
print(f"\nByte value {byte_value} converted to variation selector: U+{ord(vs_char):04X}")
back_to_byte = UnicodeMetadata.from_variation_selector(vs_char)
print(f"Variation selector converted back to byte: {back_to_byte}")
Implementation Details
Variation Selector Ranges
Unicode defines two ranges of variation selectors:
- VS1-VS16: Code points U+FE00 to U+FE0F
- VS17-VS256: Code points U+E0100 to U+E01EF
EncypherAI uses these ranges to encode binary data, allowing for up to 256 different values (0-255) to be encoded.
Embedding Process
When embedding metadata, the UnicodeMetadata
class:
- Converts the metadata to a JSON string
- Compresses the JSON string using zlib
- Finds suitable targets in the text based on the specified target type
- Embeds each byte of the compressed data as a variation selector after a suitable target character
Extraction Process
When extracting metadata, the UnicodeMetadata
class:
- Scans the text for variation selectors
- Converts each variation selector back to its byte value
- Reconstructs the compressed data
- Decompresses the data using zlib
- Parses the JSON string to recover the original metadata
Advanced: Embedding Timestamps
The UnicodeMetadata
class provides special handling for timestamps:
from encypher.core.unicode_metadata import UnicodeMetadata
import time
# Sample text
text = "This is a sample text for embedding metadata with timestamps."
# Create metadata with a timestamp
metadata = {
"model_id": "gpt-4",
"timestamp": int(time.time()), # Unix/Epoch timestamp
"version": "1.0.0"
}
# Embed metadata
is_valid, encoded_text = UnicodeMetadata.embed_metadata(
text=text,
model_id=metadata["model_id"],
timestamp=metadata["timestamp"],
custom_metadata={"version": metadata["version"]}
)
print(f"Embedding successful: {is_valid}")
print(f"Encoded text: {encoded_text}")
# Extract metadata
is_valid, extracted_metadata = UnicodeMetadata.extract_metadata(encoded_text)
print(f"Extraction successful: {is_valid}")
print(f"Extracted metadata: {extracted_metadata}")
Related Classes
MetadataEncoder
: Higher-level interface for embedding and extracting metadataStreamingMetadataEncoder
: For handling streaming content