Content Hash Coverage and Embedding Technical Details
This document explains how EncypherAI's C2PA text embedding approach works, focusing on what the content hash covers and how the embedding mechanism operates.
What the Content Hash Covers
The content hash in our implementation covers the plain text content of the article - specifically:
Text Extraction Process
- The code extracts all paragraph text from the article
- It looks for paragraphs in content columns first, then falls back to direct paragraph search
- All paragraph texts are joined with double newlines ("\n\n")
- This extracted plain text is saved to clean_text_for_hashing.txt as a reference
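The extraction steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's extraction code: the names ParagraphCollector and extract_clean_text are hypothetical, and the sketch skips the "content columns first" lookup and simply collects every <p> element it finds.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self._in_paragraph = False
            text = "".join(self._chunks).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)

    def handle_data(self, data):
        # Only text inside a paragraph contributes; markup is dropped.
        if self._in_paragraph:
            self._chunks.append(data)


def extract_clean_text(html):
    """Return all paragraph texts joined with double newlines."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return "\n\n".join(collector.paragraphs)
```

Inline markup such as <b> inside a paragraph is discarded while its text is kept, which matches the statement that only plain text is hashed.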
Hash Generation
- A SHA-256 hash is calculated on this extracted text
- The hash is computed on the UTF-8 encoded version of the text
- This happens before any metadata embedding occurs
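As a small sketch of the hashing step, assuming the extracted text is already available as a Python string (the function name compute_content_hash is illustrative, and returning a hex string rather than raw bytes is an assumption):

```python
import hashlib


def compute_content_hash(clean_text):
    """SHA-256 over the UTF-8 bytes of the extracted plain text."""
    # Encoding explicitly matters: the same string hashed under a
    # different encoding would produce a different digest.
    return hashlib.sha256(clean_text.encode("utf-8")).hexdigest()
```

Because the hash is taken before embedding, the input here must be the text without any variation selectors.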
Hash Usage
- The hash is included in the manifest as a stds.c2pa.content.hash assertion
- This assertion includes both the hash value and the algorithm used (sha256)
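One possible shape for such an assertion is sketched below. Only the label and the presence of a hash value and algorithm come from the description above; the field names "label", "data", "hash" and "alg", and the function name, are assumptions made for illustration.

```python
def build_content_hash_assertion(hash_hex, alg="sha256"):
    """Build an assertion entry for the manifest (field names are assumed)."""
    return {
        "label": "stds.c2pa.content.hash",
        "data": {
            "hash": hash_hex,  # digest of the clean text
            "alg": alg,        # algorithm used to produce the digest
        },
    }
```

The resulting dictionary would then be added to the manifest's list of assertions before the manifest is signed.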
Important Distinction
- The hash covers only the plain text content, not the HTML markup
- The hash does not include the embedded metadata itself (the Unicode variation selectors)
- This creates a "snapshot" of the original content at the time of signing
This approach allows for tamper detection - if the text content is modified after embedding, the hash of the current content will no longer match the hash stored in the embedded manifest.
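A rough sketch of this check follows. Since the hash excludes the embedded selectors, the sketch removes them before re-hashing. The helper names are hypothetical, and stripping every variation selector is a simplifying assumption: text that legitimately contains variation selectors (for example in emoji sequences) would need more careful handling.

```python
import hashlib


def is_variation_selector(ch):
    """True for code points in U+FE00-FE0F or U+E0100-E01EF."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def strip_variation_selectors(text):
    """Remove embedded selectors to recover the plain text."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


def content_matches(text, stored_hash_hex):
    """Compare a fresh SHA-256 of the clean text with the stored hash."""
    clean = strip_variation_selectors(text)
    fresh = hashlib.sha256(clean.encode("utf-8")).hexdigest()
    return fresh == stored_hash_hex


original = "Hello world"
stored = hashlib.sha256(original.encode("utf-8")).hexdigest()
# Two selectors placed after the first space stand in for a real manifest.
embedded = "Hello " + "\ufe01\U000e0100" + "world"
unchanged = content_matches(embedded, stored)
tampered = content_matches(embedded.replace("world", "there"), stored)
```

Here unchanged is True because only invisible selectors were added, while tampered is False because the visible text changed.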
How Our C2PA-like Embedding Actually Works
Single-Point Embedding with Zero-Width Characters
- The metadata (manifest) is embedded as a sequence of Unicode variation selectors
- These are zero-width, non-printing characters (code points in ranges U+FE00-FE0F and U+E0100-E01EF)
- All metadata is attached to a single character in the text (by default, the first whitespace)
- The original character is preserved, and the variation selectors are inserted immediately after it
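The two ranges together contain 16 + 240 = 256 code points, so each byte value can be mapped to exactly one selector. The sketch below shows one such mapping and an insertion after the first whitespace character. It is an assumption about how the encoding could work, not the library's internal helper, and all function names are illustrative.

```python
def byte_to_selector(b):
    """Map a byte (0-255) to a variation selector."""
    if b < 16:
        return chr(0xFE00 + b)        # U+FE00..U+FE0F
    return chr(0xE0100 + (b - 16))    # U+E0100..U+E01EF


def selector_to_byte(ch):
    """Inverse mapping; returns None for non-selector characters."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def embed_after_first_whitespace(text, payload):
    """Insert the encoded payload right after the first whitespace."""
    selectors = "".join(byte_to_selector(b) for b in payload)
    for i, ch in enumerate(text):
        if ch.isspace():
            return text[: i + 1] + selectors + text[i + 1 :]
    # No whitespace found: fall back to after the first character.
    return text[:1] + selectors + text[1:]


def extract_payload(text):
    """Collect the bytes encoded by all selectors in the text."""
    decoded = (selector_to_byte(ch) for ch in text)
    return bytes(b for b in decoded if b is not None)
```

The target character itself is left untouched; the selectors only follow it, so the visible text is unchanged in renderers that ignore unassigned variation sequences.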
This is Still Hard Binding Because
- The manifest is directly embedded within the content itself
- The manifest travels with the content as part of the same file
- The binding is inseparable from the content
Not a Hybrid Approach Because
In a true hybrid approach, you would have:
1. A manifest stored separately from the content (soft binding component)
2. A small reference embedded in the content pointing to the external manifest (hard binding component)
Our implementation embeds the entire manifest directly in the content. The content hash we include is just an assertion within the hard-bound manifest.
Implementation Details
Embedding Process
import base64
import json

import cbor2  # third-party CBOR library


def embed_metadata(text, metadata, metadata_format="json"):
    """
    Embeds metadata into text using Unicode variation selectors.

    Args:
        text (str): The text to embed metadata into
        metadata (dict or bytes): The metadata to embed
        metadata_format (str): Format of the metadata ("json" or "cbor_manifest")

    Returns:
        str: Text with embedded metadata
    """
    # Serialize metadata based on format
    if metadata_format == "json":
        serialized = json.dumps(metadata).encode("utf-8")
    elif metadata_format == "cbor_manifest":
        if isinstance(metadata, dict):
            serialized = cbor2.dumps(metadata)
        else:
            serialized = metadata  # Already serialized
    else:
        raise ValueError(f"Unsupported metadata format: {metadata_format}")

    # Base64-encode the serialized bytes, then map the resulting ASCII string
    # to variation selectors (_encode_to_variation_selectors is not shown here)
    base64_text = base64.b64encode(serialized).decode("ascii")
    encoded_metadata = _encode_to_variation_selectors(base64_text)

    # Insert immediately after the first character
    # (this listing does not search for whitespace)
    if len(text) > 0:
        return text[0] + encoded_metadata + text[1:]
    else:
        return encoded_metadata
Extraction Process
import base64
import json

import cbor2  # third-party CBOR library


def extract_metadata(text, metadata_format="json"):
    """
    Extracts metadata from text with embedded Unicode variation selectors.

    Args:
        text (str): Text with embedded metadata
        metadata_format (str): Format of the metadata ("json" or "cbor_manifest")

    Returns:
        dict or bytes: Extracted metadata, or None if no variation
        selectors are found
    """
    # Collect every variation selector found in the text
    encoded_data = "".join(
        char
        for char in text
        if 0xFE00 <= ord(char) <= 0xFE0F or 0xE0100 <= ord(char) <= 0xE01EF
    )
    if not encoded_data:
        return None

    # Decode the selectors back to the Base64 string, then to raw bytes
    # (_decode_from_variation_selectors is not shown here)
    base64_text = _decode_from_variation_selectors(encoded_data)
    serialized = base64.b64decode(base64_text)

    # Deserialize based on format
    if metadata_format == "json":
        return json.loads(serialized.decode("utf-8"))
    elif metadata_format == "cbor_manifest":
        return cbor2.loads(serialized)
    else:
        raise ValueError(f"Unsupported metadata format: {metadata_format}")
Verification Process
The verification process involves two key steps:
- Signature Verification: Ensures the manifest itself hasn't been tampered with
  - Extracts the embedded metadata using Unicode variation selectors
  - Verifies the digital signature using the provided public key
  - If the signature is invalid, verification fails immediately
- Content Hash Verification: Ensures the text content hasn't been modified
  - Extracts the stored content hash from the manifest
  - Calculates a fresh hash of the current plain text (excluding the embedded variation selectors) using the same algorithm
  - Compares the stored hash with the freshly calculated hash
  - If they don't match, the content has been tampered with
This two-step verification process provides comprehensive tamper detection for both the manifest and the content it describes.
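The ordering of the two steps can be sketched as follows. This is an illustrative outline under several assumptions: the manifest layout ("assertions", "data", "hash", "alg") is hypothetical, the signature check is passed in as a callable rather than tied to a specific cryptographic library, and clean_text is expected to be the current text with variation selectors already removed.

```python
import hashlib

CONTENT_HASH_LABEL = "stds.c2pa.content.hash"


def verify_manifest_and_content(manifest, signature, clean_text, verify_signature):
    """Return (ok, reason). verify_signature(manifest, signature) -> bool
    is a caller-supplied check against the signer's public key."""
    # Step 1: signature verification; fail immediately if invalid.
    if not verify_signature(manifest, signature):
        return False, "invalid signature"

    # Step 2: content hash verification.
    stored = None
    for assertion in manifest.get("assertions", []):
        if assertion.get("label") == CONTENT_HASH_LABEL:
            stored = assertion.get("data", {})
            break
    if stored is None:
        return False, "no content hash assertion"
    if stored.get("alg") != "sha256":
        return False, "unsupported hash algorithm"

    fresh = hashlib.sha256(clean_text.encode("utf-8")).hexdigest()
    if fresh != stored.get("hash"):
        return False, "content hash mismatch"
    return True, "ok"
```

Checking the signature first means the stored hash is only trusted once the manifest carrying it has been shown to be intact.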
Advantages of This Approach
- Invisibility: The embedding doesn't visibly alter the text appearance
- Portability: The metadata travels with the content
- Robustness: Works across different text formats and platforms
- Standards Alignment: Compatible with C2PA concepts and structures
- Tamper Detection: Provides comprehensive verification of both metadata and content integrity