A Python package for embedding and extracting metadata in text using Unicode variation selectors without affecting readability.
EncypherAI Core provides tools for invisibly encoding metadata (such as model information, timestamps, and custom data) into text generated by AI models. This enables:
- Provenance tracking: Identify which AI model generated a piece of text
- Timestamp verification: Know when text was generated
- C2PA-Compatible Manifests: Embed manifests inspired by the C2PA standard, with support for CBOR encoding for maximum interoperability.
- Custom metadata: Embed any additional information you need
- Tamper detection: Verify text integrity using digital signatures
- Streaming support: Works with both streaming and non-streaming LLM outputs
- LLM integrations: Ready-to-use integrations with OpenAI, Google Gemini, Anthropic Claude, and more
- Modular architecture: Clean separation of key management, payload handling, and signing operations
The encoding is done using Unicode variation selectors, which are designed to specify alternative forms of characters without affecting text appearance or readability.
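To make the idea concrete, here is a minimal sketch of how arbitrary bytes could be mapped onto variation selectors. It is an illustration only, not EncypherAI's actual encoding: the function names are hypothetical, and using the Variation Selectors Supplement (U+E0100 to U+E01EF) for byte values 16 to 255 is an assumption made so that every byte value has a selector.

```python
from typing import Optional


def byte_to_selector(b: int) -> str:
    """Map one byte (0-255) to a variation selector code point (assumed scheme)."""
    if not 0 <= b <= 255:
        raise ValueError("byte out of range")
    if b < 16:
        return chr(0xFE00 + b)        # U+FE00..U+FE0F
    return chr(0xE0100 + (b - 16))    # U+E0100..U+E01EF


def selector_to_byte(ch: str) -> Optional[int]:
    """Inverse mapping; returns None for characters that are not selectors."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def hide(carrier: str, payload: bytes) -> str:
    """Append the payload to the carrier text as invisible selectors."""
    return carrier + "".join(byte_to_selector(b) for b in payload)


def reveal(text: str) -> bytes:
    """Collect every selector found in the text and decode it back to bytes."""
    decoded = (selector_to_byte(ch) for ch in text)
    return bytes(b for b in decoded if b is not None)


if __name__ == "__main__":
    stamped = hide("Hello", b'{"model_id": "example"}')
    print(len("Hello"), len(stamped))  # the string grows, the rendering should not
    print(reveal(stamped))
```

A real payload would also be signed before embedding; this sketch covers only the byte-to-character mapping.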
EncypherAI's manifest format is inspired by the Coalition for Content Provenance and Authenticity (C2PA) standard, adapted specifically for plain-text environments. While C2PA focuses on embedding provenance information in rich media file formats, EncypherAI extends these concepts to text-only content where traditional file embedding methods aren't applicable.
Key alignments include:
- Structured provenance manifests with claim generators and actions
- Cryptographic integrity through digital signatures
- Content hash verification for tamper detection
- CBOR Manifests: Support for embedding full C2PA-compliant manifests using CBOR for a compact, standards-aligned format
- Hard binding approach: Direct embedding of manifests into the content itself
- Shared mission of improving content transparency and trust
- Spec version: Currently aligned with the C2PA 2.2 specification
Our implementation uses Unicode variation selectors (U+FE00 to U+FE0F) to invisibly embed C2PA manifests directly into text content, enabling provenance tracking and tamper detection without altering the visible appearance of the text.
Learn more about EncypherAI's relationship with C2PA in our documentation.
EncypherAI seamlessly integrates with popular LLM providers:
- OpenAI: GPT-3.5, GPT-4o, and other OpenAI models
- Google Gemini: Gemini 2.0 Flash, Pro, and other Gemini models
- Anthropic Claude: Claude 3 Opus, Sonnet, Haiku, and other Claude models
- LiteLLM: For unified access to multiple LLM providers
Check our documentation for detailed integration examples and code snippets for each provider.
Watch our demo video to see EncypherAI in action, demonstrating how to embed and verify metadata in AI-generated content.
Try EncypherAI directly in your browser with our interactive Google Colab notebook. The notebook demonstrates all the core features including metadata embedding, extraction, digital signature verification, and tampering detection.
For a local demonstration, check out the detailed Jupyter Notebook example included in the repository:
encypher/examples/encypher_v2_demo.ipynb
This notebook covers key generation, basic and manifest format usage, and tamper detection using the latest version (v2.2.0+).
First, install the uv package manager if you don't have it already:
# Install uv (recommended)
pip install uv
# Then install EncypherAI
uv pip install encypher-ai
Note: Digital signatures require managing a private/public key pair. You can use the helper script encypher/examples/generate_keys.py to create your first key pair and get setup instructions, or generate keys programmatically as shown below.
from encypher.core.unicode_metadata import UnicodeMetadata
from encypher.core.keys import generate_ed25519_key_pair # Updated to specific key type
from cryptography.hazmat.primitives.asymmetric.types import Ed25519PublicKey, Ed25519PrivateKey
from typing import Optional, Dict, Union # Added Union
import time
from encypher.core.payloads import BasicPayload, ManifestPayload # For type hinting verified_payload
# --- Key Management (Replace with your actual key management) ---
# Generate a new Ed25519 key pair
private_key: Ed25519PrivateKey
public_key: Ed25519PublicKey
private_key, public_key = generate_ed25519_key_pair()
signer_id_example = "readme-signer-001" # Using signer_id
# Store public keys (e.g., in a database or secure store)
public_keys_store: Dict[str, Ed25519PublicKey] = { signer_id_example: public_key }
# Create a provider function to look up public keys by ID
def public_key_provider(signer_id: str) -> Optional[Ed25519PublicKey]:
    return public_keys_store.get(signer_id)
# -----------------------------------------------------------------
# Core information for embedding
current_timestamp = int(time.time()) # Current Unix timestamp (seconds since epoch)
# Custom metadata payload (user-defined data)
custom_payload = {
"model_id": "gpt-4o-2024-05-13",
"source_script": "README_quickstart",
"user_defined_version": "2.3.0" # Updated version
}
# Embed metadata and sign
# The 'metadata_format' and 'version' (EncypherAI spec version) parameters for embed_metadata
# default to "basic" and the latest spec version respectively.
encoded_text = UnicodeMetadata.embed_metadata(
text="This is a sample text generated by an AI model.",
private_key=private_key, # Private key for signing
signer_id=signer_id_example, # Identifier for the key pair
timestamp=current_timestamp, # Integer Unix timestamp
custom_metadata=custom_payload, # Your arbitrary metadata
omit_keys=["user_id", "session_id"], # Example of redacting fields
)
# Extract metadata (without verification - returns the raw payload if successful)
# This is useful for quick inspection but does not guarantee authenticity or integrity.
extracted_unverified_payload = UnicodeMetadata.extract_metadata(encoded_text)
print(f"Extracted (unverified) payload: {extracted_unverified_payload}")
# Verify the signature and extract metadata using the public key provider
# This is the recommended way to get trusted metadata.
is_valid: bool
extracted_signer_id: Optional[str]
verified_payload: Union[BasicPayload, ManifestPayload, None] # Type hint for clarity
is_valid, extracted_signer_id, verified_payload = UnicodeMetadata.verify_metadata(
text=encoded_text,
public_key_provider=public_key_provider
)
print(f"\nSignature valid: {is_valid}")
if is_valid and verified_payload:
    print(f"Verified Signer ID: {extracted_signer_id}")
    print(f"Verified Timestamp: {verified_payload.timestamp}")
    print(f"Verified Custom Metadata: {verified_payload.custom_metadata}")
    print(f"Verified Format: {verified_payload.format}")
    print(f"Verified EncypherAI Spec Version: {verified_payload.version}")
else:
    print("Metadata could not be verified or extracted.")
from encypher.streaming.handlers import StreamingHandler
from encypher.core.unicode_metadata import UnicodeMetadata # Added for verification
from encypher.core.keys import generate_ed25519_key_pair # Updated to specific key type
from cryptography.hazmat.primitives.asymmetric.types import Ed25519PublicKey, Ed25519PrivateKey
from typing import Optional, Dict, Union # Added Union
import time
from encypher.core.payloads import BasicPayload, ManifestPayload # For type hinting verified_payload
# --- Assuming key setup from the 'Basic Encoding and Verification' example ---
# Custom metadata for streaming example
stream_timestamp = int(time.time())
stream_custom_payload = {
"model_id": "gpt-4o-2024-05-13",
"source_script": "README_streaming_example",
"user_defined_version": "2.3.0" # Updated version
}
# Create a streaming handler
handler = StreamingHandler(
private_key=private_key,
signer_id=signer_id_example,
timestamp=stream_timestamp,
custom_metadata=stream_custom_payload,
# metadata_format defaults to "basic" (also accepts "manifest", "cbor_manifest", or "jumbf")
# encode_first_chunk_only defaults to True, which is common for streaming
)
chunks = [
"This is ",
"a sample ",
"text generated ",
"by an AI model, delivered in chunks."
]
full_response_from_stream = ""
print("\nSimulating stream output:")
for chunk in chunks:
    processed_chunk = handler.process_chunk(chunk)
    if processed_chunk: # process_chunk might return None if it only buffers
        print(processed_chunk, end="")
        full_response_from_stream += processed_chunk
# Complete the stream (important for final metadata embedding if not all chunks were processed)
final_chunk = handler.finalize()
if final_chunk:
    print(final_chunk, end="")
    full_response_from_stream += final_chunk
print("\n--- End of Stream ---")
# Verify the full streamed text
# For streamed content, hard binding is not added, so we must disable it during verification.
print(f"\nVerifying full streamed text: '{full_response_from_stream[:50]}...' ({len(full_response_from_stream)} chars)")
is_stream_valid: bool
stream_signer_id: Optional[str]
stream_payload: Union[BasicPayload, ManifestPayload, None]
is_stream_valid, stream_signer_id, stream_payload = UnicodeMetadata.verify_metadata(
text=full_response_from_stream,
public_key_provider=public_key_provider, # Using the provider from basic example
require_hard_binding=False # Disable for streaming
)
print(f"\nStream signature valid: {is_stream_valid}")
if is_stream_valid and stream_payload:
    print(f"Stream Verified Signer ID: {stream_signer_id}")
    print(f"Stream Verified Timestamp: {stream_payload.timestamp}")
    print(f"Stream Verified Custom Metadata: {stream_payload.custom_metadata}")
    print(f"Stream Verified Format: {stream_payload.format}")
    print(f"Stream Verified EncypherAI Spec Version: {stream_payload.version}")
else:
    print("Stream metadata could not be verified or extracted.")
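The buffering behavior mentioned in the comments above (process_chunk may return None, and finalize flushes whatever is left) can be pictured with a small stand-in. This is a hypothetical sketch, not the StreamingHandler implementation: the class name, the byte-to-selector mapping, and the rule "attach the whole payload after the first non-whitespace character" are all assumptions chosen to keep the example short.

```python
from typing import Optional


def to_selectors(payload: bytes) -> str:
    """Assumed mapping: bytes 0-15 -> U+FE00.., bytes 16-255 -> U+E0100.."""
    return "".join(
        chr(0xFE00 + b) if b < 16 else chr(0xE0100 + b - 16) for b in payload
    )


class FirstChunkEmbedder:
    """Hypothetical stand-in that embeds a payload once, early in the stream."""

    def __init__(self, payload: bytes) -> None:
        self._selectors = to_selectors(payload)
        self._buffer = ""
        self._embedded = False

    def process_chunk(self, chunk: str) -> Optional[str]:
        if self._embedded:
            return chunk  # pass later chunks through unchanged
        self._buffer += chunk
        for i, ch in enumerate(self._buffer):
            if not ch.isspace():
                # Attach the payload right after the first suitable character.
                out = self._buffer[: i + 1] + self._selectors + self._buffer[i + 1:]
                self._buffer = ""
                self._embedded = True
                return out
        return None  # only whitespace so far: keep buffering

    def finalize(self) -> Optional[str]:
        if self._embedded:
            return None
        # No suitable character ever arrived: append the payload to the buffer.
        out = self._buffer + self._selectors
        self._buffer = ""
        self._embedded = True
        return out or None


if __name__ == "__main__":
    embedder = FirstChunkEmbedder(b"demo")
    for piece in ["  ", "This is ", "a sample"]:
        print(repr(embedder.process_chunk(piece)))
    print(repr(embedder.finalize()))
```

Embedding once near the start keeps every later chunk byte-for-byte identical to what the model produced, which is one plausible reason to encode only the first chunk.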
- Invisible Metadata: Embed metadata in text without affecting its visible appearance or readability.
- Digital Signatures: Cryptographically sign metadata to ensure authenticity and detect tampering.
- Streaming Support: Process and embed metadata in real-time as text is generated or streamed.
- Customizable Metadata: Embed any JSON-serializable information relevant to your application.
- Modular Architecture: Clean separation of key management, payload handling, and signing operations.
EncypherAI includes a command-line interface for quick encoding and decoding tasks.
First, ensure you have generated a key pair. You can use the generate-keys command for this:
# Generate a new Ed25519 key pair, saving public key as my_signer_id.pem
python -m encypher.examples.cli_example generate-keys --output-dir ./keys --signer-id my_signer_id
This will create private_key.pem and keys/my_signer_id.pem (if --output-dir ./keys was used and my_signer_id was the signer ID).
To encode text with metadata:
# Basic encoding
python -m encypher.examples.cli_example encode \
--text "This is a sample text generated by an AI model." \
--private-key ./keys/private_key.pem \
--signer-id my_signer_id
Additional options:
- --output-file: Optional. File to save encoded text; otherwise, prints to stdout.
- --custom-metadata: Optional. A JSON string for your custom data (e.g., '{"key": "value"}').
- --timestamp: Optional. Integer Unix timestamp. Defaults to current time.
- --model-id: Optional. Convenience for adding a model ID to custom metadata.
- --omit-keys: Optional. Space-separated list of metadata keys to omit before signing.
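Quoting a JSON string for --custom-metadata by hand is error-prone, so it can help to build the command programmatically. The sketch below assembles the encode invocation from the flags listed above; the helper name build_encode_command is hypothetical and not part of EncypherAI.

```python
import json
import shlex
from typing import Any, Dict, List, Optional


def build_encode_command(
    text: str,
    private_key: str,
    signer_id: str,
    custom_metadata: Optional[Dict[str, Any]] = None,
    output_file: Optional[str] = None,
) -> List[str]:
    """Return an argv list for the example CLI's encode command."""
    argv = [
        "python", "-m", "encypher.examples.cli_example", "encode",
        "--text", text,
        "--private-key", private_key,
        "--signer-id", signer_id,
    ]
    if custom_metadata is not None:
        # json.dumps produces valid JSON; no manual escaping of quotes needed.
        argv += ["--custom-metadata", json.dumps(custom_metadata)]
    if output_file is not None:
        argv += ["--output-file", output_file]
    return argv


if __name__ == "__main__":
    argv = build_encode_command(
        text="This is a sample text generated by an AI model.",
        private_key="./keys/private_key.pem",
        signer_id="my_signer_id",
        custom_metadata={"key": "value"},
        output_file="encoded.txt",
    )
    # shlex.join shows a copy-pasteable, correctly quoted shell command.
    print(shlex.join(argv))
```

Passing the list to subprocess.run(argv) avoids shell quoting entirely, while shlex.join is handy when the command needs to be pasted into a terminal.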
To extract and verify metadata from text:
# Basic decoding
python -m encypher.examples.cli_example decode \
--text "Text with embedded metadata..." \
--public-key-dir ./keys
Additional options:
- --output-file: Optional. File to save decoded metadata; otherwise, prints to stdout.
- --verify: Optional. Verify the signature (default: True).
To set up the project for development:
# Install uv (if not already installed)
pip install uv
# Install EncypherAI in editable mode for development
uv pip install -e .
If you're upgrading from a previous version, check out our Migration Guide to learn about the new modular structure introduced in version 2.2.0.
EncypherAI supports embedding structured manifests inspired by the C2PA standard, providing robust provenance and tamper detection for text content. This is ideal for tracking the origin and history of AI-generated text.
The library handles the creation of the manifest, including calculating the content hash of the original, un-encoded text, and bundles it into a CBOR-encoded, signed payload.
This example demonstrates the end-to-end process:
# --- Imports and Key Setup (from Quick Start) ---
from encypher.core.unicode_metadata import UnicodeMetadata
from encypher.core.keys import generate_ed25519_key_pair
from cryptography.hazmat.primitives.asymmetric.types import Ed25519PublicKey, Ed25519PrivateKey
from typing import Optional, Dict, Union
import time
from encypher.core.payloads import BasicPayload, ManifestPayload
# Generate a new Ed25519 key pair
private_key: Ed25519PrivateKey
public_key: Ed25519PublicKey
private_key, public_key = generate_ed25519_key_pair()
signer_id_manifest = "manifest-signer-001"
# Store public keys and create a provider function
public_keys_store: Dict[str, Ed25519PublicKey] = { signer_id_manifest: public_key }
def public_key_provider(signer_id: str) -> Optional[Ed25519PublicKey]:
    return public_keys_store.get(signer_id)
# ----------------------------------------------------
# Original text to be signed
original_text = "This is an important statement generated by an AI assistant."
# Embed a C2PA-inspired manifest
# The library automatically calculates the content hash of the original text.
encoded_text_manifest = UnicodeMetadata.embed_metadata(
text=original_text,
private_key=private_key,
signer_id=signer_id_manifest,
timestamp=int(time.time()),
metadata_format="cbor_manifest", # Use the CBOR manifest format ("jumbf" is also supported)
claim_generator="EncypherAI README Example v2.3",
actions=[
{
"action": "c2pa.created",
"when": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
"description": "Text was created by an AI model."
}
],
ai_info={
"model_id": "gpt-4o-2024-05-13",
"prompt": "Write a short, important statement."
}
)
print(f"Text with embedded manifest: '{encoded_text_manifest[:60]}...'")
### Verifying the Manifest and Detecting Tampering
Verification confirms both the signature's authenticity and the text's integrity. Any change to the original text will cause verification to fail.
# 1. Verify the original, unmodified text
is_valid, signer, payload = UnicodeMetadata.verify_metadata(
text=encoded_text_manifest,
public_key_provider=public_key_provider
)
print(f"\nVerification of original text successful: {is_valid}")
if is_valid and payload:
    print(f" - Signer ID: {signer}")
    # The payload is a ManifestPayload object, so we can access its attributes
    if isinstance(payload, ManifestPayload):
        print(f" - Claim Generator: {payload.claim_generator}")
        print(f" - Actions: {payload.actions}")
        # The content hash is stored inside the manifest and checked automatically
        # during verification.
# 2. Attempt to verify tampered text
tampered_text = encoded_text_manifest.replace("important", "unimportant")
is_tampered_valid, _, _ = UnicodeMetadata.verify_metadata(
text=tampered_text,
public_key_provider=public_key_provider
)
print(f"\nVerification of tampered text successful: {is_tampered_valid}") # Expected: False
if not is_tampered_valid:
    print(" - As expected, verification failed, indicating the text was tampered with.")
The content hash in our implementation covers only the plain text content:
- Extracts all paragraph text from the document (ignoring HTML markup)
- Computes a SHA-256 hash on the UTF-8 encoded text
- Calculates the hash before any metadata embedding occurs
- Does not include the embedded metadata itself (Unicode variation selectors)
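The steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the library's code: it works on plain strings only (no HTML extraction), the function names are hypothetical, and treating both U+FE00 to U+FE0F and U+E0100 to U+E01EF as metadata characters is an assumption.

```python
import hashlib


def is_variation_selector(ch: str) -> bool:
    """True for code points in the two variation selector ranges."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def visible_text(text: str) -> str:
    """Drop embedded selectors so only the readable content remains."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


def content_hash(text: str) -> str:
    """SHA-256 over the UTF-8 bytes of the visible text, as a hex digest."""
    return hashlib.sha256(visible_text(text).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    original = "This is an important statement generated by an AI assistant."
    with_metadata = original + "\ufe01\ufe02"  # stand-in for an embedded payload
    tampered = with_metadata.replace("important", "unimportant")

    print(content_hash(with_metadata) == content_hash(original))  # hash ignores metadata
    print(content_hash(tampered) == content_hash(original))       # visible edit changes it
```

Because the hash is computed on the visible text only, embedding metadata does not invalidate it, while any edit to the readable content does.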
Our embedding engine uses a robust, multi-strategy approach to invisibly attach metadata to text using Unicode variation selectors (e.g., U+FE00-FE0F):
- Distributed Embedding (Primary): To maximize resilience, metadata characters are intelligently interleaved throughout the source text, attached to multiple characters.
- Append-to-End (Fallback): If the text is too short or lacks suitable characters for the primary strategy, the metadata is appended to the end of the string.
- Content Integrity: In both cases, the original visible text is preserved, and a "hard-binding" content hash in the manifest ensures its integrity.
This hybrid approach ensures that metadata can be reliably embedded in a wide variety of text content.
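A minimal sketch of the two strategies follows, assuming "suitable characters" means non-whitespace characters and that one selector is attached per character. Both assumptions, the function names, and the byte-to-selector mapping are illustrative and may differ from the actual embedding engine.

```python
def to_selectors(payload: bytes) -> list:
    """Assumed mapping: bytes 0-15 -> U+FE00.., bytes 16-255 -> U+E0100.."""
    return [chr(0xFE00 + b) if b < 16 else chr(0xE0100 + b - 16) for b in payload]


def from_selector(ch: str):
    """Return the byte a selector encodes, or None for ordinary characters."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def embed(text: str, payload: bytes) -> str:
    selectors = to_selectors(payload)
    targets = sum(1 for ch in text if not ch.isspace())
    if targets < len(selectors):
        # Fallback: not enough suitable characters, append everything at the end.
        return text + "".join(selectors)
    # Primary: attach one selector after each suitable character, in order.
    pending = iter(selectors)
    pieces = []
    for ch in text:
        pieces.append(ch)
        if not ch.isspace():
            pieces.append(next(pending, ""))
    return "".join(pieces)


def extract(text: str) -> bytes:
    """Read selectors in order of appearance; works for both strategies."""
    return bytes(b for b in map(from_selector, text) if b is not None)


def strip(text: str) -> str:
    """Recover the original visible text."""
    return "".join(ch for ch in text if from_selector(ch) is None)


if __name__ == "__main__":
    text = "This is a sample text generated by an AI model."
    distributed = embed(text, b"id=42")
    appended = embed("Hi", b"a longer payload")
    print(strip(distributed) == text, extract(distributed))
    print(appended.startswith("Hi"), extract(appended))
```

Note that this sketch would misread text that already contains variation selectors (for example, emoji presentation sequences); a production implementation needs a way to tell its own metadata apart from pre-existing selectors.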
For more detailed information, see our Content Hash and Embedding Technical Guide and Tamper Detection Guide.
This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE file for details. Commercial licensing options are also available - see our Licensing Guide for details.