Private AI: How to Run Models Without Exposing Your Data
Every AI API call sends your data to someone else's server. This guide covers the full spectrum of private AI approaches — from local inference to hardware-isolated enclaves — and shows how to run production AI workloads where not even the cloud operator can see your prompts or outputs.

Every time you send a prompt to an AI API, you're sending your data to someone else's infrastructure. The prompt, the context, the documents you attached — all of it travels to a server you don't control, gets processed in memory you can't inspect, and (in some cases) gets logged, stored, or used for training.
For personal use, this is a tradeoff most people accept. For enterprise workloads — medical records, financial models, legal documents, proprietary code, customer data — it's a dealbreaker.
The emerging field of private AI addresses this directly: how do you get the capabilities of frontier AI models while keeping your data confidential?
This isn't a theoretical question anymore. Regulations are catching up. The EU AI Act is in force. HIPAA hasn't changed, but its enforcement around AI-processed health data has intensified. And enterprises are increasingly reluctant to send proprietary data to third-party inference APIs after high-profile data exposure incidents.
Here's the full landscape of private AI approaches, what each one actually gives you, and where the field is headed.
The Data Exposure Problem
When you call an AI inference API — OpenAI, Anthropic, Google, or any hosted model — the following happens:
1. Your prompt (including any documents, code, or data in context) is transmitted to the provider's servers
2. The data is decrypted in server memory for inference
3. The model processes it and generates a response
4. The response is transmitted back to you
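To make the exposure concrete, here is a sketch of what a hosted-inference call puts on the wire. The endpoint shape and field names are generic placeholders, not any particular provider's API:

```typescript
// Illustrative only: a generic hosted-inference request payload.
interface InferenceRequest {
  model: string;
  messages: { role: string; content: string }[];
}

// Everything passed here is serialized into the request body in plaintext.
// TLS protects it in transit, but the provider decrypts it server-side.
function buildRequest(prompt: string, attachedDocument: string): InferenceRequest {
  return {
    model: "some-hosted-model",
    messages: [
      { role: "user", content: `${prompt}\n\n---\n${attachedDocument}` },
    ],
  };
}

const req = buildRequest(
  "Summarize this contract",
  "CONFIDENTIAL: ...full document text..."
);
// The full document text is right there in the payload the provider receives:
console.log(JSON.stringify(req).includes("CONFIDENTIAL")); // true
```

Whatever you put in context, the provider's servers receive it verbatim.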
During step 2, your data exists in plaintext on hardware you don't own, in a data center you've never audited, managed by employees you've never vetted. The provider promises they won't look at it (via their privacy policy and DPA), but you can't verify that promise cryptographically.
For most API providers:
- Data retention varies. Some providers retain inputs for 30 days for abuse monitoring. Some claim zero retention on API plans. The actual enforcement is trust-based.
- Subprocessors add risk. Your data may pass through load balancers, logging systems, and monitoring tools operated by third parties.
- Legal jurisdiction matters. If the provider is US-based, US law enforcement can compel data access. If your data is subject to GDPR, you need a legal basis for this transfer.
This isn't fear-mongering — it's the standard threat model that any security team evaluates before adopting AI infrastructure.
Approach 1: Run Models Locally
The most intuitive approach. Download an open-source model and run it on your own hardware. Tools like Ollama, LM Studio, and llama.cpp have made this remarkably accessible.
What you get:
- Complete data control — prompts never leave your machine
- No API costs
- Works offline
- Full model customization (quantization, fine-tuning, LoRA adapters)
What you give up:
- Model quality. The best open-source models (Llama 3, Mistral, DeepSeek) are impressive but still trail frontier proprietary models on complex reasoning, coding, and multi-step tasks. For many use cases this gap is narrowing. For some, it's already closed.
- Hardware requirements. Running a 70B parameter model requires serious GPU infrastructure. A100/H100 GPUs cost $10,000-$40,000 each. Quantized models run on consumer hardware, but at some cost in output quality.
- Operational burden. You're responsible for updates, scaling, GPU drivers, CUDA compatibility, and monitoring. This is a real engineering cost.
- No attestation. You can verify your own hardware, but you can't prove to a third party what model ran or that the computation wasn't tampered with. There's no cryptographic audit trail.
Best for: Development and prototyping. Single-user or small-team use cases where model quality requirements are met by open-source models and you have adequate hardware.
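For development, querying a locally running Ollama server is a few lines. A minimal sketch, assuming Ollama is running on its default port and `llama3` has been pulled (both are assumptions about your local setup):

```typescript
// Sketch: calling a local Ollama server. The prompt never leaves localhost,
// so no third party ever sees it.
interface OllamaRequest {
  model: string;
  prompt: string;
  stream: boolean;
}

function buildOllamaRequest(prompt: string): OllamaRequest {
  // stream: false returns one JSON object instead of a token stream
  return { model: "llama3", prompt, stream: false };
}

async function generate(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildOllamaRequest(prompt)),
  });
  const data = await res.json();
  return data.response; // Ollama returns the completion in `response`
}

// Usage (requires a running Ollama instance):
// const answer = await generate("Summarize this internal memo: ...");
```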
Approach 2: Private API Agreements (Enterprise Tiers)
OpenAI, Anthropic, Google, and Azure all offer enterprise tiers with contractual privacy guarantees: no data retention, no training on your inputs, dedicated infrastructure, and custom Data Processing Agreements.
What you get:
- Frontier model quality (GPT-4, Claude, Gemini)
- Contractual (legal) privacy guarantees
- Usually SOC2 Type II certified infrastructure
- Dedicated support and SLAs
What you give up:
- Guarantees are legal, not technical. The provider agrees not to use your data. You trust they comply. You cannot verify it cryptographically. A rogue employee, a misconfigured logging pipeline, or a government subpoena can still expose your data.
- Cost. Enterprise AI plans range from $60,000 to $500,000+ per year depending on the provider and usage.
- Vendor lock-in. Your workflows become dependent on one provider's API, models, and pricing.
- Compliance gray areas. For HIPAA, GDPR, and financial regulations, "the provider promises they won't look" may not satisfy auditors. Some regulatory frameworks require technical controls, not just contractual ones.
Best for: Large enterprises with established vendor management programs, where the legal department is comfortable with contractual guarantees and the use case doesn't involve the most sensitive data categories (PHI, classified, etc.).
Approach 3: Confidential AI Inference in TEEs
This is the approach that changes the trust model fundamentally. Instead of trusting the provider's promise not to look at your data, you verify it with hardware-backed cryptographic proof.
A Trusted Execution Environment (TEE) creates a hardware-isolated enclave where the AI model runs. The CPU encrypts the enclave's memory with keys that only the hardware controls. The host operating system, the hypervisor, the cloud operator, and every other process on the machine cannot access the enclave's memory — even with root access.
How confidential AI inference works:
1. The AI model is loaded into a TEE (e.g., an AWS Nitro Enclave)
2. Your prompt is sent to the enclave over an encrypted channel
3. Inference runs inside the enclave — data is decrypted only within the hardware boundary
4. The response is encrypted and returned to you
5. The enclave produces an attestation document proving what model ran and that the environment wasn't tampered with
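The verification step on the client side amounts to comparing measurements. A minimal sketch, assuming the attestation document has already been decoded and its signature chain checked (field names here are illustrative; a real Nitro attestation document is CBOR-encoded, COSE-signed, and must be verified against the AWS Nitro root certificate first):

```typescript
// Illustrative attestation check: PCR measurements + nonce freshness.
interface AttestationDocument {
  pcrs: Record<number, string>; // PCR index -> hex-encoded measurement
  nonce: string;
}

function verifyMeasurements(
  doc: AttestationDocument,
  expectedPcrs: Record<number, string>,
  expectedNonce: string
): boolean {
  // The nonce proves freshness: this document was produced for this
  // request, not replayed from an earlier session.
  if (doc.nonce !== expectedNonce) return false;
  // Every expected PCR must match exactly. PCR0 fingerprints the enclave
  // image, so a mismatch means different code or a different model is running.
  for (const [index, expected] of Object.entries(expectedPcrs)) {
    if (doc.pcrs[Number(index)] !== expected) return false;
  }
  return true;
}
```

Only after this check passes do you send any sensitive data to the enclave.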
What you get:
- Technical privacy guarantee. Not a promise — a hardware-enforced boundary. The cloud operator cannot see your prompts or outputs, verified by CPU-level isolation.
- Attestation. Cryptographic proof of exactly what model and code ran. The attestation document includes Platform Configuration Registers (PCRs) that fingerprint the enclave image, kernel, and application. You can verify this remotely before sending any data.
- Compliance evidence. Attestation documents map directly to HIPAA, SOC2, FIPS, and Common Criteria audit requirements. You can prove technically that data was processed in a controlled environment, not just claim it contractually.
- Open-source or proprietary models. You can run any model that fits in the enclave's allocated memory — Llama, Mistral, DeepSeek, or a proprietary fine-tuned model.
- Scalability. Enclaves run on standard cloud infrastructure (EC2 instances with Nitro support). You can scale horizontally like any other cloud workload.
What you give up:
- Hardware trust assumption. You trust the hardware vendor (AWS for Nitro, Intel for SGX/TDX, AMD for SEV-SNP) to correctly implement the isolation. If the hardware has a vulnerability, the boundary can be broken. This is a real risk, but the attack surface is orders of magnitude smaller than trusting an entire software stack.
- Resource constraints. Enclaves have limited memory. Running a 70B model requires large enclave allocations. Smaller quantized models (7B-13B) fit more comfortably.
- Added latency. Enclave setup and attestation verification add a small overhead to the first request. Subsequent inference latency is near-native.
Best for: Healthcare AI (HIPAA), financial modeling, legal document analysis, processing classified or regulated data, any use case where you need both high model quality and verifiable privacy.
Approach 4: FHE-Based Private Inference
Fully Homomorphic Encryption (FHE) allows running inference on encrypted data — the model processes ciphertexts and produces encrypted results without ever decrypting the input. The server mathematically cannot see your data.
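FHE itself is far too heavy to sketch here, but the core property (the server computes on ciphertexts it can never decrypt) can be illustrated with the much simpler, additively homomorphic Paillier scheme. This is a toy with insecure parameters, not FHE and not production cryptography:

```typescript
// Toy Paillier: multiplying two ciphertexts yields an encryption of the
// SUM of the plaintexts, so a server can add numbers it cannot see.
// Tiny, insecure parameters chosen for readability.
const p = 11n, q = 13n;
const n = p * q;        // public modulus (143)
const n2 = n * n;
const g = n + 1n;       // standard choice g = n + 1
const lambda = 60n;     // lcm(p-1, q-1) = lcm(10, 12)

function modPow(base: bigint, exp: bigint, mod: bigint): bigint {
  let result = 1n;
  base %= mod;
  while (exp > 0n) {
    if (exp & 1n) result = (result * base) % mod;
    base = (base * base) % mod;
    exp >>= 1n;
  }
  return result;
}

function modInv(a: bigint, mod: bigint): bigint {
  // Extended Euclidean algorithm
  let [oldR, r] = [a % mod, mod];
  let [oldS, s] = [1n, 0n];
  while (r !== 0n) {
    const quot = oldR / r;
    [oldR, r] = [r, oldR - quot * r];
    [oldS, s] = [s, oldS - quot * s];
  }
  return ((oldS % mod) + mod) % mod;
}

const L = (x: bigint) => (x - 1n) / n;
const mu = modInv(L(modPow(g, lambda, n2)), n);

function encrypt(m: bigint, r: bigint): bigint {
  // c = g^m * r^n mod n^2, with r random and coprime to n
  return (modPow(g, m, n2) * modPow(r, n, n2)) % n2;
}

function decrypt(c: bigint): bigint {
  return (L(modPow(c, lambda, n2)) * mu) % n;
}

// Homomorphic addition: the "server" multiplies ciphertexts, never decrypting.
const c1 = encrypt(5n, 7n);
const c2 = encrypt(12n, 23n);
const cSum = (c1 * c2) % n2;
console.log(decrypt(cSum)); // 17n
```

FHE generalizes this to arbitrary additions *and* multiplications, which is what makes neural network inference expressible — and what makes it so expensive.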
What you get:
- The strongest possible privacy guarantee — cryptographic, not hardware-based
- The server never sees plaintext at any point during computation
What you give up:
- Impractical performance for most AI workloads. FHE inference on a transformer model is currently 10,000-1,000,000x slower than plaintext inference. A query that takes 100ms normally could take hours under FHE.
- Model architecture constraints. Not all neural network operations are efficiently expressible in FHE circuits. Attention mechanisms, softmax, and non-linear activations require expensive approximations.
- Active research. Companies like Zama are making progress with Concrete-ML, but production-grade FHE inference for large language models is still years away.
Best for: Niche use cases with small models and high latency tolerance. Primarily an academic and research area today, but likely to become practical within 3-5 years as hardware accelerators mature.
Comparison Matrix
| Dimension | Local Models | Enterprise API | TEE Inference | FHE Inference |
|---|---|---|---|---|
| Privacy guarantee | Physical (your hardware) | Contractual (legal) | Hardware (CPU isolation) | Cryptographic (math) |
| Verifiable by third party? | No | No (audit only) | Yes (attestation) | Yes (cryptographic proof) |
| Model quality | Open-source only | Frontier (GPT-4, Claude) | Any (open-source or custom) | Limited (small models only) |
| Inference speed | Depends on hardware | Fast (optimized infra) | Near-native | 10,000x+ slower |
| Operational complexity | High (self-managed) | Low (managed service) | Medium (platform-managed) | High (specialized) |
| Regulatory compliance | You manage everything | Provider's certifications | Attestation as evidence | Strongest guarantee |
| Cost model | Hardware upfront | $60K-500K+/year | Cloud compute pricing | Extreme compute costs |
| Production ready? | Yes | Yes | Yes | No (research stage) |
The Real-World Use Cases Driving Adoption
Healthcare: AI on Patient Data
A hospital system wants to use AI to summarize patient records, flag medication interactions, and assist with diagnosis. The data is PHI (Protected Health Information) under HIPAA.
- Local models require every hospital to run its own GPU infrastructure. Expensive and hard to maintain.
- Enterprise API requires a BAA (Business Associate Agreement) with the AI provider. The provider sees PHI in plaintext during inference. Some compliance officers are uncomfortable with this.
- TEE inference keeps PHI encrypted in transit and hardware-isolated during processing. The attestation document proves the data was processed in a HIPAA-compliant enclave. Auditors can verify the PCR measurements independently.
Finance: Proprietary Trading Models
A hedge fund runs proprietary models on market data and internal signals. The model architecture and input features are trade secrets.
- Enterprise API means sending proprietary signals to a third-party server. Even with a DPA, the fund's competitive advantage depends on nobody seeing these inputs.
- TEE inference lets the fund deploy its model inside an enclave on cloud infrastructure without exposing the model weights or input data to the cloud provider. The fund verifies attestation to confirm the enclave is running their exact model.
Legal: Document Analysis
A law firm uses AI to review contracts, identify risk clauses, and summarize depositions. The documents are subject to attorney-client privilege.
- Sending privileged documents to any third-party API — even with contractual guarantees — creates a waiver risk. Some jurisdictions hold that sharing privileged data with a third-party technology provider waives privilege.
- TEE inference means the AI processes documents inside hardware isolation. The cloud provider never has access. The firm can demonstrate technical controls to satisfy privilege obligations.
Running Private AI with Treza
Treza turns the TEE approach into a deployable platform. You bring a Docker container with your model, and Treza handles the enclave lifecycle, attestation, and verification.
```typescript
import { TrezaClient } from '@treza/sdk';

const treza = new TrezaClient({
  baseUrl: 'https://app.trezalabs.com',
});

// Deploy an inference server into a Nitro Enclave
const enclave = await treza.createEnclave({
  name: 'private-inference',
  region: 'us-east-1',
  walletAddress: '0xYourWallet...',
  providerId: 'aws-nitro',
  providerConfig: {
    dockerImage: 'myorg/llama-inference:latest',
    cpuCount: 4,
    memoryMiB: 4096,
  },
});

// Verify the enclave is running your exact model image
const verification = await treza.verifyAttestation(enclave.id, {
  nonce: 'audit-' + Date.now(),
});

console.log('Model integrity verified:', verification.isValid);
console.log('HIPAA compliant:', verification.complianceChecks.hipaa);
console.log('SOC2 compliant:', verification.complianceChecks.soc2);
```

The model runs inside the enclave at near-native speed. Your prompts and outputs never leave the hardware boundary. The attestation proves to regulators, auditors, and clients exactly what code processed their data.
For AI agents that need to pay for private inference, Treza supports x402 micropayments — agents discover and pay for enclave services autonomously using USDC on Base, with the payment key secured inside the same TEE.
Where Private AI Is Headed
Three trends are converging:
Open-source models are closing the gap. Llama 3, Mistral, DeepSeek-V3, and Qwen have reached quality levels where running them privately inside enclaves produces results comparable to proprietary APIs for most enterprise tasks. The argument "we need GPT-4 quality" weakens every six months.
Confidential computing is becoming default infrastructure. AWS, Azure, and GCP are all investing heavily in TEE-based offerings. Azure Confidential Computing, Google Confidential Compute, and AWS Nitro Enclaves are all GA. Within 2-3 years, running AI in a TEE will be as routine as running it in a container today.
Regulation is creating demand. The EU AI Act, evolving HIPAA enforcement guidance, and sector-specific regulations (FINRA for finance, FDA for medical AI) are all moving toward requiring technical controls for AI data processing — not just contractual ones. Organizations that adopt private AI infrastructure now will be ahead of the compliance curve.
The question is shifting from "should we use private AI?" to "how quickly can we deploy it?"
Further Reading
- What Is a TEE? Complete Guide — How hardware-isolated enclaves work
- MPC vs TEE vs FHE — Comparing privacy-preserving computation approaches
- Treza SDK on GitHub — Deploy private AI workloads in enclaves
- Ollama — Run models locally for development
- Confidential Computing Consortium — Industry initiative for TEE adoption
- EU AI Act — Regulatory framework for AI in Europe
Treza builds privacy infrastructure for crypto and finance. Deploy workloads in hardware-secured enclaves with cryptographic proof of integrity. Learn more.