Deploy PyTorch models directly to edge devices. Text, vision, and audio AI with privacy-preserving, real-time inference — no cloud required.

Get Started View on GitHub

Why On-Device AI Matters

Enhanced Privacy

Data never leaves the device. Process personal content, conversations, and media locally without cloud exposure.

Real-Time Response

Instant inference with no network round-trips. Perfect for AR/VR experiences, multimodal AI interactions, and responsive conversational agents.

Offline & Low-Bandwidth Ready

Zero network dependency for inference. Works seamlessly in low-bandwidth regions, remote areas, or completely offline.


Cost Efficient

No cloud compute bills. No API rate limits. Scale to billions of users without infrastructure costs growing linearly.

Models Are Getting Smaller & Smarter

The convergence of efficient architectures and edge hardware creates new opportunities

Dramatically Smaller
Modern LLMs achieve high quality at a fraction of historical sizes
Edge-Ready Performance
Real-time inference on consumer smartphones
Quantization Benefits
Significant size reduction while preserving accuracy

The opportunity is now: Foundation models have crossed the efficiency threshold. Deploy sophisticated AI directly where data lives.

Why On-Device AI Was Hard

Power Constraints

From battery-powered phones to energy-harvesting sensors, edge devices have strict power budgets. Microcontrollers may run on milliwatts, requiring extreme efficiency.

Thermal Management

Sustained inference generates heat without active cooling. From smartphones to industrial IoT devices, thermal throttling limits continuous AI workloads.

Memory Limitations

Edge devices range from high-end phones to tiny microcontrollers. Beyond capacity, limited memory bandwidth creates bottlenecks when moving tensors between compute units.

Hardware Heterogeneity

From microcontrollers to smartphone NPUs to embedded GPUs, each architecture demands unique optimizations, making broad deployment across diverse form factors extremely challenging.

PyTorch Powers >90% of AI Research

But deploying PyTorch models to edge devices meant losing everything that made PyTorch great

Research & Training

PyTorch's intuitive APIs and eager execution power breakthrough research

The Conversion Nightmare

Multiple intermediate formats, custom runtimes, C++ rewrites

The Hidden Costs of Conversion (Status Quo)

Lost Semantics

PyTorch operations don't map 1:1 to other formats

Debugging Nightmare

Can't trace errors back to original PyTorch code

Vendor-Specific Formats

Locked into proprietary formats with limited operator support

Language Barriers

Teams spend months rewriting Python models in C++ for production

ExecuTorch
PyTorch's On-Device AI Framework

No Conversions

Direct export from PyTorch to edge. Core ATen operators preserved. No intermediate formats, no vendor lock-in.

Ahead-of-Time Compilation

Optimize models offline for target device capabilities. Hardware-specific performance tuning before deployment.

Modular by Design

Pick and choose optimization steps. Composable at both compile-time and runtime for maximum flexibility.

Hardware Ecosystem

Fully open source with hardware partner contributions. Built on PyTorch's standardized IR and operator set.

Embedded-Friendly Runtime

Portable C++ runtime that runs on everything from microcontrollers to smartphones.

PyTorch Ecosystem

Native integration with PyTorch ecosystem, including torchao for quantization. Stay in familiar tools throughout.
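For example, weights can be quantized with torchao before export, and the model then follows the normal export flow. A minimal sketch, assuming torchao's quantize_ API with the int8 weight-only recipe (one of several options; the tiny Sequential model is only a stand-in for your own nn.Module):

import torch
from torchao.quantization import quantize_, int8_weight_only

# Placeholder module; substitute your own nn.Module
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU()).eval()

# Quantize weights to int8 in place with torchao (one possible recipe)
quantize_(model, int8_weight_only())

# The quantized model goes through the same torch.export flow as any other
example_inputs = (torch.randn(1, 128),)
exported_program = torch.export.export(model, example_inputs)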

Simple as 1-2-3

Export, optimize, and run PyTorch models on edge devices

1. Export Your PyTorch Model

import torch

# Your existing PyTorch model
model = MyModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Export to create semantically equivalent graph
exported_program = torch.export.export(model, example_inputs)
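MyModel above is a placeholder for any eager nn.Module that torch.export can trace. As an illustration (torchvision is an assumption here, not part of the original example), a stock image classifier exports the same way:

import torch
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights

# Stock torchvision classifier standing in for MyModel
model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

exported_program = torch.export.export(model, example_inputs)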

2. Optimize for Target Hardware

Switch between backends with a single-line change; backends can also be combined, as shown after the examples below.

Choose your target hardware to see the corresponding code. Backends include CPU optimization (XNNPACK with Arm Kleidi), Apple devices (Core ML partitioner), the Qualcomm® AI Engine (Qualcomm® Hexagon™ NPU), and 9+ more, including Vulkan, MediaTek, and Samsung.

CPU optimization (XNNPACK with Arm Kleidi):

from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()]
).to_executorch()

# Save to .pte file
with open("model.pte", "wb") as f:
    f.write(program.buffer)

Apple devices (Core ML partitioner):

from executorch.exir import to_edge_transform_and_lower
from executorch.backends.apple.coreml.partition.coreml_partitioner import CoreMLPartitioner

program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[CoreMLPartitioner()]
).to_executorch()

# Save to .pte file
with open("model.pte", "wb") as f:
    f.write(program.buffer)

Qualcomm® AI Engine (Qualcomm® Hexagon™ NPU):

from executorch.exir import to_edge_transform_and_lower
from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner

program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[QnnPartitioner()]
).to_executorch()

# Save to .pte file
with open("model.pte", "wb") as f:
    f.write(program.buffer)
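Because the partitioner argument is a list, backends can also be composed. A hedged sketch (delegation order and operator coverage depend on the specific backends): let Core ML claim the operators it supports and leave the rest to XNNPACK on the CPU.

from executorch.exir import to_edge_transform_and_lower
from executorch.backends.apple.coreml.partition.coreml_partitioner import CoreMLPartitioner
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Partitioners are tried in order; operators not claimed by Core ML
# can still be lowered to the XNNPACK CPU backend.
program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[CoreMLPartitioner(), XnnpackPartitioner()]
).to_executorch()

with open("model.pte", "wb") as f:
    f.write(program.buffer)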

3. Run on Any Platform

Choose your platform to see the native API: C++, Swift, Kotlin, Objective-C, or WebAssembly.

C++:

#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

Module module("model.pte");
auto tensor = make_tensor_ptr({2, 2}, {1.0f, 2.0f, 3.0f, 4.0f});
auto outputs = module.forward(tensor);

Swift:

import ExecuTorch

let module = Module(filePath: "model.pte")
let input = Tensor<Float>([1.0, 2.0, 3.0, 4.0], shape: [2, 2])
let outputs = try module.forward(input)

Kotlin:

val module = Module.load("model.pte")
val inputTensor = Tensor.fromBlob(floatArrayOf(1.0f, 2.0f, 3.0f, 4.0f), longArrayOf(2, 2))
val outputs = module.forward(EValue.from(inputTensor))

Objective-C:

#import <ExecuTorch/ExecuTorch.h>

NSString *modelPath = [[NSBundle mainBundle] pathForResource:@"model" ofType:@"pte"];
ExecuTorchModule *module = [[ExecuTorchModule alloc] initWithFilePath:modelPath];

float data[] = {1.0f, 2.0f, 3.0f, 4.0f};
ExecuTorchTensor *input = [[ExecuTorchTensor alloc] initWithBytes:data
                                                            shape:@[@2, @2]
                                                         dataType:ExecuTorchDataTypeFloat];
NSArray<ExecuTorchValue *> *outputs = [module forwardWithTensor:input error:nil];

WebAssembly (JavaScript):

// Load model from file or buffer
const module = et.Module.load("model.pte");
// Create input tensor from array
const input = et.Tensor.fromArray([2, 2], [1.0, 2.0, 3.0, 4.0]);
// Run inference
const outputs = module.forward([input]);

Available on Android, iOS, Linux, Windows, macOS, and embedded targets such as DSPs and Cortex-M microcontrollers

Need advanced features? ExecuTorch supports memory planning, quantization, profiling, and custom compiler passes.
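As one example, memory planning can be configured when emitting the final program. A minimal sketch, assuming the default MemoryPlanningPass settings are acceptable for the model:

from executorch.exir import ExecutorchBackendConfig, to_edge_transform_and_lower
from executorch.exir.passes import MemoryPlanningPass
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Plan tensor lifetimes and buffer placement ahead of time so the runtime
# does not need dynamic allocation during inference.
program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()]
).to_executorch(
    ExecutorchBackendConfig(
        memory_planning_pass=MemoryPlanningPass()
    )
)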

Try the Full Tutorial →

High-Level Multimodal APIs

Run complex multimodal LLMs through simplified native interfaces in C++, Swift, and Kotlin

Multimodal Runner: Text + Vision + Audio in One API

Choose your platform to see the multimodal API supporting text, images, and audio. The API is unified across mobile platforms: C++ (cross-platform), Swift (iOS native), and Kotlin (Android native).

C++ (cross-platform):

#include <executorch/extension/llm/runner/multimodal_runner.h>

// Create multimodal runner (LLaVA, Voxtral, etc.)
auto tokenizer = load_tokenizer("tokenizer.model");
auto runner = create_multimodal_runner(
    "llava.pte", std::move(tokenizer)
);

// Build multimodal inputs (text + image)
std::vector<MultimodalInput> inputs;
inputs.emplace_back(make_text_input("Describe this image:"));
inputs.emplace_back(make_image_input(std::move(image)));

GenerationConfig config;
config.max_new_tokens = 100;

// Generate with streaming callback
runner->generate(inputs, config,
    [](std::string token) { std::cout << token; }
);

Swift (iOS):

import ExecuTorch
import AVFoundation

// Initialize multimodal runner with audio support
let runner = try MultimodalRunner(
    modelPath: "model.pte",
    visionPath: "vision.pte",
    audioPath: "audio.pte",
    tokenizerPath: tokenizerPath,
    temperature: 0.7
)

// Process audio and image inputs
let audioTensor = AudioProcessor.preprocess(audioURL)
let imageTensor = ImageProcessor.preprocess(uiImage)

// Generate with audio + vision + text
let result = try runner.generateMultimodal(
    prompt: "Describe what you hear and see",
    audio: audioTensor,
    image: imageTensor,
    maxTokens: 512
)

// Stream tokens to UI
result.tokens.forEach { token in
    DispatchQueue.main.async {
        responseText += token
    }
}

Kotlin (Android):

import org.pytorch.executorch.MultimodalRunner
import android.media.MediaRecorder

// Initialize multimodal runner with audio
val runner = MultimodalRunner.create(
    modelPath = "model.pte",
    visionPath = "vision.pte",
    audioPath = "audio.pte",
    tokenizerPath = tokenizerPath,
    temperature = 0.7f
)

// Process audio and image inputs
val audioTensor = AudioProcessor.preprocess(audioFile)
val imageTensor = ImageProcessor.preprocess(bitmap)

// Generate with audio + vision + text
val result = runner.generateMultimodal(
    prompt = "Describe what you hear and see",
    audio = audioTensor,
    image = imageTensor,
    maxTokens = 512
)

// Display streaming response
result.tokens.forEach { token ->
    runOnUiThread {
        responseView.append(token)
    }
}

High-level APIs abstract away model complexity: just load, prompt, and get results.

Explore LLM APIs →

Universal AI Runtime

💬 LLMs 👁️ Computer Vision 🎤 Speech AI 🎯 Recommendations 🧠 Multimodal ⚡ Any PyTorch Model

Comprehensive Hardware Ecosystem

12+ hardware backends with acceleration contributed by industry partners via open source

XNNPACK with Arm Kleidi

CPU acceleration across Arm and x86 architectures

Apple Core ML for Apple silicon

Neural Engine and Apple silicon optimization

Qualcomm® AI Engine for Qualcomm® Hexagon™ NPU

Hardware-accelerated AI inference on Qualcomm platforms

Arm Ethos-U NPU

Microcontroller NPU for ultra-low power

Vulkan GPU

Cross-platform graphics acceleration

OpenVINO from Intel

x86 CPU and integrated GPU optimization

MediaTek NPU

Dimensity chipset acceleration

Samsung Exynos NPU

Integrated NPU optimization

NXP Semiconductors' eIQ® Neutron NPU

Automotive and IoT acceleration

Apple Metal Performance Shaders (MPS)

GPU acceleration on macOS and iOS

Arm VGF

Versatile graphics framework support

Cadence DSP

Digital signal processor optimization

→ View detailed backend documentation

Success Stories

Production deployments and strategic partnerships accelerating edge AI

Production Deployments

  • Meta Family of Apps: Production deployment across Instagram, Facebook, and WhatsApp serving billions of users
  • Meta Reality Labs: Powers Quest 3 VR and Ray-Ban Meta Smart Glasses AI experiences

Ecosystem Integration

  • Hugging Face: Optimum-ExecuTorch for direct transformer model deployment
  • LiquidAI: Next-generation Liquid Foundation Models optimized for edge deployment
  • Software Mansion: React Native ExecuTorch bringing edge AI to mobile apps

Examples & Models

→ View all success stories

Ready to Deploy AI at the Edge?

Join thousands of developers using ExecuTorch in production

Get Started Today