Deploy PyTorch models directly to edge devices. Text, vision, and audio AI with privacy-preserving, real-time inference — no cloud required.

Get Started View on GitHub

Why On-Device AI Matters

Enhanced Privacy

Data never leaves the device. Process personal content, conversations, and media locally without cloud exposure.

Real-Time Response

Instant inference with no network round-trips. Perfect for AR/VR experiences, multimodal AI interactions, and responsive conversational agents.

Offline & Low-Bandwidth Ready

Zero network dependency for inference. Works seamlessly in low-bandwidth regions, remote areas, or completely offline.


Cost Efficient

No cloud compute bills. No API rate limits. Scale to billions of users without infrastructure costs growing linearly.

Models Are Getting Smaller & Smarter

The convergence of efficient architectures and edge hardware creates new opportunities

Dramatically Smaller
Modern LLMs achieve high quality at a fraction of historical sizes
Edge-Ready Performance
Real-time inference on consumer smartphones
Quantization Benefits
Significant size reduction while preserving accuracy

The opportunity is now: Foundation models have crossed the efficiency threshold. Deploy sophisticated AI directly where data lives.

Why On-Device AI Was Hard

Power Constraints

From battery-powered phones to energy-harvesting sensors, edge devices have strict power budgets. Microcontrollers may run on milliwatts, requiring extreme efficiency.

Thermal Management

Sustained inference generates heat without active cooling. From smartphones to industrial IoT devices, thermal throttling limits continuous AI workloads.

Memory Limitations

Edge devices range from high-end phones to tiny microcontrollers. Beyond capacity, limited memory bandwidth creates bottlenecks when moving tensors between compute units.

Hardware Heterogeneity

From microcontrollers to smartphone NPUs to embedded GPUs, each architecture demands unique optimizations, making broad deployment across diverse form factors extremely challenging.

PyTorch Powers >90% of AI Research

But deploying PyTorch models to edge devices meant losing everything that made PyTorch great

Research & Training

PyTorch's intuitive APIs and eager execution power breakthrough research

The Conversion Nightmare

Multiple intermediate formats, custom runtimes, C++ rewrites

The Hidden Costs of Conversion (Status Quo)

Lost Semantics

PyTorch operations don't map 1:1 to other formats

Debugging Nightmare

Can't trace errors back to original PyTorch code

Vendor-Specific Formats

Locked into proprietary formats with limited operator support

Language Barriers

Teams spend months rewriting Python models in C++ for production

ExecuTorch
PyTorch's On-Device AI Framework

No Conversions

Direct export from PyTorch to edge. Core ATen operators preserved. No intermediate formats, no vendor lock-in.

Ahead-of-Time Compilation

Optimize models offline for target device capabilities. Hardware-specific performance tuning before deployment.

Modular by Design

Pick and choose optimization steps. Composable at both compile-time and runtime for maximum flexibility.

Hardware Ecosystem

Fully open source with hardware partner contributions. Built on PyTorch's standardized IR and operator set.

Embedded-Friendly Runtime

Portable C++ runtime that runs on everything from microcontrollers to smartphones.

PyTorch Ecosystem

Native integration with PyTorch ecosystem, including torchao for quantization. Stay in familiar tools throughout.
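For example, weights can be quantized with torchao before export, and the model then follows the normal export flow. A minimal sketch, assuming torchao's quantize_ API with the int8 weight-only recipe (one of several options; the tiny Sequential model is only a stand-in for your own nn.Module):

import torch
from torchao.quantization import quantize_, int8_weight_only

# Placeholder module; substitute your own nn.Module
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU()).eval()

# Quantize weights to int8 in place with torchao (one possible recipe)
quantize_(model, int8_weight_only())

# The quantized model goes through the same torch.export flow as any other
example_inputs = (torch.randn(1, 128),)
exported_program = torch.export.export(model, example_inputs)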

Simple as 1-2-3

Export, optimize, and run PyTorch models on edge devices

1. Export Your PyTorch Model

import torch

# Your existing PyTorch model
model = MyModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Export to create semantically equivalent graph
exported_program = torch.export.export(model, example_inputs)
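MyModel above is a placeholder for any eager nn.Module that torch.export can trace. As an illustration (torchvision is an assumption here, not part of the original example), a stock image classifier exports the same way:

import torch
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights

# Stock torchvision classifier standing in for MyModel
model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

exported_program = torch.export.export(model, example_inputs)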

2. Optimize for Target Hardware

Switch between backends with a single-line change; backends can also be combined, as shown after the examples below.

Choose your target hardware to see the corresponding code. Backends include CPU optimization (XNNPACK with Arm Kleidi), Apple devices (Core ML partitioner), the Qualcomm® AI Engine (Qualcomm® Hexagon™ NPU), and 9+ more, including Vulkan, MediaTek, and Samsung.

CPU optimization (XNNPACK with Arm Kleidi):

from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()]
).to_executorch()

# Save to .pte file
with open("model.pte", "wb") as f:
    f.write(program.buffer)

Apple devices (Core ML partitioner):

from executorch.exir import to_edge_transform_and_lower
from executorch.backends.apple.coreml.partition.coreml_partitioner import CoreMLPartitioner

program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[CoreMLPartitioner()]
).to_executorch()

# Save to .pte file
with open("model.pte", "wb") as f:
    f.write(program.buffer)

Qualcomm® AI Engine (Qualcomm® Hexagon™ NPU):

from executorch.exir import to_edge_transform_and_lower
from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner

program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[QnnPartitioner()]
).to_executorch()

# Save to .pte file
with open("model.pte", "wb") as f:
    f.write(program.buffer)
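Because the partitioner argument is a list, backends can also be composed. A hedged sketch (delegation order and operator coverage depend on the specific backends): let Core ML claim the operators it supports and leave the rest to XNNPACK on the CPU.

from executorch.exir import to_edge_transform_and_lower
from executorch.backends.apple.coreml.partition.coreml_partitioner import CoreMLPartitioner
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Partitioners are tried in order; operators not claimed by Core ML
# can still be lowered to the XNNPACK CPU backend.
program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[CoreMLPartitioner(), XnnpackPartitioner()]
).to_executorch()

with open("model.pte", "wb") as f:
    f.write(program.buffer)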

3. Run on Any Platform

Choose your platform to see the native API: C++, Swift, Kotlin, Objective-C, or WebAssembly.

C++:

#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

Module module("model.pte");
auto tensor = make_tensor_ptr({2, 2}, {1.0f, 2.0f, 3.0f, 4.0f});
auto outputs = module.forward(tensor);

Swift:

import ExecuTorch

let module = Module(filePath: "model.pte")
let input = Tensor<Float>([1.0, 2.0, 3.0, 4.0], shape: [2, 2])
let outputs = try module.forward(input)

Kotlin:

val module = Module.load("model.pte")
val inputTensor = Tensor.fromBlob(floatArrayOf(1.0f, 2.0f, 3.0f, 4.0f), longArrayOf(2, 2))
val outputs = module.forward(EValue.from(inputTensor))

Objective-C:

#import <ExecuTorch/ExecuTorch.h>

NSString *modelPath = [[NSBundle mainBundle] pathForResource:@"model" ofType:@"pte"];
ExecuTorchModule *module = [[ExecuTorchModule alloc] initWithFilePath:modelPath];

float data[] = {1.0f, 2.0f, 3.0f, 4.0f};
ExecuTorchTensor *input = [[ExecuTorchTensor alloc] initWithBytes:data
                                                            shape:@[@2, @2]
                                                         dataType:ExecuTorchDataTypeFloat];
NSArray<ExecuTorchValue *> *outputs = [module forwardWithTensor:input error:nil];

WebAssembly (JavaScript):

// Load model from file or buffer
const module = et.Module.load("model.pte");
// Create input tensor from array
const input = et.Tensor.fromArray([2, 2], [1.0, 2.0, 3.0, 4.0]);
// Run inference
const outputs = module.forward([input]);

Available on Android, iOS, Linux, Windows, macOS, and embedded targets such as DSPs and Cortex-M microcontrollers

Need advanced features? ExecuTorch supports memory planning, quantization, profiling, and custom compiler passes.
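As one example, memory planning can be configured when emitting the final program. A minimal sketch, assuming the default MemoryPlanningPass settings are acceptable for the model:

from executorch.exir import ExecutorchBackendConfig, to_edge_transform_and_lower
from executorch.exir.passes import MemoryPlanningPass
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Plan tensor lifetimes and buffer placement ahead of time so the runtime
# does not need dynamic allocation during inference.
program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()]
).to_executorch(
    ExecutorchBackendConfig(
        memory_planning_pass=MemoryPlanningPass()
    )
)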

Try the Full Tutorial →

High-Level Multimodal APIs

Run complex multimodal LLMs through simplified native interfaces in C++, Swift, and Kotlin

Multimodal Runner: Text + Vision + Audio in One API

Choose your platform to see the multimodal API supporting text, images, and audio. The API is unified across mobile platforms: C++ (cross-platform), Swift (iOS native), and Kotlin (Android native).

C++ (cross-platform):

#include <executorch/extension/llm/runner/multimodal_runner.h>

// Create multimodal runner (LLaVA, Voxtral, etc.)
auto tokenizer = load_tokenizer("tokenizer.model");
auto runner = create_multimodal_runner(
    "llava.pte", std::move(tokenizer)
);

// Build multimodal inputs (text + image)
std::vector<MultimodalInput> inputs;
inputs.emplace_back(make_text_input("Describe this image:"));
inputs.emplace_back(make_image_input(std::move(image)));

GenerationConfig config;
config.max_new_tokens = 100;

// Generate with streaming callback
runner->generate(inputs, config,
    [](std::string token) { std::cout << token; }
);

Swift (iOS):

import ExecuTorch
import AVFoundation

// Initialize multimodal runner with audio support
let runner = try MultimodalRunner(
    modelPath: "model.pte",
    visionPath: "vision.pte",
    audioPath: "audio.pte",
    tokenizerPath: tokenizerPath,
    temperature: 0.7
)

// Process audio and image inputs
let audioTensor = AudioProcessor.preprocess(audioURL)
let imageTensor = ImageProcessor.preprocess(uiImage)

// Generate with audio + vision + text
let result = try runner.generateMultimodal(
    prompt: "Describe what you hear and see",
    audio: audioTensor,
    image: imageTensor,
    maxTokens: 512
)

// Stream tokens to UI
result.tokens.forEach { token in
    DispatchQueue.main.async {
        responseText += token
    }
}

Kotlin (Android):

import org.pytorch.executorch.MultimodalRunner
import android.media.MediaRecorder

// Initialize multimodal runner with audio
val runner = MultimodalRunner.create(
    modelPath = "model.pte",
    visionPath = "vision.pte",
    audioPath = "audio.pte",
    tokenizerPath = tokenizerPath,
    temperature = 0.7f
)

// Process audio and image inputs
val audioTensor = AudioProcessor.preprocess(audioFile)
val imageTensor = ImageProcessor.preprocess(bitmap)

// Generate with audio + vision + text
val result = runner.generateMultimodal(
    prompt = "Describe what you hear and see",
    audio = audioTensor,
    image = imageTensor,
    maxTokens = 512
)

// Display streaming response
result.tokens.forEach { token ->
    runOnUiThread {
        responseView.append(token)
    }
}

High-level APIs abstract away model complexity: just load, prompt, and get results.

Explore LLM APIs →

Universal AI Runtime

💬 LLMs 👁️ Computer Vision 🎤 Speech AI 🎯 Recommendations 🧠 Multimodal ⚡ Any PyTorch Model

Comprehensive Hardware Ecosystem

12+ hardware backends with acceleration contributed by industry partners via open source

XNNPACK with Arm Kleidi

CPU acceleration across Arm and x86 architectures

Apple Core ML for Apple silicon

Neural Engine and Apple silicon optimization

Qualcomm® AI Engine for Qualcomm® Hexagon™ NPU

Hardware-accelerated AI inference on Qualcomm platforms

Arm Ethos-U NPU

Microcontroller NPU for ultra-low power

Vulkan GPU

Cross-platform graphics acceleration

OpenVINO from Intel

x86 CPU and integrated GPU optimization

MediaTek NPU

Dimensity chipset acceleration

Samsung Exynos NPU

Integrated NPU optimization

NXP Semiconductors' eIQ® Neutron NPU

Automotive and IoT acceleration

Apple Metal Performance Shaders (MPS)

GPU acceleration on macOS and iOS

Arm VGF

Versatile graphics framework support

Cadence DSP

Digital signal processor optimization

→ View detailed backend documentation

Success Stories

Production deployments and strategic partnerships accelerating edge AI

Production Deployments

  • Meta Family of Apps: Production deployment across Instagram, Facebook, and WhatsApp serving billions of users
  • Meta Reality Labs: Powers Quest 3 VR and Ray-Ban Meta Smart Glasses AI experiences

Ecosystem Integration

  • Hugging Face: Optimum-ExecuTorch for direct transformer model deployment
  • LiquidAI: Next-generation Liquid Foundation Models optimized for edge deployment
  • Software Mansion: React Native ExecuTorch bringing edge AI to mobile apps

Examples & Models

→ View all success stories

Ready to Deploy AI at the Edge?

Join thousands of developers using ExecuTorch in production

Get Started Today