The On-Device Inference Landscape Has Fractured
Two years ago, the on-device AI decision was simple. If you were building for Apple, you used Core ML. If you needed cross-platform, you reached for TensorFlow Lite. That world is gone.
Apple introduced Core AI at WWDC 2028 as a higher-level framework purpose-built for running large language models on Apple Silicon. Meta shipped ExecuTorch as a production-ready PyTorch mobile runtime that works on iOS, Android, and embedded devices. Core ML is still around and still useful, but its role has narrowed. You now have three distinct frameworks competing for on-device inference workloads, each with different strengths, different model format requirements, and different hardware optimization paths.
The confusion is real. We have talked to teams who spent months integrating Core ML for an LLM workload that Core AI handles natively in a fraction of the code. We have seen teams lock themselves into Core AI only to realize they also need an Android app and now face a complete re-architecture. And we have watched teams pick ExecuTorch for simple vision tasks that Core ML handles with less overhead and better performance on Apple hardware.
This guide breaks down what each framework actually does, what model types and sizes each supports, how they map to hardware accelerators, and most importantly, when you should choose each one. No framework is universally better. The right choice depends on your model type, your target platforms, your performance requirements, and whether you value Apple ecosystem integration or cross-platform reach.
Core AI: Apple's LLM-Native Framework
Core AI is Apple's answer to a problem Core ML was never designed to solve: running multi-billion parameter language models efficiently on device. Introduced in iOS 19 and macOS 16, Core AI provides first-class support for autoregressive text generation, multimodal reasoning, and long-context inference on Apple Silicon.
The framework is built around a few key concepts. First, it ships with Apple's own foundation models pre-installed on every compatible device. You do not bundle a 2GB model in your app binary. The system provides access to models ranging from a 1B parameter distilled variant up to a 7B parameter model on devices with sufficient memory. Second, Core AI manages the KV-cache, tokenization, and sampling pipeline internally. You feed it a prompt and receive a streaming AsyncSequence of tokens. The days of manually managing memory buffers for transformer inference on iOS are over.
Hardware requirements are the first major consideration. Core AI requires an A18 Pro or later on iPhone, M3 or later on iPad and Mac. That is a much higher hardware floor than Core ML, which runs on any device that supports iOS 11 or later. If your app targets a broad consumer audience, a significant portion of your users will not have Core AI-capable hardware.
Where Core AI genuinely shines is inference performance for generative workloads. On an M5 chip, the 3B parameter model generates tokens at 45-55 tokens per second. The 7B model runs at 22-28 tokens per second. These numbers are competitive with running the same model sizes on a mid-range desktop GPU. Core AI achieves this by deeply integrating with the Neural Engine's matrix multiplication units and using Apple's custom memory management to keep model weights in unified memory without constant copying between CPU and accelerator address spaces.
The model format is proprietary. Core AI uses .coreai bundles that you either access from Apple's system-provided models or convert from supported formats using Apple's CoreAITools Python package. The conversion pipeline supports HuggingFace Transformers models, GGUF, and SafeTensors as source formats. Quantization to 4-bit and 8-bit precisions happens during conversion, with Apple's custom quantization scheme that preserves accuracy better than naive round-to-nearest on attention layers.
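To make the "naive round-to-nearest" baseline concrete, here is a minimal sketch of symmetric 4-bit round-to-nearest quantization in plain Python. It does not reproduce Apple's scheme, whose internals are not described here; the function names and sample weights are purely illustrative.

```python
def quantize_rtn_4bit(weights):
    """Symmetric round-to-nearest quantization to signed 4-bit integers (-8..7)."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole group; guard against an all-zero group.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the 4-bit integers back to approximate floating-point weights."""
    return [v * scale for v in q]


# Illustrative weights: one large value sets the scale for the whole group.
weights = [0.70, -0.33, 0.12, 0.02]
q, scale = quantize_rtn_4bit(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, restored, max_error)
```

In this toy group the smallest weight rounds to zero because a single large value sets the scale. That is the kind of loss that more careful quantization schemes try to limit.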
The biggest limitation is obvious: Core AI is iOS and macOS only. There is no Android equivalent, no web runtime, no Linux support. If you need your model to run anywhere other than Apple devices, Core AI cannot be your only framework. It is also limited to transformer-based architectures. If you are running a CNN for image classification or a custom audio model, Core AI is not the right tool. That is what Core ML is for.
Core ML: The Workhorse for Traditional ML Models
Core ML has been shipping since iOS 11 and it remains Apple's general-purpose machine learning framework. While Core AI took over the LLM workload, Core ML is still the best choice for vision models, audio classifiers, tabular prediction, and any model architecture that is not a large language model.
The framework supports a wide range of model types: convolutional neural networks, recurrent networks, tree ensembles, support vector machines, and traditional statistical models. Model format is .mlmodel or the newer .mlpackage (both compile to .mlmodelc), and the conversion ecosystem is mature. coremltools converts models from PyTorch, TensorFlow, scikit-learn, and XGBoost; JAX and ONNX models typically need an intermediate step, because recent coremltools releases do not convert them directly. You are unlikely to encounter a popular model architecture that coremltools cannot handle.
Hardware acceleration in Core ML is more flexible than Core AI. The framework dynamically dispatches compute across three backends: the Neural Engine (ANE), the GPU via Metal, and the CPU. You can specify compute unit preferences per model, but in practice, letting Core ML auto-select usually produces the best results. The Neural Engine handles convolutional and matrix operations fastest. The GPU picks up workloads with heavy element-wise operations. The CPU handles operations the other two do not support natively.
Performance for vision tasks is excellent. A MobileNetV3 image classifier runs in 2-4ms on the Neural Engine. A YOLOv8 object detection model processes a frame in 8-12ms, well within the budget for real-time 30fps video analysis. EfficientNet-B0 inference takes 5-7ms. These latencies make Core ML the clear winner for any real-time vision pipeline on iOS.
Memory footprint is where Core ML has a structural advantage over the other frameworks. Vision and audio models are typically 5-50MB. Even large segmentation models rarely exceed 200MB. Compare that to the multi-gigabyte footprint of LLM inference with Core AI or ExecuTorch. If your app runs on older iPhones with 4GB of RAM, Core ML models leave plenty of headroom for the rest of your app.
The model update story is also better with Core ML. You can download an updated model from your own server at runtime, compile it on device with MLModel.compileModel(at:), and load it without requiring a full app update through the App Store. This means you can retrain a classification model weekly and push the updated weights to users within hours. Core AI's system-provided models, by contrast, update only with OS releases.
Core ML's weakness is LLM inference. You can technically convert a transformer model to Core ML format and run it, but the framework lacks streaming token generation, KV-cache management, and the memory optimizations that make large model inference practical. If you tried running a 3B parameter model through Core ML, you would hit memory limits, suffer from poor token throughput (5-8 tokens per second versus Core AI's 45-55), and write significantly more boilerplate code to manage the generation loop. For generative text workloads, Core AI is the right tool on Apple platforms.
ExecuTorch: Meta's Cross-Platform PyTorch Runtime
ExecuTorch is Meta's answer to a problem that has plagued the PyTorch ecosystem for years: taking a model trained in PyTorch and running it efficiently on mobile and edge devices. Unlike TensorFlow Lite, which required converting models to a completely different format and often lost fidelity in translation, ExecuTorch works directly with PyTorch's export system. You train a model in PyTorch, export it with torch.export, and ExecuTorch runs the exported artifact on device.
The cross-platform story is ExecuTorch's defining advantage. The same exported model runs on iOS, Android, embedded Linux, and microcontrollers. On iOS, ExecuTorch delegates to the Core ML backend or the Metal Performance Shaders backend for hardware acceleration. On Android, it delegates to Qualcomm's QNN SDK for Hexagon NPU acceleration, XNNPACK for CPU-optimized inference, or the Vulkan GPU backend. You write one model export pipeline and deploy to every platform your users care about.
Model support is broad. ExecuTorch handles transformer architectures (including Meta's Llama family), convolutional networks, diffusion models, speech models, and custom architectures. If PyTorch can train it and torch.export can capture it, ExecuTorch can run it. This flexibility matters if your team has invested in custom model architectures that do not fit neatly into Apple's supported formats.
Performance varies significantly by platform and backend. On iOS with the Core ML delegate, ExecuTorch's vision model performance is within 10-15% of native Core ML. The gap comes from the additional abstraction layer and data format conversions. For LLM inference on Apple Silicon, ExecuTorch with the Metal backend runs Llama 3 8B at 12-18 tokens per second on M5 hardware. That is notably slower than Core AI's 22-28 tokens per second for a comparable model size, because ExecuTorch does not have the same deep Neural Engine integration.
On Android, ExecuTorch is often the best option available. Running Llama 3.2 1B on a Snapdragon 8 Gen 4 with the QNN delegate produces 25-35 tokens per second. Qualcomm has invested heavily in optimizing the ExecuTorch-QNN path, and it shows. For Android-first or Android-included projects, ExecuTorch is the strongest choice for on-device LLM inference.
The developer experience requires more manual work than Core ML or Core AI. You need to manage model export, choose and configure delegates, handle memory allocation, and build the inference pipeline. There is no drag-and-drop Xcode integration. The upside is control: you can profile exactly where time is spent, swap backends, adjust quantization per layer, and optimize for your specific hardware targets. Teams with ML engineering expertise will find this control valuable. Teams without it will find it burdensome.
Inference Benchmarks and Memory: Head-to-Head Comparison
Abstract framework descriptions are useful, but numbers settle arguments. Here are benchmarks from our testing on current hardware, covering both LLM and vision workloads across all three frameworks.
LLM Inference (3B Parameter Model, 4-bit Quantization)
Testing on iPhone 16 Pro (A18 Pro chip, 8GB RAM):
- Core AI: 42 tokens/sec generation, 1.8GB peak memory, 1.2 sec time-to-first-token
- Core ML (manual transformer pipeline): 7 tokens/sec generation, 2.4GB peak memory, 3.8 sec time-to-first-token
- ExecuTorch (Metal delegate): 18 tokens/sec generation, 2.1GB peak memory, 2.0 sec time-to-first-token
Core AI's advantage on Apple hardware is decisive for generative text. The 6x throughput gap versus Core ML and 2.3x gap versus ExecuTorch come from Core AI's tight Neural Engine integration and optimized KV-cache management. If you are building an LLM-powered feature exclusively for Apple devices that meet Core AI's hardware requirements, it is the obvious default.
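To see what those throughput and time-to-first-token figures mean for a user waiting on a response, the sketch below combines them into an approximate end-to-end time. The 200-token response length and the simple additive model (first-token latency plus steady-state decoding) are our assumptions for illustration; real responses also depend on prompt length and thermal state.

```python
# Figures from the iPhone 16 Pro runs above: (tokens/sec, time-to-first-token in seconds).
RESULTS = {
    "Core AI": (42.0, 1.2),
    "Core ML": (7.0, 3.8),
    "ExecuTorch (Metal)": (18.0, 2.0),
}


def response_seconds(tokens_per_sec, ttft_sec, n_tokens):
    """Approximate wall-clock time: first-token latency plus steady-state decoding."""
    return ttft_sec + n_tokens / tokens_per_sec


def speedup(framework, baseline):
    """Ratio of generation throughput between two frameworks."""
    return RESULTS[framework][0] / RESULTS[baseline][0]


for name, (tps, ttft) in RESULTS.items():
    print(f"{name}: {response_seconds(tps, ttft, 200):.1f} s for a 200-token response")

print(f"Core AI vs Core ML: {speedup('Core AI', 'Core ML'):.1f}x")
print(f"Core AI vs ExecuTorch: {speedup('Core AI', 'ExecuTorch (Metal)'):.1f}x")
```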
Image Classification (MobileNetV3-Large)
Testing on iPhone 16 Pro:
- Core ML: 2.8ms per inference, 12MB memory overhead
- ExecuTorch (Core ML delegate): 3.4ms per inference, 18MB memory overhead
- ExecuTorch (XNNPACK delegate): 6.1ms per inference, 15MB memory overhead
Core AI does not support CNN-based classifiers, so it is not in this comparison. Core ML wins on its home turf, with ExecuTorch's Core ML delegate coming close but adding overhead from the abstraction layer.
Object Detection (YOLOv8-Small)
Testing on iPhone 16 Pro:
- Core ML: 9.2ms per frame, 45MB memory
- ExecuTorch (Core ML delegate): 11.8ms per frame, 52MB memory
Both are comfortably within the 33ms budget for 30fps real-time detection. Core ML is faster, but ExecuTorch gives you the same model running on Android without conversion.
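The frame-budget arithmetic is worth spelling out, because inference is only one of the costs inside each frame. The sketch below computes how much time is left for capture, pre-processing, and rendering; the 60fps case is our own extrapolation, not something we benchmarked.

```python
def frame_budget_ms(fps):
    """Total time available per frame at a given frame rate."""
    return 1000.0 / fps


def headroom_ms(inference_ms, fps):
    """Time left in each frame after inference, for capture, pre/post-processing, and rendering."""
    return frame_budget_ms(fps) - inference_ms


# YOLOv8-Small per-frame latencies from the iPhone 16 Pro runs above.
LATENCY_MS = {"Core ML": 9.2, "ExecuTorch (Core ML delegate)": 11.8}

for name, latency in LATENCY_MS.items():
    print(
        f"{name}: {headroom_ms(latency, 30):.1f} ms spare at 30fps, "
        f"{headroom_ms(latency, 60):.1f} ms spare at 60fps"
    )
```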
Cross-Platform LLM (Llama 3.2 1B, 4-bit Quantization)
Testing across devices:
- ExecuTorch on iPhone 16 Pro (Metal): 32 tokens/sec
- ExecuTorch on Samsung Galaxy S25 Ultra (QNN): 28 tokens/sec
- ExecuTorch on Pixel 9 Pro (XNNPACK CPU): 14 tokens/sec
ExecuTorch is the only framework that gives you these cross-platform numbers from a single model artifact. The performance gap largely reflects the hardware acceleration available: the Metal and QNN runs deliver roughly 2x the throughput of the CPU-only XNNPACK run. Because each result comes from a different device, treat that ratio as indicative rather than a controlled backend comparison.
The memory story matters more than most teams realize. On a device with 6GB of RAM where the OS and other apps consume 3-4GB, your app has perhaps 2GB to work with before the system starts killing background processes. A 3B parameter model in 4-bit quantization needs about 1.5GB for the weights alone (3 billion parameters at half a byte each), and peak memory in the tests above ranged from 1.8GB to 2.4GB depending on the framework. Core AI is the most memory-efficient because it shares model weights with the operating system's own inference workloads. ExecuTorch and Core ML load their own copies into your app's memory space.
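A quick back-of-the-envelope check makes the memory constraint easier to reason about. The sketch below estimates weight storage from parameter count and bit width, then compares the peak figures from the 3B benchmark against the worst case in the 6GB example (4GB already in use). The helper names are ours, and the result ignores any accounting benefit from Core AI's shared weights.

```python
GB = 1e9  # decimal gigabytes, matching the figures quoted above


def weight_gb(n_params, bits_per_weight):
    """Storage for the weights alone, ignoring KV-cache and runtime buffers."""
    return n_params * bits_per_weight / 8 / GB


def fits(peak_gb, device_ram_gb, system_use_gb):
    """True if peak usage stays within what the OS and other apps leave free."""
    return peak_gb <= device_ram_gb - system_use_gb


# Peak memory from the 3B benchmark on iPhone 16 Pro.
PEAK_GB = {"Core AI": 1.8, "Core ML": 2.4, "ExecuTorch (Metal)": 2.1}

weights = weight_gb(3e9, 4)
for name, peak in PEAK_GB.items():
    overhead = peak - weights
    ok = fits(peak, device_ram_gb=6, system_use_gb=4)
    print(f"{name}: {overhead:.1f} GB beyond weights, fits in a 2 GB budget: {ok}")
```

Under these assumptions only the Core AI figure fits a 2GB budget, which is why the framework choice can decide whether a feature ships on mid-tier devices at all.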
Model Conversion Workflows and Developer Experience
Getting a model from your training pipeline to running on device is where the practical friction lives. Each framework has a different conversion workflow, and the quality of that pipeline directly affects your development velocity.
Core AI Conversion
Apple's CoreAITools Python package handles conversion from HuggingFace Transformers, GGUF, and SafeTensors formats. The pipeline validates architecture compatibility, applies Apple's custom quantization, and outputs a .coreai bundle. Conversion for a 3B parameter model takes 15-30 minutes on an M-series Mac. The tooling is opinionated: it works extremely well for supported architectures (Llama-style decoders, encoder-decoder transformers) and fails entirely for unsupported ones. If your model architecture is not in Apple's compatibility list, you cannot use Core AI. Period.
Core ML Conversion
coremltools is the most mature conversion ecosystem of the three. It supports PyTorch (via TorchScript and torch.export), TensorFlow, scikit-learn, XGBoost, and LibSVM, while JAX and ONNX models typically need an intermediate step first. The flexible conversion pipeline lets you specify compute precision, input shapes, and optimization hints. Conversion failures are rare for standard architectures and well-documented when they occur. The community is large, Stack Overflow has thousands of solved coremltools issues, and Apple's documentation is genuinely helpful.
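As a rough illustration of that pipeline, the sketch below traces a PyTorch module and converts it with coremltools, setting the input shape, compute precision, and compute units explicitly. It assumes torch and coremltools are installed; the function name, input name, and output path are placeholders of ours. The imports sit inside the function so the file still loads on machines without the conversion toolchain.

```python
def convert_to_coreml(torch_model, example_input, out_path="Classifier.mlpackage"):
    """Trace a PyTorch module and convert it to a Core ML ML Program (sketch)."""
    # Imported lazily: both packages are heavy and only needed at conversion time.
    import torch
    import coremltools as ct

    torch_model.eval()
    traced = torch.jit.trace(torch_model, example_input)

    mlmodel = ct.convert(
        traced,
        # Fixed input shape taken from the example tensor; "input" is a placeholder name.
        inputs=[ct.TensorType(name="input", shape=example_input.shape)],
        convert_to="mlprogram",
        compute_precision=ct.precision.FLOAT16,
        compute_units=ct.ComputeUnit.ALL,
    )
    # ML Programs are saved as .mlpackage directories rather than single .mlmodel files.
    mlmodel.save(out_path)
    return out_path
```

Calling it with a trained model and a representative input tensor, for example convert_to_coreml(model, torch.rand(1, 3, 224, 224)), would write the package to disk. The 224x224 input is only an assumed example shape.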
One significant advantage: Xcode integrates Core ML model previews directly. You can drag a .mlmodel file into your project, preview input/output shapes, test inference with sample data, and profile performance, all without writing code. This drastically reduces the iteration cycle when experimenting with different model architectures or quantization levels.
ExecuTorch Export
ExecuTorch uses PyTorch's native export system (torch.export) followed by ExecuTorch-specific lowering passes that optimize the model for mobile execution. The workflow is: train in PyTorch, export with torch.export.export(), lower with ExecuTorch's to_edge() and to_executorch() APIs, then deploy the .pte artifact.
The advantage is that you stay in the PyTorch ecosystem throughout. If your ML team already trains in PyTorch, the export step is a natural extension of the existing workflow. The disadvantage is that torch.export has compatibility constraints. Not every PyTorch operation is exportable, and dynamic control flow (if statements that depend on tensor values) requires careful refactoring. Meta publishes a compatibility guide, but in practice, expect to spend time debugging export failures for complex custom architectures.
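The export workflow described above can be sketched in a few lines. This version assumes torch and executorch are installed and deliberately skips delegate selection, so it produces a portable, undelegated program; targeting Core ML, XNNPACK, or QNN would add a partitioning step that is omitted here. The function name and output path are placeholders.

```python
def export_to_pte(model, example_inputs, out_path="model.pte"):
    """Export a PyTorch module to an ExecuTorch .pte file (sketch, no delegates)."""
    # Imported lazily so this file loads even where the export toolchain is absent.
    import torch
    from executorch.exir import to_edge

    model.eval()
    # example_inputs must be a tuple of tensors matching the model's forward() arguments.
    exported = torch.export.export(model, example_inputs)
    edge_program = to_edge(exported)                   # lower to the Edge dialect
    executorch_program = edge_program.to_executorch()  # produce the runtime program

    with open(out_path, "wb") as f:
        f.write(executorch_program.buffer)
    return out_path
```

When the export step fails, it fails at torch.export.export(), which is where the dynamic control flow issues mentioned above tend to surface.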
Platform Support Matrix
Here is the practical compatibility picture:
- Core AI: iOS 19+, macOS 16+, visionOS 3+. No Android, no web, no embedded.
- Core ML: iOS 11+, macOS 10.13+, watchOS 4+, tvOS 11+, visionOS 1+. No Android, no web.
- ExecuTorch: iOS 15+, Android 10+, embedded Linux (ARM64), select microcontrollers. No web, no watchOS.
If you are building an app that needs to run inference on both iOS and Android, ExecuTorch is your only option from this list that avoids maintaining two completely separate model pipelines. If you are Apple-only and working with traditional ML models, Core ML's tooling and Xcode integration make it the fastest path to production. If you are Apple-only and working with LLMs, Core AI eliminates an enormous amount of infrastructure code. For a broader look at the on-device versus cloud tradeoff that informs these framework choices, our on-device AI vs cloud AI comparison covers the strategic considerations.
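If you want to sanity-check a target list against the matrix above, a small lookup is enough. The sketch below encodes the matrix as sets and returns the frameworks that cover every platform you name. The platform keys are our own labels, and minimum OS versions are left out for brevity.

```python
# Platform coverage taken from the support matrix above (minimum OS versions omitted).
SUPPORT = {
    "Core AI": {"ios", "macos", "visionos"},
    "Core ML": {"ios", "macos", "watchos", "tvos", "visionos"},
    "ExecuTorch": {"ios", "android", "embedded-linux", "microcontroller"},
}


def frameworks_covering(targets):
    """Return the frameworks that support every platform in targets, sorted by name."""
    wanted = set(targets)
    return sorted(name for name, platforms in SUPPORT.items() if wanted <= platforms)


print(frameworks_covering({"ios", "android"}))
print(frameworks_covering({"ios", "watchos"}))
print(frameworks_covering({"ios"}))
```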
Privacy, Offline Capability, and Use Case Matching
All three frameworks run inference entirely on device, which means all three deliver the same fundamental privacy benefit: the data you run inference on never has to leave the device. But the privacy story has nuances worth understanding.
Core AI's system-provided models are the strongest privacy option because Apple manages the entire model lifecycle. There is no model download from your servers, no telemetry about inference requests, and no way for your app code to exfiltrate model weights. App Store privacy nutrition labels are self-reported, so using Core AI does not fill them in for you, but keeping inference on-device means there is less data collection to declare. If your app handles health data under HIPAA or financial data under PCI-DSS, Core AI can simplify your compliance documentation considerably.
Core ML models that you bundle with your app carry a different privacy posture. The inference is on-device, but you are responsible for how the model was trained. If your training data contained personal information, you have data provenance obligations even though inference is local. You also need to ensure that your model file itself does not inadvertently memorize sensitive training data, a known risk with large models.
ExecuTorch's privacy story depends on your deployment model. If you bundle the .pte file with your app, the analysis is similar to Core ML. If you download models from your server at runtime (which ExecuTorch supports), you need to consider the network request itself: the server sees which model the user is downloading, which can leak information about their device capabilities or feature usage patterns. Encrypt model downloads and minimize metadata in those requests.
Offline capability is identical across all three: every framework runs inference without a network connection once the model is on device. The difference is model availability. Core AI models are pre-installed with the OS. Core ML models ship in your app bundle or via On-Demand Resources. ExecuTorch models can be bundled or downloaded. If guaranteed offline support from first launch is critical (field service apps, emergency response tools, military applications), bundling the model is the safest approach regardless of framework.
When to Use Each Framework
After testing all three extensively, here is our practical decision tree:
- Use Core AI when you are building an Apple-only app that needs LLM capabilities (text generation, summarization, conversational AI, document analysis). The performance advantage is too large to ignore, and the API surface eliminates weeks of infrastructure code.
- Use Core ML when you are building an Apple-only app with vision, audio, or traditional ML workloads (image classification, object detection, speech recognition, pose estimation, recommendation engines). Also use Core ML when you need to support older devices that do not meet Core AI's hardware requirements.
- Use ExecuTorch when you need the same model running on both iOS and Android, when you have custom PyTorch architectures that do not convert cleanly to Apple formats, or when you need fine-grained control over the inference pipeline for performance optimization. ExecuTorch is also the right choice if your ML team is PyTorch-native and you want to minimize the gap between training and deployment.
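The decision tree above can be written down as a small function, which is handy for documenting the choice in an architecture review. This is a simplified sketch: the field names are ours, the workload is reduced to "llm" versus "traditional", and the older-device case is left out because the right fallback depends on the workload.

```python
from dataclasses import dataclass
from typing import FrozenSet

APPLE_PLATFORMS = frozenset({"ios", "macos", "watchos", "tvos", "visionos"})


@dataclass(frozen=True)
class Project:
    platforms: FrozenSet[str]
    workload: str                      # "llm" or "traditional" (vision, audio, tabular)
    custom_pytorch_arch: bool = False  # architecture does not convert cleanly to Apple formats
    needs_pipeline_control: bool = False


def recommend(project: Project) -> str:
    """Map a project description to the primary framework suggested by the list above."""
    apple_only = project.platforms <= APPLE_PLATFORMS
    if not apple_only or project.custom_pytorch_arch or project.needs_pipeline_control:
        return "ExecuTorch"
    if project.workload == "llm":
        return "Core AI"
    return "Core ML"


print(recommend(Project(frozenset({"ios"}), "llm")))
print(recommend(Project(frozenset({"ios", "android"}), "llm")))
print(recommend(Project(frozenset({"ios", "watchos"}), "traditional")))
```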
Many production apps will use more than one framework. A cross-platform messaging app might use ExecuTorch for its custom language model on both platforms while also using Core ML for real-time camera effects on iOS. An Apple-only health app might use Core AI for generating patient-friendly summaries of medical data while using Core ML for real-time sensor classification from Apple Watch. Mixing frameworks within a single app is not just acceptable. It is often the optimal architecture. If you are earlier in the process and still deciding whether to build on-device AI features at all, our guide on how to build an on-device AI mobile app walks through the full planning process.
Hybrid Architectures and Choosing Your Path Forward
The most capable on-device AI apps in production today do not rely on a single framework. They combine frameworks strategically, matching each inference task to the runtime that handles it best. Here is how to architect that effectively.
The pattern that works is an abstraction layer that routes inference requests. Define a protocol (or interface, in cross-platform terms) for each AI capability your app needs: TextGeneration, ImageClassification, ObjectDetection, SpeechRecognition. Behind each protocol, implement platform-specific executors. On iOS, your TextGeneration implementation uses Core AI. Your ImageClassification implementation uses Core ML. On Android, both use ExecuTorch with appropriate delegates. Your application code never references a specific framework directly.
This architecture gives you three practical benefits. First, you can benchmark frameworks against each other on real workloads by swapping implementations behind the protocol. Second, you can fall back gracefully when a user's device does not support a particular framework (Core AI unavailable on older iPhones, for example, falls back to an ExecuTorch implementation or a cloud API call). Third, you can upgrade to newer frameworks as they ship without rewriting your application logic.
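Here is a minimal sketch of that routing layer, with stub executors standing in for the real backends. On iOS you would write this as a Swift protocol. We use Python here only to show the shape of the design, and every class and method name is hypothetical.

```python
from typing import Iterator, List, Protocol


class TextGeneration(Protocol):
    """Capability interface; application code depends only on this."""

    name: str

    def is_available(self) -> bool: ...

    def generate(self, prompt: str) -> Iterator[str]: ...


class StubExecutor:
    """Placeholder for a real backend (Core AI, ExecuTorch, or a cloud API)."""

    def __init__(self, name: str, available: bool) -> None:
        self.name = name
        self._available = available

    def is_available(self) -> bool:
        return self._available

    def generate(self, prompt: str) -> Iterator[str]:
        # A real executor would stream model tokens; this just echoes the prompt words.
        for word in prompt.split():
            yield word


class TextGenerationRouter:
    """Tries executors in priority order and uses the first one the device supports."""

    def __init__(self, executors: List[TextGeneration]) -> None:
        self._executors = executors

    def select(self) -> TextGeneration:
        for executor in self._executors:
            if executor.is_available():
                return executor
        raise RuntimeError("no text generation backend available")

    def generate(self, prompt: str) -> Iterator[str]:
        return self.select().generate(prompt)


# Simulate an older iPhone: Core AI is unavailable, so the router falls back.
router = TextGenerationRouter([
    StubExecutor("core-ai", available=False),
    StubExecutor("executorch", available=True),
    StubExecutor("cloud", available=True),
])
print(router.select().name)
```

Because the priority order is just a list, swapping or benchmarking backends means reordering it rather than touching application code.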
Model management is the operational challenge that most teams underestimate. If your app uses a Core AI model for text, a Core ML model for vision, and an ExecuTorch model for a custom audio classifier, you are managing three different model formats, three different conversion pipelines, and three different update mechanisms. Build a model registry in your CI/CD pipeline that tracks model versions, conversion configs, and target platforms. Automate the conversion step so that a new model checkpoint in your training pipeline automatically produces artifacts for each framework you deploy to.
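A registry does not need to be elaborate to be useful. The sketch below keeps one record per model version and plans one artifact per target framework, using the file extensions mentioned in this article. The class names and naming scheme are ours, and a real pipeline would invoke the actual converters where this only computes file names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# File extensions named in this article for each framework's artifact.
EXTENSIONS = {"Core AI": ".coreai", "Core ML": ".mlmodel", "ExecuTorch": ".pte"}


@dataclass
class ModelRecord:
    name: str
    version: str                       # dotted numeric version, e.g. "1.10.0"
    conversion_config: Dict[str, str]
    targets: Dict[str, List[str]]      # framework -> platforms it is deployed to
    artifacts: Dict[str, str] = field(default_factory=dict)


class ModelRegistry:
    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], ModelRecord] = {}

    def register(self, record: ModelRecord) -> ModelRecord:
        # Plan one artifact per target framework; a CI job would run the real converters.
        for framework in record.targets:
            extension = EXTENSIONS[framework]
            record.artifacts[framework] = f"{record.name}-{record.version}{extension}"
        self._records[(record.name, record.version)] = record
        return record

    def latest(self, name: str) -> ModelRecord:
        # Compare versions numerically so that 1.10.0 sorts after 1.2.0.
        candidates = [r for (n, _), r in self._records.items() if n == name]
        return max(candidates, key=lambda r: tuple(int(p) for p in r.version.split(".")))


registry = ModelRegistry()
registry.register(ModelRecord(
    "audio-classifier", "1.2.0", {"quantization": "int8"},
    {"ExecuTorch": ["ios", "android"]},
))
registry.register(ModelRecord(
    "audio-classifier", "1.10.0", {"quantization": "int8"},
    {"ExecuTorch": ["ios", "android"], "Core ML": ["ios"]},
))
print(registry.latest("audio-classifier").artifacts)
```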
Testing deserves special attention. On-device inference is deterministic for a given input on a given device, but results vary across devices due to different hardware backends and numerical precision. A classification model might return 0.92 confidence on an A18 Pro and 0.89 on an A16 for the same input because the Neural Engine versions handle floating-point arithmetic differently. Build your test suite to validate against accuracy ranges rather than exact values, and test on the oldest supported hardware, not just the newest.
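In practice, that means asserting on the predicted label and a tolerance band rather than an exact score. The sketch below uses the confidence values from the example above; the 0.05 tolerance and the helper name are assumptions you would tune per model.

```python
import math


def check_prediction(scores, expected_label, reference_confidence, abs_tol=0.05):
    """Pass if the top label matches and its confidence is within abs_tol of the reference."""
    top_label = max(scores, key=scores.get)
    close_enough = math.isclose(
        scores[top_label], reference_confidence, rel_tol=0.0, abs_tol=abs_tol
    )
    return top_label == expected_label and close_enough


# Same input, two devices: the scores differ slightly across hardware backends.
a18_pro_scores = {"cat": 0.92, "dog": 0.08}
a16_scores = {"cat": 0.89, "dog": 0.11}

print(check_prediction(a18_pro_scores, "cat", reference_confidence=0.92))
print(check_prediction(a16_scores, "cat", reference_confidence=0.92))
print(a16_scores["cat"] == 0.92)  # an exact-match assertion would fail on the A16
```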
Looking ahead, the framework landscape will continue evolving. Apple will likely expand Core AI to cover more model architectures beyond transformers. Meta is investing heavily in ExecuTorch's performance on Qualcomm and MediaTek hardware. Core ML will remain the stable workhorse for traditional ML on Apple platforms. The teams that build with a clean abstraction layer today will be best positioned to adopt improvements from any framework without painful re-architecture.
If you are planning an on-device AI project and want help navigating the framework decision for your specific use case, we have built across all three frameworks and can help you avoid the expensive wrong turns. Book a free strategy call and we will walk through your architecture options together. For more context on how Apple's broader AI ecosystem fits into this picture, our Apple Intelligence SDK guide covers the full stack beyond just inference frameworks.