Apple Core AI Framework: On-Device LLMs for iOS Apps Guide

Apple's Core AI framework gives iOS developers direct access to on-device LLM inference optimized for Apple Silicon. Here is how it works, how it differs from Core ML, and how to build with it.

Nate Laquis

Founder & CEO

What Core AI Actually Is and Why Apple Built It

Core AI is Apple's dedicated framework for running large language models directly on device. It shipped with iOS 20 and macOS 17, and it solves a problem that Core ML was never designed for: efficient, streaming LLM inference with conversational context management. If you have tried running transformer-based language models through Core ML, you know the pain. Core ML treats every model as a stateless function: input goes in, output comes out. That works for image classifiers and object detectors. It does not work for autoregressive text generation where you need to manage KV caches, handle variable-length sequences, and stream tokens one at a time.

Core AI was built specifically for this. It provides a high-level Swift API that handles tokenization, KV cache management, sampling strategies (temperature, top-p, repetition penalties), and streaming output through AsyncSequence. Under the hood, it runs inference on Apple's Neural Engine with automatic fallback to GPU and CPU when the Neural Engine is saturated.

The practical result: you can ship an app that runs a 3B parameter language model on an iPhone 16 Pro with first-token latency under 200ms and sustained generation at 25-35 tokens per second. No cloud calls. No API keys. No per-token billing. The model ships as part of the OS (for Apple's foundation models) or as part of your app bundle (for custom models you have converted).

Apple built Core AI because the on-device AI landscape was fragmenting. Developers were pulling in ExecuTorch, ONNX Runtime, or llama.cpp to run language models on iOS, and each option required wrestling with Metal shader compilation, memory management, and tokenizer integration. Core AI consolidates all of that into a first-party framework optimized for Apple hardware, integrated with the system scheduler, and covered by Apple's standard API stability guarantees.

For teams already building with Apple Intelligence, Core AI is the layer underneath. Apple Intelligence features like Writing Tools and Siri use Core AI internally. The difference is that Core AI gives you direct model access, while Apple Intelligence gives you pre-built features. If you want summarization with default settings, use Apple Intelligence. If you want to control the prompt, adjust generation parameters, or run your own fine-tuned model, Core AI is what you need.

Core AI vs Core ML: When to Use Which

The most common question developers ask is whether Core AI replaces Core ML. It does not. They solve different problems, and most production apps will use both.

Core ML is a general-purpose inference engine for any machine learning model. It handles CNNs for image classification, random forests for tabular data, and transformer encoders for embeddings. Core ML models are stateless: you feed in an input tensor, get an output tensor. The framework does not manage conversation history, streaming output, or autoregressive decoding. It is excellent for tasks with fixed input/output shapes.

Core AI is purpose-built for generative language models. It handles the entire autoregressive generation pipeline: tokenization, prompt encoding, KV cache management, token-by-token generation with configurable sampling, and streaming output. It understands conversation structure (system prompts, user turns, assistant turns) and manages context windows automatically.

Think of it this way. If your model produces a classification label, a bounding box, an embedding vector, or a regression score, use Core ML. If your model generates text token by token, use Core AI. Some architectures (vision-language models, multimodal models) use both: Core ML processes the image encoder and Core AI handles the language decoder.

The two frameworks also manage hardware resources differently. Core ML dispatches requests to the Neural Engine, GPU, or CPU based on a per-layer plan computed at compile time. Core AI reserves a dedicated Neural Engine execution context that persists across generation steps, avoiding the overhead of re-dispatching on every token. The persistent context also keeps KV cache tensors resident in the Neural Engine's local memory rather than shuttling them through main memory on each step.

One practical consequence: Core AI and Core ML compete for Neural Engine bandwidth. If you run a Core ML vision model for camera processing while simultaneously running a Core AI language model, one gets bumped to GPU execution. Stagger your workloads, or accept GPU-speed inference for the lower-priority task.

Supported Model Formats, Sizes, and Device Compatibility

Core AI supports two model format categories: Apple's own foundation models that ship with the OS, and custom models that you convert and bundle with your app.

Apple's built-in models include a 3B parameter general-purpose language model (the same one powering Apple Intelligence features), a 1.5B parameter "fast" model optimized for low-latency tasks like autocomplete and entity extraction, and specialized models for code, translation, and summarization. These models live on the device filesystem, are shared across all apps, and get updated with OS updates. You do not need to bundle them in your app, which keeps your app size manageable.

For custom models, Core AI uses the .coreaimodel format: a compiled package containing model weights, tokenizer configuration, generation parameters, and optional LoRA adapter weights. You create .coreaimodel files using the Core AI Model Converter in Xcode, which accepts PyTorch (.pt, .bin), Hugging Face Safetensors (.safetensors), and GGUF (.gguf) formats. The converter handles quantization (4-bit, 8-bit, mixed precision), layer fusion, and KV cache optimization automatically.

Model size limits depend on the target device. iPhone 16 Pro and later (with 8GB unified memory) can run models up to approximately 4B parameters in 4-bit quantization, which occupies about 2GB of memory for the weights alone (4 billion parameters at half a byte each), with KV cache overhead on top. The standard iPhone 16 (6GB RAM) handles models up to about 2.5B parameters comfortably. iPad Pro with M-series chips (16GB+ RAM) can run models up to 8B parameters in 4-bit quantization, though generation speed drops to 10-15 tokens per second at that size.
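To sanity-check whether a model is likely to fit on a given device, a back-of-the-envelope estimate is usually enough. The sketch below is our own arithmetic, not a Core AI API: the layer count, KV head count, and head dimension are illustrative assumptions for a 4B-class model, and the estimate ignores quantization scales and runtime buffers.

```swift
import Foundation

/// Illustrative shape of a decoder-only model. Every value used below is an assumption.
struct ModelShape {
    let parameters: Double   // total parameter count
    let layers: Int          // transformer layers
    let kvHeads: Int         // key/value heads
    let headDim: Int         // dimension per head
}

/// Weight memory in bytes: parameters x bits per weight / 8.
func weightBytes(parameters: Double, bitsPerWeight: Double) -> Double {
    parameters * bitsPerWeight / 8.0
}

/// KV cache in bytes: 2 (keys and values) x layers x kvHeads x headDim x tokens x bytes per element.
func kvCacheBytes(shape: ModelShape, contextTokens: Int, bytesPerElement: Int = 2) -> Double {
    Double(2 * shape.layers * shape.kvHeads * shape.headDim * contextTokens * bytesPerElement)
}

let gib = 1_073_741_824.0
let shape = ModelShape(parameters: 4e9, layers: 32, kvHeads: 8, headDim: 128)
let weights = weightBytes(parameters: shape.parameters, bitsPerWeight: 4)
let cache = kvCacheBytes(shape: shape, contextTokens: 4096)
print(String(format: "weights: %.2f GiB, KV cache: %.2f GiB", weights / gib, cache / gib))
```

With these assumed numbers the weights come to roughly 1.86 GiB and a 4,096-token KV cache adds about 0.5 GiB, which is why context length matters almost as much as parameter count when you budget memory on a phone.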

Device compatibility is strict. Core AI requires an A17 Pro or later for iPhone, or M1 or later for iPad and Mac. There is no software fallback to CPU-only execution. Apple made this decision deliberately: LLMs on older hardware produce such poor performance that it would damage user perception of on-device AI. If your app targets devices older than iPhone 15 Pro, you need a fallback path via cloud API or feature gating.

For the larger models (7B+), you realistically need an iPad Pro or Mac. iPhone 16 Pro can technically load a 7B model with aggressive quantization, but generation speeds drop below 5 tokens per second and the system will aggressively terminate your app under memory pressure. Our recommendation: target 1.5B to 3B parameter models for iPhone, reserve larger models for iPad and Mac, and build your UX around those constraints from the start.

Privacy and Latency: The Two Killer Advantages

On-device LLM inference through Core AI provides two advantages that no cloud API can match: absolute data privacy and consistent low latency. For certain categories of apps, these are the entire reason to choose on-device over cloud.

Privacy first. When you run inference through Core AI, user data never leaves the device. There is no network request, no server log, no third-party data processing agreement to negotiate. The text your user types into a medical symptom checker, the financial data they ask an AI assistant to analyze, the personal journal entries they want summarized: all of it stays on their hardware. For apps in healthcare (HIPAA), finance (SOC 2, PCI), legal (attorney-client privilege), and education (COPPA, FERPA), this is transformative. You can ship AI features with far less additional regulatory burden because the inference path never sends data off the device.

Apple reinforces this architecturally. Core AI runs in a sandboxed process with no network access. Even if your app has network entitlements, the inference engine itself cannot make network calls. This is a level of privacy assurance that no cloud-based or self-hosted inference solution can match.

Latency is the second advantage. Cloud LLM APIs typically have 500ms to 2s of first-token latency due to network round trip and queue wait. Core AI delivers first-token latency of 100-200ms on iPhone 16 Pro hardware. For interactive features like inline text suggestions, real-time entity extraction, or conversational assistants, that difference is the gap between an experience that feels instant and one that feels sluggish.

Sustained generation speed matters too. Core AI generates 25-35 tokens per second for a 3B model on A18 Pro hardware. At a rough 0.75 English words per token, that is about 1,100-1,600 words per minute, several times faster than most people read. Compare that to cloud APIs during peak hours, where generation can slow to 10-15 tokens per second due to server load.
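If you want to translate token rates into numbers a product team can reason about, the conversion is simple. This is a small sketch of our own: the 0.75 words-per-token ratio is a common rule of thumb for English and is an assumption, not something Core AI reports.

```swift
/// Converts a token rate into approximate words per minute.
/// `wordsPerToken` is a rule-of-thumb assumption for English text.
func wordsPerMinute(tokensPerSecond: Double, wordsPerToken: Double = 0.75) -> Double {
    tokensPerSecond * 60.0 * wordsPerToken
}

/// Time to deliver a full response: first-token latency plus the
/// remaining tokens at the steady-state generation rate.
func responseSeconds(tokens: Int, firstTokenLatency: Double, tokensPerSecond: Double) -> Double {
    firstTokenLatency + Double(max(tokens - 1, 0)) / tokensPerSecond
}

print(wordsPerMinute(tokensPerSecond: 25))   // low end of the quoted range
print(wordsPerMinute(tokensPerSecond: 35))   // high end of the quoted range
print(responseSeconds(tokens: 151, firstTokenLatency: 0.2, tokensPerSecond: 30))
```

The second function is the more useful one in practice: it tells you how long a user waits for a complete answer of a given length, which is the number to design your streaming UI around.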

There is also the offline angle. Core AI works without any network connection. Your app's AI features function identically on a plane, in a subway tunnel, or in a rural area with no cell signal. For field service apps, travel apps, and any app used in connectivity-constrained environments, offline capability is a hard requirement. If you are evaluating the broader tradeoffs, our on-device AI vs cloud AI comparison covers the full decision framework.

Use Cases and Swift API Patterns for Production Apps

Let's get into what you can actually build with Core AI and how the Swift API works in practice. These are patterns we have implemented in production apps, not theoretical examples.

Summarization is the most straightforward use case. You create a CoreAISession with a model identifier, configure generation parameters (max tokens, temperature), submit a prompt with the source text and summarization instruction, and iterate over the resulting AsyncSequence of tokens. If your source text exceeds the context limit, the framework automatically applies chunked summarization.

Entity extraction is where Core AI outperforms regex dramatically. Instead of brittle pattern-matching rules, you prompt the model to extract structured data from unstructured text. "Extract all company names, dollar amounts, and dates from this email" produces JSON-formatted output you parse into Swift structs. For best results, use Apple's 1.5B fast model for extraction. It is 2x faster than the 3B model and equally accurate for structured tasks.
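The parsing half of this pattern does not depend on any model API, so it can be sketched with Foundation alone. The struct and its field names are hypothetical and chosen only for illustration. Because small models sometimes wrap JSON in extra prose, this version slices from the first opening brace to the last closing brace before decoding.

```swift
import Foundation

/// Hypothetical shape for the extraction result. Field names are illustrative.
struct ExtractedEntities: Codable, Equatable {
    let companies: [String]
    let amounts: [String]
    let dates: [String]
}

/// Pulls the JSON object out of raw model output and decodes it.
/// Returns nil when no well-formed object with the expected keys is found.
func parseEntities(from modelOutput: String) -> ExtractedEntities? {
    guard let start = modelOutput.firstIndex(of: "{"),
          let end = modelOutput.lastIndex(of: "}"),
          start < end else { return nil }
    let json = String(modelOutput[start...end])
    return try? JSONDecoder().decode(ExtractedEntities.self, from: Data(json.utf8))
}

// Example output text, written by hand for this sketch.
let raw = """
Here is the result:
{"companies": ["Acme Corp"], "amounts": ["$1,200"], "dates": ["2025-03-14"]}
"""

if let entities = parseEntities(from: raw) {
    print(entities.companies, entities.amounts, entities.dates)
} else {
    print("Could not parse; retry the prompt or fall back to manual entry")
}
```

Treat a nil result as a normal outcome rather than an exception. A retry with a stricter prompt, or a graceful fallback in the UI, is cheaper than trying to repair malformed JSON.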

Conversational assistants are the highest-value use case. Core AI manages multi-turn history automatically through its CoreAIConversation API. You append user messages, the framework generates responses, and conversation context persists across turns with KV cache reuse. Subsequent turns are faster than the first because the model skips re-encoding prior history. First-turn latency runs 150-200ms, subsequent turns 50-100ms.
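To make the idea of conversation structure and context budgeting concrete, here is a conceptual sketch in plain Swift. It is not how Core AI implements history management and it does not use the CoreAIConversation API. The types, the four-characters-per-token estimate, and the drop-oldest policy are all assumptions for illustration.

```swift
enum Role { case system, user, assistant }

struct Turn {
    let role: Role
    let text: String
}

/// Crude token estimate (assumption: roughly four characters per token).
func estimatedTokens(_ text: String) -> Int {
    max(1, text.count / 4)
}

struct Conversation {
    private(set) var turns: [Turn] = []
    let contextLimit: Int

    init(systemPrompt: String, contextLimit: Int) {
        self.contextLimit = contextLimit
        turns = [Turn(role: .system, text: systemPrompt)]
    }

    /// Estimated tokens across the whole history.
    var tokenCount: Int {
        turns.reduce(0) { $0 + estimatedTokens($1.text) }
    }

    /// Appends a turn, then drops the oldest non-system turns until the
    /// history fits the budget. The system prompt and newest turn are kept.
    mutating func append(_ role: Role, _ text: String) {
        turns.append(Turn(role: role, text: text))
        while tokenCount > contextLimit, turns.count > 2 {
            turns.remove(at: 1) // index 0 is the system prompt
        }
    }
}
```

Even if the framework handles this for you, it helps to keep a model like this in mind when you decide how long a conversation your UX should allow before offering a fresh session.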

Code generation is surprisingly capable with the 3B model for common patterns. We have built an app that generates SwiftUI view code from natural language descriptions, producing correct, compilable code about 70% of the time for standard layouts. For a domain-specific coding assistant (SQL queries, API requests, data models), a fine-tuned 3B model on-device replaces cloud API calls for the majority of common queries.

One pattern that works exceptionally well: hybrid classification plus generation. Use a lightweight Core ML classifier (under 10MB) to categorize the user's input, then route to a specific Core AI prompt template based on the classification. A customer support app classifies whether the user asks about billing, technical issues, or account management, then generates a response with targeted domain context. This two-stage approach beats a single generic prompt every time.
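The routing half of this two-stage pattern can be sketched without either framework. In the version below, the keyword-based classify function is only a stand-in so the example runs; in a real app that step would be the Core ML classifier. The category names and template wording are hypothetical.

```swift
import Foundation

enum SupportCategory: String {
    case billing, technical, account
}

/// Stand-in for the lightweight classifier. A keyword heuristic is used
/// here only so the sketch is runnable without a trained model.
func classify(_ query: String) -> SupportCategory {
    let q = query.lowercased()
    if q.contains("invoice") || q.contains("charge") || q.contains("refund") { return .billing }
    if q.contains("password") || q.contains("login") || q.contains("email") { return .account }
    return .technical
}

/// Builds a category-specific prompt. The domain context is illustrative.
func promptTemplate(for category: SupportCategory, query: String) -> String {
    let context: String
    switch category {
    case .billing:
        context = "You are a billing specialist. Refund requests are accepted within 30 days."
    case .technical:
        context = "You are a technical support engineer. Ask for app version and device model."
    case .account:
        context = "You are an account specialist. Never ask the customer for a password."
    }
    return "\(context)\n\nCustomer question: \(query)"
}

let query = "I was charged twice and need a refund"
let category = classify(query)
print(category.rawValue)
print(promptTemplate(for: category, query: query))
```

The design benefit is that each template stays short and specific, which matters on a small model where every token of context competes with the user's actual question.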

Hybrid On-Device and Cloud Fallback Architecture

No responsible engineering team ships a production app that depends entirely on on-device inference without a fallback strategy. Device capabilities vary. Some queries exceed what a 3B model can handle. Users on older hardware need the feature too. The right architecture is hybrid: on-device by default, cloud when necessary.

The pattern we recommend is a capability-aware routing layer. At app launch, you query CoreAISession.availableModels() to determine what the device supports. You check available memory, Neural Engine capability, and thermal state. Based on these signals, you configure a routing policy: queries below a complexity threshold run on-device, everything else routes to your cloud backend. The complexity threshold is something you tune empirically. For a summarization feature, "summarize this 500-word email" runs on-device while "analyze this 50-page contract and identify all liability clauses" routes to a cloud model with a larger context window.
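A routing policy like this can be expressed as a small pure function, which also makes it easy to unit test. The sketch below is an assumption-heavy illustration: DeviceSnapshot, the threshold values, and the use of prompt length as the complexity signal are all our own placeholders, and the snapshot would be filled in from whatever capability checks your app performs at launch.

```swift
/// Signals gathered at launch. How each is obtained is app-specific;
/// the fields here are hypothetical.
struct DeviceSnapshot {
    let supportsOnDevice: Bool
    let availableMemoryMB: Int
    let networkReachable: Bool
}

enum Route: Equatable {
    case onDevice
    case cloud
    case unavailable   // no viable path: gate the feature in the UI
}

struct RoutingPolicy {
    // Thresholds are placeholders to be tuned empirically.
    var maxOnDevicePromptTokens = 2_000
    var minFreeMemoryMB = 1_500

    func route(promptTokens: Int, device: DeviceSnapshot) -> Route {
        let fitsOnDevice = device.supportsOnDevice
            && device.availableMemoryMB >= minFreeMemoryMB
            && promptTokens <= maxOnDevicePromptTokens
        if fitsOnDevice { return .onDevice }
        return device.networkReachable ? .cloud : .unavailable
    }
}

let policy = RoutingPolicy()
let phone = DeviceSnapshot(supportsOnDevice: true, availableMemoryMB: 3_000, networkReachable: true)
print(policy.route(promptTokens: 700, device: phone))     // short email
print(policy.route(promptTokens: 60_000, device: phone))  // long contract
```

Prompt length is a blunt proxy for complexity. It is a reasonable starting point, and you can layer task type or a classifier score on top once you have telemetry.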

Implementing the fallback cleanly requires a protocol abstraction. Define an LLMProvider protocol with methods like generate(prompt:parameters:) returning an AsyncSequence of tokens. Create two conforming implementations: CoreAIProvider for on-device and CloudLLMProvider for your cloud backend. Your feature code programs against the protocol and never knows which implementation handles the request. A router decides at runtime based on device capability, query complexity, and network availability.
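Here is a minimal sketch of that abstraction. The protocol name and method shape follow the description above, but GenerationParameters, StubProvider, and the router's closure are hypothetical. StubProvider stands in for both the on-device and cloud implementations so the example runs without Core AI or a network.

```swift
/// Hypothetical parameter bag. Real providers would map this to their own settings.
struct GenerationParameters {
    var maxTokens = 256
    var temperature = 0.7
}

protocol LLMProvider {
    func generate(prompt: String, parameters: GenerationParameters) -> AsyncThrowingStream<String, Error>
}

/// Stand-in provider that echoes the prompt word by word.
struct StubProvider: LLMProvider {
    let name: String

    func generate(prompt: String, parameters: GenerationParameters) -> AsyncThrowingStream<String, Error> {
        let words = "[\(name)] \(prompt)".split(separator: " ").prefix(parameters.maxTokens)
        return AsyncThrowingStream { continuation in
            for word in words { continuation.yield(String(word)) }
            continuation.finish()
        }
    }
}

/// Picks a provider per request. Feature code only ever sees `any LLMProvider`.
struct LLMRouter {
    let onDevice: any LLMProvider
    let cloud: any LLMProvider
    let preferOnDevice: (String) -> Bool

    func provider(for prompt: String) -> any LLMProvider {
        preferOnDevice(prompt) ? onDevice : cloud
    }
}

/// Drains a token stream into a single string.
func collect(_ stream: AsyncThrowingStream<String, Error>) async throws -> String {
    var parts: [String] = []
    for try await token in stream { parts.append(token) }
    return parts.joined(separator: " ")
}

let router = LLMRouter(
    onDevice: StubProvider(name: "on-device"),
    cloud: StubProvider(name: "cloud"),
    preferOnDevice: { $0.count < 2_000 }
)
```

From an async context, feature code would call something like `try await collect(router.provider(for: prompt).generate(prompt: prompt, parameters: GenerationParameters()))`, and swapping a stub for a real implementation requires no changes at the call site.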

Edge cases matter. A user can start a cloud-routed generation and lose connectivity partway through the response. Caching partial cloud responses to recover from this creates confusing UX. The cleaner approach is proactive: check connectivity before starting each request and prefer on-device whenever the network is unreliable.

Thermal management is another signal to monitor. After 60-90 seconds of sustained Neural Engine inference on iPhone, thermal throttling drops performance by 20-30%. Monitor ProcessInfo.processInfo.thermalState and route to cloud at the .serious level for continuous inference workloads like real-time transcription.
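The thermal rule can be captured in a small decision function. To keep this sketch runnable anywhere, it defines its own ThermalLevel enum that mirrors the four cases of Foundation's ProcessInfo.ThermalState (nominal, fair, serious, critical); in an app you would map the system value onto it. Offloading every workload at the critical level is our own assumption, not something stated above.

```swift
/// Mirrors the cases of ProcessInfo.ThermalState so the sketch has no platform dependency.
enum ThermalLevel: Int, Comparable {
    case nominal, fair, serious, critical

    static func < (lhs: ThermalLevel, rhs: ThermalLevel) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

/// Decides whether a request should leave the device for thermal reasons.
/// - Continuous workloads offload at `.serious` or above.
/// - Assumption: at `.critical`, every workload offloads.
/// - Without a network there is nowhere to offload to, so stay on-device.
func shouldOffloadToCloud(thermal: ThermalLevel,
                          isContinuousWorkload: Bool,
                          networkReachable: Bool) -> Bool {
    guard networkReachable else { return false }
    if thermal >= .critical { return true }
    return isContinuousWorkload && thermal >= .serious
}

print(shouldOffloadToCloud(thermal: .serious, isContinuousWorkload: true, networkReachable: true))
print(shouldOffloadToCloud(thermal: .serious, isContinuousWorkload: false, networkReachable: true))
```

On Apple platforms, you can re-evaluate this whenever ProcessInfo.thermalStateDidChangeNotification fires rather than polling.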

The cost impact is significant. By running 80-90% of queries on-device and routing only complex queries to the cloud, you cut inference bills dramatically. For a consumer app with 100,000 daily active users, we have seen teams reduce monthly inference costs from $15,000 to under $2,000 with this hybrid approach. For a deeper look at building on-device AI apps from scratch, our guide to building on-device AI mobile apps covers the full engineering process.

Core AI vs ExecuTorch vs ONNX Runtime: Making the Right Choice

Core AI is not the only option for running LLMs on iOS. ExecuTorch (Meta's on-device inference engine for PyTorch models) and ONNX Runtime (Microsoft's cross-platform inference engine) both support iOS deployment. Each has legitimate strengths, and the right choice depends on your constraints.

Core AI wins on performance and integration. As a first-party framework, it has direct access to Neural Engine scheduling and hardware acceleration paths that third-party frameworks cannot use. In our benchmarks, Core AI runs the same 3B parameter model 30-40% faster than ExecuTorch and 25-35% faster than ONNX Runtime on identical iPhone 16 Pro hardware. It also integrates cleanly with SwiftUI, Combine, and the rest of the Apple stack. No bridging headers, no C++ interop, no manual memory management.

ExecuTorch wins on model flexibility and cross-platform reach. If you have a custom PyTorch model that also needs to run on Android, ExecuTorch lets you use the same model artifact on both platforms. If your model uses custom operators or non-standard layer types, ExecuTorch is more likely to support them. The tradeoff: ExecuTorch on iOS uses Metal Performance Shaders rather than direct Neural Engine access, so you leave significant performance on the table.

ONNX Runtime occupies the middle ground. It supports a wide range of model formats, runs on iOS, Android, Windows, Linux, and web (via WebAssembly), and offers reasonable performance through its Core ML execution provider. However, the translation layer introduces overhead, and ONNX Runtime's LLM support through the GenAI extension is less mature for autoregressive generation.

Here is our decision framework. Choose Core AI if you are building iOS-first, want the best possible performance on Apple hardware, and use standard transformer architectures (GPT-style, Llama-style, Mistral-style). Choose ExecuTorch if you need the same model on iOS and Android or your team's ML expertise is in PyTorch. Choose ONNX Runtime if you need to deploy across four or more platforms including web.

Performance comparison on iPhone 16 Pro with a 3B Llama-architecture model in 4-bit quantization: Core AI achieves 32 tokens/second with 180ms first-token latency. ExecuTorch achieves 20 tokens/second with 350ms first-token latency. ONNX Runtime achieves 22 tokens/second with 300ms first-token latency. The relative ranking stays consistent across hardware generations, but Core AI's advantage grows on newer Apple Silicon.

One more consideration: app size. Core AI adds zero overhead for Apple's built-in models (they ship with the OS). ExecuTorch adds approximately 15-20MB for the runtime library. ONNX Runtime adds approximately 25-30MB. For consumer apps where download size matters, this difference is meaningful. For more context on how Core AI fits within the broader platform, our guide to the Apple Intelligence SDK covers the full story.

Getting Started: Integration Checklist and Next Steps

If you are ready to integrate Core AI into your iOS app, here is the practical checklist we follow with every client project. This is not theory. It is the sequence of decisions and implementation steps that gets you from zero to a shipping on-device LLM feature.

Step one: define your model requirements. What task does your LLM feature perform? Summarization, entity extraction, conversational Q&A, code generation? The task determines which model you need. For summarization and extraction, Apple's built-in 1.5B fast model is sufficient. For conversational features or domain-specific tasks, you will likely need the 3B general model or a custom fine-tuned model. Document your expected input length, required output quality, and acceptable latency before writing any code.

Step two: validate device compatibility. Run CoreAISession.isSupported on your minimum target device. If your app supports devices older than iPhone 15 Pro, build your fallback path (cloud API or feature gating) before building the on-device path. This prevents the common mistake of building a polished on-device feature and then rushing a mediocre fallback at the last minute.

Step three: build your prompt templates. Core AI model quality depends heavily on prompt engineering, just like cloud LLMs. Write your system prompts, test them with representative inputs, and iterate until output quality meets your bar. Use the Core AI Playground in Xcode to test prompts interactively. A well-crafted prompt on a 3B model often outperforms a lazy prompt on a 7B model.

Step four: implement the streaming UI. Build your token-by-token text rendering component before integrating Core AI. Use a timer-based mock that emits tokens at 30 per second to develop the UI independently. When Core AI integration is ready, swap the mock for the real AsyncSequence.
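A mock like this takes only a few lines with Swift concurrency. The sketch below uses AsyncStream and Task.sleep from the standard library. The function names are our own, and splitting on spaces is a simplification, since real tokens are usually sub-word pieces.

```swift
/// Emits the words of `text` as mock tokens at roughly `tokensPerSecond`.
/// Cancels its producer task when the consumer stops iterating.
func mockTokenStream(text: String, tokensPerSecond: Double = 30) -> AsyncStream<String> {
    let tokens = text.split(separator: " ").map { String($0) + " " }
    let delayNanos = UInt64(1_000_000_000 / tokensPerSecond)
    return AsyncStream { continuation in
        let task = Task {
            for token in tokens {
                try? await Task.sleep(nanoseconds: delayNanos)
                if Task.isCancelled { break }
                continuation.yield(token)
            }
            continuation.finish()
        }
        continuation.onTermination = { _ in task.cancel() }
    }
}

/// Consumes any async sequence of string tokens and accumulates the display text,
/// the way a view model would before publishing to the UI.
func render<S: AsyncSequence>(_ stream: S) async throws -> String where S.Element == String {
    var text = ""
    for try await token in stream {
        text += token
    }
    return text
}
```

Because render is generic over AsyncSequence, the same UI code path should accept the real token sequence later, provided its element type is String or can be mapped to it.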

Step five: add instrumentation. Log generation latency (first token, total), token count, model load time, thermal state, and routing decisions. This telemetry is essential for optimizing performance post-launch.
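A small recorder type is enough to capture the latency numbers. This is a sketch under our own naming: timestamps are passed in as plain seconds so the logic is easy to test, and in an app you would feed it a monotonic clock reading. Model load time, thermal state, and routing decision can be added as extra fields in the same way.

```swift
import Foundation

struct GenerationMetrics {
    let firstTokenLatency: TimeInterval?   // nil if no token was produced
    let totalDuration: TimeInterval
    let tokenCount: Int

    var tokensPerSecond: Double {
        totalDuration > 0 ? Double(tokenCount) / totalDuration : 0
    }
}

/// Records timing for one generation. Timestamps are injected by the caller.
struct GenerationRecorder {
    private let start: TimeInterval
    private var firstToken: TimeInterval?
    private var count = 0

    init(startedAt start: TimeInterval) {
        self.start = start
    }

    /// Call once per received token.
    mutating func recordToken(at time: TimeInterval) {
        if firstToken == nil { firstToken = time }
        count += 1
    }

    /// Call when the stream ends to produce the summary for logging.
    func finish(at time: TimeInterval) -> GenerationMetrics {
        GenerationMetrics(firstTokenLatency: firstToken.map { $0 - start },
                          totalDuration: time - start,
                          tokenCount: count)
    }
}

var recorder = GenerationRecorder(startedAt: 10.0)
recorder.recordToken(at: 10.25)
recorder.recordToken(at: 10.5)
let metrics = recorder.finish(at: 11.0)
print(metrics.firstTokenLatency ?? -1, metrics.tokenCount, metrics.tokensPerSecond)
```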

Step six: test on real hardware. Simulator testing is insufficient because it lacks a Neural Engine. You need physical devices: iPhone 15 Pro (minimum spec), iPhone 16 Pro (target spec), and at least one M-series iPad. Test under memory pressure, at high thermal state, and offline. These are the conditions your users will actually encounter.

The on-device LLM space is moving fast. Building on Core AI now positions your app to adopt future improvements with minimal rework because you are on Apple's first-party framework rather than a third-party runtime that may not keep pace.

If you want to skip the learning curve and ship an on-device LLM feature in weeks instead of months, our team has done this across healthcare, finance, and productivity apps. Book a free strategy call and we will scope the architecture, model selection, and timeline for your specific use case.
