---
title: "Edge AI for Mobile Apps: On-Device Models That Ship in 2026"
author: "Nate Laquis"
author_role: "Founder & CEO"
date: "2028-06-08"
category: "Technology"
tags:
  - edge AI mobile apps
  - on-device AI models
  - CoreML deployment
  - TensorFlow Lite mobile
  - ONNX Runtime Mobile
  - ExecuTorch
  - NPU inference
excerpt: "On-device AI has moved from research demos to production features shipping in millions of apps. Here is a practical guide to the model formats, optimization techniques, and implementation patterns that actually work in 2026."
reading_time: "16 min read"
canonical_url: "https://kanopylabs.com/blog/edge-ai-for-mobile-apps-2026"
---

# Edge AI for Mobile Apps: On-Device Models That Ship in 2026

## The State of On-Device AI in 2026

Two years ago, running a useful AI model on a phone meant tolerating slow inference, battery drain, and models so small they were barely functional. That era is over. The hardware caught up, the tooling matured, and the ecosystem converged around a handful of runtimes that work in production.

Apple Intelligence now ships a 3-billion-parameter language model on every iPhone 16 and later, handling summarization, entity extraction, and basic reasoning without a network request. Google followed with Gemini Nano on Tensor G4/G5 chips, available to third-party Android developers through the AI Core API. Samsung's Galaxy AI uses a mix of on-device Gemini Nano and cloud fallback for translation and photo editing.

The hardware driving this is the Neural Processing Unit (NPU). Apple's A18 Pro delivers 35 TOPS (trillion operations per second). Qualcomm's Snapdragon 8 Elite hits 45 TOPS with its Hexagon NPU. MediaTek's Dimensity 9400 reaches 40 TOPS. These translate directly into running 1B-3B parameter models at interactive speeds: 10-30ms for classification, 50-150ms for generative text.

![Modern smartphones and tablets running on-device AI models for mobile applications](https://images.unsplash.com/photo-1512941937669-90a1b58e7e9c?w=800&q=80)

For developers, the practical implication is straightforward: if your AI feature involves classifying inputs, detecting objects, processing text under 500 tokens, or running a small generative model, you can do it on-device today with good performance on any flagship phone from the last two years. The question is no longer "can we run this on-device?" but "should we, and with which runtime?"

## Model Formats and Runtimes: CoreML, TFLite, ONNX, and ExecuTorch

The runtime you choose determines your performance ceiling, your platform reach, and how much pain you will endure during deployment. Here is what each option actually looks like in production.

### CoreML (Apple Platforms)

CoreML is Apple's on-device inference framework, and it remains the fastest path to the Neural Engine on iOS, iPadOS, macOS, and visionOS. CoreML added support for stateful models in iOS 18, which means you can run multi-turn language models without reloading context on every inference call. Model conversion from PyTorch is handled by **coremltools**, which has improved significantly but still chokes on certain custom operators. Expect to spend 2-4 hours debugging conversion issues for anything beyond a standard vision or NLP model.

Performance is excellent. A MobileNetV3 image classifier runs in 3ms on the A18 Pro Neural Engine. A 1.5B-parameter language model generates tokens at roughly 12 tokens per second on an iPhone 16 Pro. The limitation is platform lock-in: CoreML models only run on Apple hardware.

### TensorFlow Lite (TFLite)

TFLite has been the workhorse of on-device ML on Android for years. It supports GPU delegate acceleration on most Android devices and the NNAPI delegate for NPU access. Model conversion from TensorFlow and Keras is mature and well-documented. The ecosystem includes pre-trained models for common tasks (object detection, pose estimation, text classification) through TensorFlow Hub.

In 2026, TFLite remains the safe choice for Android. An SSD MobileNet object detection model runs at 25-30 FPS on a Snapdragon 8 Gen 3 using the GPU delegate. The downside is that Google has been shifting investment toward Gemini Nano and the AI Core API, and TFLite's update cadence has slowed compared to 2023-2024.

### ONNX Runtime Mobile

ONNX Runtime Mobile is the cross-platform option. It runs on iOS, Android, Windows, macOS, and Linux. Microsoft has invested heavily in its mobile execution providers, including CoreML (for iOS NPU access), NNAPI (for Android NPU access), and XNNPACK (for CPU fallback). The major advantage is a single model format that deploys everywhere.

The tradeoff is performance overhead. ONNX Runtime Mobile is typically 10-20% slower than native CoreML on iOS and 5-15% slower than TFLite with tuned delegates on Android. For many apps, that difference is irrelevant. For real-time camera processing at 30 FPS, it can matter. Conversion from PyTorch to ONNX is straightforward using **torch.onnx.export**, though dynamic shapes and certain attention mechanisms require careful handling.
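To see when that overhead matters, a rough frame-budget check helps. This is a minimal sketch: the function names and the 25 ms and 5 ms example figures are illustrative assumptions, not measurements. Only the 10-20% overhead range comes from the comparison above.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds."""
    return 1000.0 / fps


def fits_frame_budget(native_ms: float, overhead: float, fps: float = 30.0,
                      other_work_ms: float = 0.0) -> bool:
    """True if inference (with runtime overhead) plus other per-frame work fits in one frame."""
    total_ms = native_ms * (1.0 + overhead) + other_work_ms
    return total_ms <= frame_budget_ms(fps)


# Hypothetical model: 25 ms on the native runtime, plus 5 ms of pre/post-processing.
print(fits_frame_budget(25.0, 0.00, other_work_ms=5.0))  # native runtime
print(fits_frame_budget(25.0, 0.20, other_work_ms=5.0))  # runtime that is 20% slower
```

At 30 FPS each frame has about 33 ms, so a model that fits natively with little headroom can miss the budget once a 20% overhead is applied. A model that runs in single-digit milliseconds would not be affected.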

### ExecuTorch (Meta/PyTorch)

ExecuTorch is the newest entrant, released by Meta as part of PyTorch 2.x. It is designed specifically for on-device inference with a focus on LLMs and generative models. ExecuTorch supports quantization-aware export directly from PyTorch, which eliminates the conversion step that plagues other runtimes. It includes delegates for Apple's CoreML, Qualcomm's QNN, and MediaTek's NeuroPilot.

ExecuTorch is the best option if you are deploying a custom-trained PyTorch model and want to avoid the ONNX or CoreML conversion pipeline entirely. The tooling is still less mature than CoreML or TFLite (expect rougher documentation and fewer Stack Overflow answers), but the trajectory is strong. Meta runs ExecuTorch in production across Instagram, WhatsApp, and Facebook for on-device ranking and content understanding models.

## Use Cases That Work On-Device vs. Cloud

Not every AI feature belongs on the device. The decision comes down to model size, latency requirements, privacy constraints, and how often the model needs updating. Here is where the line falls in practice.

### On-Device Winners

- **Image classification and object detection:** Models like MobileNetV3 and EfficientDet-Lite are under 15MB, run in single-digit milliseconds, and handle most classification tasks with 90%+ accuracy. If you are identifying products, scanning documents, or detecting objects in a camera feed, on-device is the default choice.

- **Text classification and sentiment analysis:** DistilBERT-based models quantized to INT8 are under 25MB and classify text in 5-10ms. Spam detection, content categorization, and intent classification all run well on-device.

- **Keyword spotting and voice commands:** Wake-word detection and simple voice command recognition must run on-device. Any network latency here makes the feature feel broken.

- **On-device search and ranking:** Embedding models that encode user queries and content into vectors for local similarity search. This powers features like "search your photos" or "find similar items" without sending personal data to a server.
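The local similarity search in the last item can be sketched in a few lines. This is a toy illustration, assuming the embeddings have already been produced by an on-device model; the file names and 3-dimensional vectors are made up, and a real index would use higher-dimensional vectors and an optimized library rather than pure Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=3):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), item_id) for item_id, vec in index.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Hypothetical embeddings stored locally on the device.
photo_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.1],
    "sunset.jpg": [0.8, 0.2, 0.1],
}
print(top_k([1.0, 0.0, 0.0], photo_index, k=2))  # prints ['beach.jpg', 'sunset.jpg']
```

Nothing in this path touches the network, which is the point: both the query vector and the index stay on the device.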

### Cloud-Only (For Now)

- **Large language model inference beyond 3B parameters:** GPT-4-class reasoning, long-context analysis, and complex multi-step generation still require cloud infrastructure. A 7B model can technically run on a flagship phone, but at 2-3 tokens per second with significant battery drain, the user experience is poor.

- **Image generation above 512x512:** Stable Diffusion on-device takes 15-30 seconds per image on current hardware. Cloud generation takes 2-4 seconds. Users notice.

- **RAG with large knowledge bases:** If your retrieval corpus is larger than what fits in on-device storage (typically 100MB-1GB of embeddings), you need cloud infrastructure. As we covered in our [on-device AI vs cloud AI comparison](/blog/on-device-ai-vs-cloud-ai), the hybrid approach often works best here.

### The Hybrid Zone

Most production apps land somewhere in between. A common pattern is on-device inference for the fast path (classify, detect, filter) with cloud fallback for the hard cases (generate, reason, retrieve). For example, an e-commerce app might use an on-device model to detect product categories from photos and then call a cloud API only when the user asks for a detailed product description or comparison.
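As a rough sketch of that split, a router can decide per task which path to use. The task names mirror the fast-path and hard-case groupings above; the function name and the `unavailable_offline` state are hypothetical choices for illustration.

```python
# Task groupings follow the fast-path / hard-case split described above.
ON_DEVICE_TASKS = {"classify", "detect", "filter"}
CLOUD_TASKS = {"generate", "reason", "retrieve"}


def route(task: str, online: bool) -> str:
    """Pick an inference path for a task name."""
    if task in ON_DEVICE_TASKS:
        return "on_device"  # fast path: never needs the network
    if task in CLOUD_TASKS:
        # Hard cases need the cloud; surface a clear state when offline.
        return "cloud" if online else "unavailable_offline"
    raise ValueError(f"unknown task: {task}")


print(route("detect", online=False))    # prints on_device
print(route("generate", online=True))   # prints cloud
```

Making the offline state explicit lets the UI disable or queue cloud-only actions instead of failing mid-request.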

## Model Optimization: Quantization, Pruning, and Distillation

Raw models from training are almost never ready for on-device deployment. A PyTorch model trained on a cloud GPU with FP32 weights will be 4x larger than it needs to be and run 3-5x slower than an optimized version. Here are the techniques that matter, in order of impact.

### Quantization

Quantization reduces the precision of model weights from 32-bit floating point (FP32) to lower precision formats. The most common target is INT8 (8-bit integers), which cuts model size by 75% and speeds up inference by 2-4x on hardware with INT8 acceleration (which includes every modern phone NPU).

Post-training quantization (PTQ) is the easiest path. Run a calibration dataset through your trained FP32 model, and the quantization tool maps weights to INT8 ranges. CoreML, TFLite, and ONNX Runtime all support PTQ with minimal code. Accuracy loss is typically 0.5-2% for classification and 1-3% for detection.
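To show what "maps weights to INT8 ranges" means, here is a simplified sketch of affine quantization with min/max calibration. It is a teaching example in pure Python, not the algorithm any particular toolkit uses: production tools typically calibrate activations over a dataset, often quantize per channel, and may pick ranges with more robust statistics than min/max. All function names are illustrative.

```python
def calibrate(values, num_bits=8):
    """Derive an affine scale and zero point from the observed min/max."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo = min(min(values), 0.0)  # the range must include 0 so that 0.0 maps exactly
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point, qmin, qmax


def quantize(values, scale, zero_point, qmin, qmax):
    """Map floats to integers, clamping to the representable range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(quantized, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in quantized]


calibration = [-1.0, -0.25, 0.0, 0.5, 1.5]  # stand-in for observed values
scale, zero_point, qmin, qmax = calibrate(calibration)
q = quantize(calibration, scale, zero_point, qmin, qmax)
restored = dequantize(q, scale, zero_point)
print(q)
print([round(v, 3) for v in restored])
```

Each restored value differs from the original by at most half a quantization step, which is the source of the small accuracy loss described above.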

Quantization-aware training (QAT) simulates low-precision arithmetic during training, letting the model compensate for quantization error. QAT typically recovers 50-80% of the accuracy lost in PTQ. It requires retraining, but for models where every percentage point matters, it is worth the effort.

For LLMs deployed on-device, 4-bit quantization (INT4 or GPTQ/AWQ formats) is now standard. A 3B-parameter model at FP16 is roughly 6GB. At INT4, it shrinks to 1.5GB, which fits comfortably in the memory of any flagship phone. The accuracy tradeoff is real but manageable for most consumer-facing tasks.
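The size figures follow from simple arithmetic: parameters times bits per weight, divided by eight to get bytes. The helper below is a back-of-the-envelope sketch (the function name is illustrative) that counts weight storage only and ignores file metadata, quantization scales, and runtime activation memory.

```python
def model_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"3B params at {label}: {model_size_gb(3e9, bits):.1f} GB")
```

For a 3B-parameter model this gives 12.0, 6.0, 3.0, and 1.5 GB respectively, matching the FP16 and INT4 figures above.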

![Developer working on model optimization code for mobile AI deployment](https://images.unsplash.com/photo-1555949963-ff9fe0c870eb?w=800&q=80)

### Pruning

Pruning removes weights (or entire neurons/channels) that contribute little to the model's output. Structured pruning, which removes entire channels or attention heads, is more practical for mobile because it reduces actual computation rather than just creating sparse matrices that hardware cannot exploit efficiently. A typical structured pruning pipeline removes 30-50% of channels with 1-2% accuracy loss, then fine-tunes the pruned model to recover performance.
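A minimal sketch of the channel-selection step is shown below. It assumes L1-norm magnitude as the importance criterion, which is one common choice and not necessarily what any given toolkit uses; the function names and toy weights are illustrative.

```python
def channel_l1_norms(weights):
    """weights[c] is the flat list of weights for output channel c."""
    return [sum(abs(w) for w in channel) for channel in weights]


def prune_channels(weights, prune_ratio):
    """Drop the lowest-magnitude channels; return (kept_weights, kept_indices)."""
    norms = channel_l1_norms(weights)
    num_to_prune = int(len(weights) * prune_ratio)
    ranked = sorted(range(len(weights)), key=lambda c: norms[c])
    pruned = set(ranked[:num_to_prune])
    kept = [c for c in range(len(weights)) if c not in pruned]
    return [weights[c] for c in kept], kept


layer = [
    [0.9, -0.8, 0.7],     # channel 0: strong
    [0.01, 0.02, -0.01],  # channel 1: weak
    [0.5, -0.4, 0.6],     # channel 2: moderate
    [0.03, -0.02, 0.0],   # channel 3: weak
]
smaller_layer, kept = prune_channels(layer, prune_ratio=0.5)
print(kept)  # prints [0, 2]
```

In a real network the matching input channels of the next layer must be removed too, and the fine-tuning pass mentioned above is what recovers the lost accuracy.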

### Knowledge Distillation

Distillation trains a small "student" model to mimic a large "teacher" model. The student learns the teacher's probability distributions across all classes, not just correct answers, which transfers more information than hard labels alone. This is how Apple created the 3B-parameter models powering Apple Intelligence.

Distillation is the most effective technique when you need to cross a size threshold. If your cloud model is 1B parameters and you need to get under 100M for on-device deployment, pruning and quantization alone will not get you there. Distillation lets you design an architecture that fits your hardware budget and train it to perform almost as well as the original.
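The soft-target idea can be sketched as a loss function: soften both models' outputs with a temperature, then penalize the student for diverging from the teacher's distribution. This is a pure-Python illustration of the commonly used KL-divergence formulation, with a temperature of 2.0 chosen arbitrarily; it is not a description of any specific vendor's training recipe.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


teacher = [2.0, 1.0, 0.1]
print(distillation_loss([2.0, 1.0, 0.1], teacher))  # identical outputs
print(distillation_loss([0.1, 1.0, 2.0], teacher))  # student ranks classes differently
```

The loss is zero when the student matches the teacher exactly and grows as their distributions diverge. In practice it is usually combined with the ordinary hard-label loss.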

### ONNX Conversion Pipeline

For teams targeting multiple platforms, the typical pipeline is: train in PyTorch, export to ONNX, optimize with ONNX Runtime's tools (graph optimizations, operator fusion), quantize to INT8, then deploy with ONNX Runtime Mobile or convert to CoreML (**onnx-coreml**) and TFLite (**onnx-tf**). Check that the converter you pick is still maintained: **onnx-coreml** in particular has been deprecated in favor of converting directly from PyTorch with **coremltools**. This pipeline adds 1-2 days of engineering work but eliminates maintaining separate training pipelines per platform.

## Implementation with React Native and Flutter

Cross-platform frameworks add a layer between your app and the on-device runtime. That layer introduces both complexity and opportunity. Here is how to integrate on-device models in the two dominant cross-platform frameworks.

### React Native

React Native does not have a built-in ML inference API. You have three practical options. First, use a native module that wraps CoreML on iOS and TFLite on Android. Libraries like **react-native-fast-tflite** (by Marc Rousavy) provide a JSI-based bridge that avoids the old bridge serialization overhead. This is the highest-performance option: inference calls go from JavaScript to native code through a synchronous C++ interface, with no JSON serialization penalty.

Second, use ONNX Runtime for React Native (**onnxruntime-react-native**), which Microsoft maintains officially. This gives you a single model format across both platforms at the cost of the 10-20% performance overhead mentioned earlier. For apps where inference time is measured in tens of milliseconds rather than single digits, this overhead is invisible to users.

Third, for vision tasks, use the device's camera with a frame processor (via **react-native-vision-camera**) and run inference on each frame in a native module. This pattern avoids round-tripping pixel data through JavaScript entirely. The frame processor runs on a native thread, calls your model, and posts results back to JS. We use this pattern for barcode scanning, document detection, and real-time object recognition in client apps. As we detailed in our [Apple Intelligence SDK guide](/blog/apple-intelligence-sdk-on-device-ai-guide), the Neural Engine integration through CoreML frame processors is particularly efficient on recent iPhones.

### Flutter

Flutter's situation is similar but with different tooling. The **tflite_flutter** plugin provides TFLite integration for both platforms. Google also maintains **google_mlkit**, which wraps ML Kit for common tasks like face detection, barcode scanning, text recognition, and pose estimation. ML Kit handles model management, hardware acceleration, and fallback logic, making it the fastest path to shipping if your use case matches a pre-built API.

For custom models, package the model file as an asset, load it at startup (or lazily on first use), and run inference through a platform channel or Dart FFI binding. Dart's FFI support now lets you call C/C++ inference code directly without a platform channel, reducing latency by 1-3ms per call.

### Model Bundling and Updates

A critical decision for any cross-platform app is whether to bundle the model in the app binary or download it after install. Bundling guarantees the model is available on first launch but increases your app size, which directly impacts download conversion rates. Google's research shows that every 6MB increase in APK size reduces install conversion by 1%.

The alternative is downloading models on first launch or on demand. Both CoreML and TFLite support this. The tradeoff: the feature is unavailable until the download completes, and you need to handle versioning, cache invalidation, and partial download recovery. For models under 25MB, bundle them. For models over 50MB, download them. For sizes in between, test your install conversion rate and decide based on data.
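Those thresholds are easy to encode as a build-time check. The sketch below uses the 25MB and 50MB cutoffs and the 6MB-per-1% rule of thumb from this section; the function names and return strings are illustrative, and the linear conversion estimate is a simplification.

```python
def delivery_strategy(model_mb: float) -> str:
    """Apply the size thresholds above to a model file."""
    if model_mb < 25:
        return "bundle"
    if model_mb > 50:
        return "download"
    return "test_install_conversion"  # in-between sizes: decide from data


def estimated_install_drop_pct(extra_mb: float) -> float:
    """Rule of thumb from the text: about 1% conversion loss per 6 MB added."""
    return extra_mb / 6.0


for size_mb in (12, 35, 80):
    print(size_mb, delivery_strategy(size_mb), f"{estimated_install_drop_pct(size_mb):.1f}%")
```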

## Performance Benchmarks and Battery Impact

Benchmarks without hardware context are meaningless. Here are numbers from production deployments, measured on real devices with thermal throttling and background processes running.

### Inference Latency

- **MobileNetV3 image classification (INT8, 5MB):** iPhone 16 Pro (CoreML, Neural Engine): 2.8ms. Pixel 9 Pro (TFLite, GPU delegate): 4.1ms. Samsung Galaxy S25 Ultra (ONNX Runtime, NNAPI): 3.9ms.

- **EfficientDet-Lite2 object detection (INT8, 12MB):** iPhone 16 Pro: 8.2ms. Pixel 9 Pro: 11.4ms. Galaxy S25 Ultra: 10.8ms.

- **DistilBERT text classification (INT8, 22MB):** iPhone 16 Pro: 6.5ms. Pixel 9 Pro: 9.3ms. Galaxy S25 Ultra: 8.7ms.

- **Gemma 2B language model (INT4, 1.2GB):** iPhone 16 Pro: 9 tokens/sec. Pixel 9 Pro (via AI Core): 7 tokens/sec. Galaxy S25 Ultra: 8 tokens/sec.
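If you want to collect comparable numbers for your own model, a small harness with warmup runs and percentile reporting is a reasonable starting point. The sketch below is a generic pattern, not the setup used for the figures above; `run_inference` stands for whatever call invokes your model, and the injectable `clock` parameter is an assumption added to make the harness testable.

```python
import statistics
import time


def benchmark(run_inference, warmup=10, iterations=100, clock=time.perf_counter):
    """Time repeated calls and report latency percentiles in milliseconds."""
    for _ in range(warmup):
        run_inference()  # discard early runs (model load, caches, delegate setup)
    samples_ms = []
    for _ in range(iterations):
        start = clock()
        run_inference()
        samples_ms.append((clock() - start) * 1000.0)
    samples_ms.sort()
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[int(0.95 * (len(samples_ms) - 1))],
        "max_ms": samples_ms[-1],
    }


# Stand-in workload; replace with a real model invocation on the target device.
stats = benchmark(lambda: sum(range(10_000)))
print({name: round(value, 3) for name, value in stats.items()})
```

Reporting p95 alongside the median matters on phones, where thermal throttling and background work make tail latency noticeably worse than the typical case.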

### Battery Impact

This is where on-device AI gets tricky. The NPU is far more power-efficient than CPU or GPU for the same computation, but it still consumes energy. Our production measurements show these patterns.

For intermittent inference (classification triggered by user action, 10-50 times per session), battery impact is negligible. We measured less than 0.5% additional battery drain per hour compared to the same app without the ML feature. Users will never notice.

For continuous inference (real-time camera processing at 30 FPS), battery impact is significant. On an iPhone 16 Pro running a MobileNet classifier on every camera frame, we measured 8-12% additional battery drain per hour. On Android devices, the range was 10-15%, varying by chipset and delegate configuration. This means a feature like continuous barcode scanning or real-time translation will visibly reduce battery life during active use.

For generative models running sustained inference (long text generation, extended conversation), battery drain is 15-25% per hour of continuous use. This is why both Apple Intelligence and Gemini Nano implement aggressive session timeouts and favor short, bursty interactions.
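To translate these hourly rates into a per-session estimate, a simple linear calculation is enough for planning. The table below copies the ranges from this section; the workload keys and function name are illustrative, and the linear scaling is an assumption that ignores thermal throttling over long sessions.

```python
# Additional battery drain in percent per hour, (low, high), from the measurements above.
DRAIN_PCT_PER_HOUR = {
    "intermittent": (0.0, 0.5),
    "continuous_camera_ios": (8.0, 12.0),
    "continuous_camera_android": (10.0, 15.0),
    "generative_sustained": (15.0, 25.0),
}


def session_drain_pct(workload: str, minutes: float):
    """Estimate the (low, high) extra battery drain for one session, assuming linear scaling."""
    low, high = DRAIN_PCT_PER_HOUR[workload]
    hours = minutes / 60.0
    return low * hours, high * hours


print(session_drain_pct("continuous_camera_ios", 30))  # prints (4.0, 6.0)
print(session_drain_pct("generative_sustained", 30))   # prints (7.5, 12.5)
```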

![Server infrastructure representing cloud AI inference comparison with on-device processing](https://images.unsplash.com/photo-1504868584819-f8e8b4b6d7e3?w=800&q=80)

### Memory Footprint

On-device models share RAM with the rest of the app and the OS. iOS terminates your app if it exceeds roughly 1.4GB on an iPhone 16 (less on older devices). A quantized 2B-parameter model consumes 800MB-1.2GB during inference, leaving little headroom for UI, image caches, and navigation stacks.

The practical ceiling is one large model (under 1GB in memory) or multiple small models (under 100MB each). If you need multiple models running simultaneously, total memory should stay under 400-500MB to avoid crashes on mid-range devices.
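A simple guard can catch budget overruns before they show up as crashes in the field. This sketch assumes a 450MB budget (the midpoint of the 400-500MB range above) and hypothetical model footprints; the numbers you pass in should come from profiling on real devices.

```python
def check_memory_budget(model_footprints_mb, budget_mb=450.0):
    """Return (fits, total_mb) for models that must be resident at the same time."""
    total = sum(model_footprints_mb.values())
    return total <= budget_mb, total


resident = {"classifier": 20.0, "detector": 45.0, "embedder": 90.0}  # hypothetical sizes
print(check_memory_budget(resident))  # prints (True, 155.0)
```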

## Decision Framework: On-Device vs. Cloud vs. Hybrid

After shipping on-device AI features in dozens of apps, we have distilled the decision into a framework that prevents the two most common mistakes: defaulting to cloud when on-device would be cheaper and faster, or forcing on-device when the model simply does not fit.

### Choose On-Device When

- **Latency must be under 50ms:** Real-time camera processing, voice keyword detection, and interactive text features all require sub-50ms inference. Cloud cannot reliably deliver this, because the network round trip alone often exceeds that budget on mobile connections.

- **Privacy is a hard requirement:** Healthcare (HIPAA), financial data, biometric processing, or any feature where user data must not leave the device. On-device is not just preferable here; it may be legally required.

- **Offline capability matters:** Field workers, travelers, users in areas with poor connectivity. If the feature must work without internet, on-device is the only option.

- **Per-inference cost must be zero:** Features with high inference volume (every keystroke, every camera frame, every sensor reading) become expensive at cloud pricing. On-device inference has zero marginal cost after the initial engineering investment.

### Choose Cloud When

- **Model size exceeds 3B parameters:** If your task requires GPT-4-class reasoning, RAG over large corpora, or complex multi-modal understanding, cloud is the only viable option today.

- **Models update frequently:** If you retrain your model weekly or daily (for example, a recommendation model that incorporates fresh user behavior), pushing updates to on-device models is slow and unreliable. Cloud lets you swap models instantly.

- **You need consistent behavior across all devices:** On-device inference can vary by chipset, OS version, and available memory. Cloud inference runs on a single hardware and software stack that you control, so results are far easier to reproduce. For regulated industries where model behavior must be auditable and consistent, cloud is safer.

- **Development timeline is under 4 weeks:** On-device deployment, optimization, and testing across device variants takes time. Cloud deployment with a simple API call ships faster.
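The two checklists can be turned into a first-pass screening function. This is a sketch under assumptions: the `Requirements` fields and the `recommend` function are hypothetical names, only a subset of the criteria is encoded, and treating "retrained at least weekly" as `retrain_interval_days <= 7` is an interpretation of the text.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    max_latency_ms: float
    data_must_stay_on_device: bool
    must_work_offline: bool
    model_params_billions: float
    retrain_interval_days: float
    weeks_to_ship: float


def recommend(req):
    """Return (choice, reasons) using the thresholds listed above."""
    on_device, cloud = [], []
    if req.max_latency_ms < 50:
        on_device.append("latency under 50 ms")
    if req.data_must_stay_on_device:
        on_device.append("privacy is a hard requirement")
    if req.must_work_offline:
        on_device.append("offline capability")
    if req.model_params_billions > 3:
        cloud.append("model exceeds 3B parameters")
    if req.retrain_interval_days <= 7:
        cloud.append("frequent model updates")
    if req.weeks_to_ship < 4:
        cloud.append("timeline under 4 weeks")
    if on_device and cloud:
        choice = "mixed"  # signals point both ways; needs a closer look
    elif cloud:
        choice = "cloud"
    elif on_device:
        choice = "on-device"
    else:
        choice = "either"
    return choice, on_device + cloud


scanner = Requirements(30, False, True, 0.005, 90, 8)  # hypothetical barcode scanner
print(recommend(scanner))
```

A "mixed" result is a prompt for discussion, not an answer. If privacy is one of the on-device reasons, it should be treated as a hard constraint on what may be sent to a server.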

### Choose Hybrid When

Most production apps should use a hybrid approach. Run a small, fast model on-device for the common case and fall back to a cloud model for edge cases or complex requests. This gives you the latency and privacy benefits of on-device for 80-90% of interactions while preserving access to more capable models when needed.

The implementation cost is real: two inference paths, fallback logic, and consistent UX across switching points. Budget 3-5 additional engineering weeks for hybrid compared to a pure on-device or pure cloud approach.
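The fallback logic itself can be small. Below is a minimal sketch of one common approach, confidence-based escalation: trust the on-device answer when its confidence is high, and call the cloud only otherwise. The 0.8 threshold, the function names, and the stand-in models are all assumptions for illustration; a real threshold should be tuned on your own data.

```python
def classify_hybrid(text, local_model, cloud_model, threshold=0.8, online=True):
    """Try the on-device model first; use the cloud only for low-confidence cases."""
    label, confidence = local_model(text)
    if confidence >= threshold or not online:
        return label, "on_device"
    try:
        return cloud_model(text), "cloud"
    except Exception:
        # Network or server failure: degrade to the local answer instead of erroring.
        return label, "on_device"


# Stand-in models for illustration only.
def fake_local(text):
    return ("spam", 0.95) if "win money" in text else ("ham", 0.55)


def fake_cloud(text):
    return "ham"


print(classify_hybrid("win money now", fake_local, fake_cloud))    # prints ('spam', 'on_device')
print(classify_hybrid("lunch tomorrow?", fake_local, fake_cloud))  # prints ('ham', 'cloud')
```

Returning the path alongside the result makes it easy to log how often the cloud is actually used, which feeds directly into cost estimates.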

### Cost Comparison

For a production app with 100,000 monthly active users averaging 50 AI interactions per user per month, the cost breakdown looks like this. Cloud inference (using a hosted model at $0.001 per inference) costs $5,000 per month, or $60,000 per year. On-device inference costs $0 per month in inference fees, but $30,000-$80,000 in upfront engineering for model optimization, testing across devices, and integration. At those figures, the breakeven point falls between 6 months (at $30,000 upfront) and 16 months (at $80,000 upfront), after which on-device is dramatically cheaper. For an app with 500,000 MAU, cloud inference costs $25,000 per month, so on-device pays for itself in roughly one to three months.
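The arithmetic behind these figures is worth keeping in a small script so you can plug in your own usage and pricing. This sketch uses the numbers from the paragraph above; the function names are illustrative, and it ignores ongoing on-device costs such as maintenance and re-testing on new devices.

```python
def monthly_cloud_cost(mau, interactions_per_user, price_per_inference):
    """Monthly cloud inference bill in dollars."""
    return mau * interactions_per_user * price_per_inference


def breakeven_months(upfront_cost, monthly_cost):
    """Months until upfront on-device engineering equals cumulative cloud spend."""
    return upfront_cost / monthly_cost


for mau in (100_000, 500_000):
    monthly = monthly_cloud_cost(mau, 50, 0.001)
    low = breakeven_months(30_000, monthly)
    high = breakeven_months(80_000, monthly)
    print(f"{mau:,} MAU: ${monthly:,.0f}/month, break-even in {low:.1f}-{high:.1f} months")
```

With 100,000 MAU the script reports $5,000 per month and a break-even of 6.0 to 16.0 months; with 500,000 MAU it reports $25,000 per month and 1.2 to 3.2 months. The upper end of the upfront estimate is what pushes break-even past a year at the smaller scale.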

Building an on-device AI feature in 2026 is no longer a research project. The hardware is capable, the runtimes are stable, and a competent mobile team can ship a production-quality on-device model in 4-8 weeks. The key is choosing the right runtime, applying the right optimization techniques, and making the on-device vs. cloud decision based on your specific latency, privacy, and cost requirements. If you are planning an AI-powered mobile app and want to evaluate whether on-device deployment makes sense for your use case, [book a free strategy call](/get-started) and we will walk through the tradeoffs together.

---

*Originally published on [Kanopy Labs](https://kanopylabs.com/blog/edge-ai-for-mobile-apps-2026)*
