Technology·14 min read

React Native + On-Device AI: Building LLM-Powered Mobile Apps

On-device LLMs are no longer a research project. In 2026 you can run Phi-3 Mini or Gemma 2B on a current flagship phone and get 20+ tokens per second. Here is how to wire that into a React Native app that actually ships.

Nate Laquis

Founder & CEO

The State of On-Device LLMs in React Native (2026)

Two years ago, running a language model on a phone meant 2 tokens per second, a device that got hot enough to cook an egg, and a model file that ate half the user's storage. That era is over. The combination of Apple's Neural Engine on A18-class chips, Qualcomm's Hexagon NPU on Snapdragon 8 Elite, and purpose-built inference runtimes has changed the math completely.

Today, Phi-3 Mini (3.8B parameters, 4-bit quantized) runs at roughly 28 tokens per second on an iPhone 16 Pro. Gemma 2B hits 22 tokens per second on a Pixel 9. Those numbers are fast enough for real product features: smart autocomplete, offline chat assistants, real-time translation, and on-device summarization. The model size story has also improved. A 4-bit quantized Phi-3 Mini comes in at around 2.2 GB. Gemma 2B at INT4 is about 1.4 GB. Gemini Nano, which ships pre-loaded on Pixel 9 and Galaxy S25 devices, requires zero storage budget from your app at all.

The React Native ecosystem has caught up faster than most people expected. The combination of the React Native new architecture (TurboModules, JSI, Fabric) with libraries like react-native-executorch and llama.rn means you can write JavaScript that talks directly to native inference runtimes with near-zero bridging overhead. This guide walks you through every layer of that stack, from choosing a model to shipping a production app that handles the edge cases your users will actually encounter.

Before going further: on-device AI is not always the right choice. If your use case requires GPT-4-class reasoning, a 500B parameter model's judgment, or real-time access to live data, you want cloud inference. On-device wins when you need offline capability, privacy (no data leaves the device), low latency for interactive features, or zero per-request cost. The hybrid architecture section below covers how to combine both intelligently.

Bridging Native AI SDKs via TurboModules and Expo Modules API

The first design decision is how to expose native AI capabilities to your JavaScript layer. You have two paths: write your own native module or use a library that already does it. In most cases, start with an existing library and write custom bridging only when you need capabilities those libraries do not expose.

Core ML (iOS)

Apple's Core ML framework is the most mature on-device AI runtime available. You convert PyTorch or TensorFlow models to Core ML format ahead of time with coremltools; at runtime, Core ML handles hardware acceleration across CPU, GPU, and Neural Engine, along with memory management. For text generation specifically, Core ML supports state caching across inference passes, which matters enormously for LLM performance. You write Swift or Objective-C that calls Core ML, then expose that via a TurboModule or Expo Module.

The TurboModule approach uses JSI (JavaScript Interface) for synchronous, near-zero-overhead calls from JavaScript into native code. No serialization to JSON, no async bridge round-trips. If you are building with bare React Native and need maximum control, TurboModules are the right choice. The setup is heavier: you write a TypeScript spec file, implement the native module in Swift and Kotlin, register it with the new architecture's codegen, and configure the build system. Plan for two to three days of setup if you have not done it before.

MediaPipe (Android and iOS)

Google's MediaPipe is the cross-platform answer. MediaPipe Tasks includes an LLM Inference API that supports Gemma, Phi, and Mistral models with a unified API across Android (Hexagon NPU, GPU) and iOS (GPU, Neural Engine). The React Native integration story is less polished than react-native-executorch's, but MediaPipe's cross-platform nature reduces the native code surface you have to maintain. A single Kotlin module and a single Swift module share the same API contract, which your Expo Module wraps.

Expo Modules API

If you are using Expo (managed or bare workflow), the Expo Modules API is the fastest way to build a custom native module. It uses Swift and Kotlin natively but handles most of the boilerplate: module registration, argument conversion, promise and callback handling, event emitters. A well-structured Expo Module that wraps Core ML on iOS and MediaPipe on Android can be written in two to three days by a developer who knows native mobile development. If you are choosing your project setup, read through Expo vs bare React Native before committing to a workflow, because custom native modules do not run in Expo Go: you need the bare workflow or a custom dev client.

When to Write Custom Bridging

Write custom native modules when you need streaming token output with React Native's event emitter system, when you need to pass large tensors across the bridge without copying, or when a library does not expose a specific model architecture or quantization format you need. In all other cases, react-native-executorch or llama.rn will get you to a working prototype faster than rolling your own.

Running Small LLMs with react-native-executorch

react-native-executorch is the library you should reach for first when building on-device LLM features in React Native. It wraps Meta's ExecuTorch runtime, which supports PyTorch models exported to the .pte format. ExecuTorch has first-class support for LLaMA 2, LLaMA 3, Phi-3, and Gemma 2, with hardware-accelerated backends for Apple Neural Engine (via the Core ML delegate) and, on Android, XNNPACK (optimized CPU kernels) and Vulkan (GPU).

Installation and Setup

Start with a bare React Native project using the new architecture (RN 0.75+). Install the library, run the native builds, and make sure you have a model file in ExecuTorch's .pte format. Meta provides pre-exported LLaMA 3.2 1B and 3B models on Hugging Face. For Phi-3 and Gemma 2B, you export them yourself using ExecuTorch's export scripts, which takes about 30 minutes on a machine with a GPU.

The core API is straightforward. You call the useLLM hook, provide the path to your model file, and you get back a generate function that streams tokens. react-native-executorch handles the tokenizer (loading the tokenizer model from a separate file), the KV cache, and memory management across inference calls. You do not think about any of that unless you need to tune it.

llama.rn

llama.rn is the other major option. It wraps llama.cpp, the C++ inference library that supports GGUF model files (the format used by most quantized models on Hugging Face). GGUF has a broader ecosystem of pre-quantized models than ExecuTorch's .pte format, so if you want to quickly prototype with an obscure fine-tuned model, llama.rn may have it available. The tradeoff is that llama.cpp does not use the Neural Engine on iOS. It uses Metal for GPU acceleration on iOS and OpenCL on Android, which is faster than CPU but slower than dedicated NPU acceleration. For production apps where performance matters, react-native-executorch with Core ML delegation is the better choice on iOS.

Model Selection by Use Case

  • Phi-3 Mini (3.8B INT4): Best reasoning quality per byte. Good for chat, summarization, code assist. Requires ~2.2 GB storage. Recommended for flagship iPhone and Android devices.
  • Gemma 2B (INT4): Excellent instruction following, smaller footprint (~1.4 GB). Works well on mid-range Android. Good for translation and Q&A.
  • LLaMA 3.2 1B (INT4): Sub-1 GB, runs on older hardware. Quality is noticeably lower but good enough for autocomplete and short-form generation tasks where latency matters most.
  • Gemini Nano: Zero storage cost on supported devices (Pixel 9, Samsung Galaxy S25). Access it via the Android AICore API, wrapped in a custom Expo Module. Not available on iOS or older Android devices.

Performance Benchmarks: iPhone 16 vs Pixel 9

Here are real numbers from testing in May 2026, measuring tokens per second (TPS) for text generation with a 512-token context. All models are 4-bit quantized. These numbers will vary based on prompt length, context window, and thermal state of the device.

iPhone 16 Pro (A18 Pro, Neural Engine)

  • Phi-3 Mini via react-native-executorch + Core ML: 28 TPS prefill / 24 TPS generation
  • Gemma 2B via react-native-executorch + Core ML: 35 TPS prefill / 30 TPS generation
  • LLaMA 3.2 1B via react-native-executorch + Core ML: 55 TPS prefill / 48 TPS generation
  • Phi-3 Mini via llama.rn (Metal GPU, no Neural Engine): 18 TPS generation

Pixel 9 Pro (Tensor G4, Google TPU)

  • Phi-3 Mini via react-native-executorch + XNNPACK: 22 TPS prefill / 19 TPS generation
  • Gemma 2B via react-native-executorch + XNNPACK: 26 TPS prefill / 23 TPS generation
  • Gemini Nano via Android AICore: 40+ TPS (model is hardware-optimized, pre-loaded)
  • Phi-3 Mini via llama.rn (OpenCL GPU): 12 TPS generation

Mid-Range Android (Snapdragon 7s Gen 2)

  • LLaMA 3.2 1B via react-native-executorch: 10-14 TPS generation
  • Gemma 2B via react-native-executorch: 6-8 TPS generation (borderline for interactive use)

The key takeaway: for interactive chat-style features, you want at least 15 TPS to avoid visible lag. On iPhone 16 Pro, Phi-3 Mini clears that bar comfortably. On mid-range Android, you are better served by LLaMA 3.2 1B or a cloud fallback. The first-token latency (time to first token) also matters for user experience. With Core ML delegation, Phi-3 Mini produces the first token in 300-500 ms on iPhone 16 Pro, which is perceptible but acceptable for a "thinking" indicator pattern.
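The 15 TPS rule of thumb can be expressed as a small selection helper. The sketch below is illustrative only: the ModelOption shape, the pickModel name, and the qualityRank field are assumptions made for this example and are not part of any library. The throughput values you feed in should come from your own measurements.

```typescript
// Hypothetical types and names for illustration; not part of any library.
interface ModelOption {
  name: string;
  measuredTps: number; // generation tokens/sec measured on this device class
  qualityRank: number; // higher = better output quality (your own ranking)
}

const MIN_INTERACTIVE_TPS = 15; // threshold suggested in the text above

// Returns the highest-quality model that clears the threshold,
// or null to signal "use the cloud fallback instead".
function pickModel(
  options: ModelOption[],
  minTps: number = MIN_INTERACTIVE_TPS
): ModelOption | null {
  const fastEnough = options.filter((o) => o.measuredTps >= minTps);
  if (fastEnough.length === 0) return null;
  return fastEnough.reduce((best, o) =>
    o.qualityRank > best.qualityRank ? o : best
  );
}
```

Fed the iPhone 16 Pro generation numbers above, this picks Phi-3 Mini if you rank it highest on quality. Fed the mid-range numbers, it returns null, which is your cue to route to the cloud.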

Thermal Performance

Sustained inference is where phones struggle most. In testing, continuous generation for 2 minutes at maximum context causes iPhone 16 Pro to throttle from 24 TPS down to 18-20 TPS as the device heats up. Budget for 20-25% throughput reduction in sustained-use scenarios. Design your UX to avoid continuous uninterrupted generation: use streaming responses with natural pauses, limit context window to what the feature actually needs, and avoid chaining multiple long generations back-to-back without delay.

Managing Model Files: Bundling, Downloading, and Storage

Model file management is one of the most underestimated challenges in shipping on-device AI. A 1.4 GB model file does not belong in your app binary: the App Store and Google Play both impose size limits, and a binary that large makes for a poor install experience. You need a download-on-demand strategy, and that strategy needs to handle interruptions, storage constraints, and update management.

Never Bundle Large Models in the App Binary

Apple's App Store limit for the over-the-air download size is 4 GB, but the practical limit for a smooth user experience is much lower. Android has similar constraints. More importantly, bundling the model means every user downloads it even if they never use the feature. The right approach is lazy loading: the app ships without the model, and you download it the first time the user accesses the AI feature.

Hosting Model Files

Host your model files on a CDN (CloudFront, Cloudflare R2, or Bunny CDN). GGUF and PTE files are large binary blobs that compress poorly, so CDN egress costs matter. At $0.01-$0.02 per GB for CDN egress, a 2.2 GB Phi-3 Mini download costs about $0.02-$0.04 per install. For an app with 50,000 active users who each download once, that is $1,000-$2,000 in CDN costs. Plan for it in your unit economics.
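To plug in your own numbers, the arithmetic is a single multiplication. This is a back-of-the-envelope sketch; egressCostUsd is a hypothetical helper name, and it assumes every user downloads the full file exactly once with no retries or partial re-downloads.

```typescript
// Hypothetical helper: total CDN egress cost for one model rollout.
// Assumes each download transfers the whole file exactly once.
function egressCostUsd(
  modelSizeGb: number,
  downloads: number,
  pricePerGbUsd: number
): number {
  return modelSizeGb * downloads * pricePerGbUsd;
}

// Figures from the text: 2.2 GB model, 50,000 downloads.
const lowEstimate = egressCostUsd(2.2, 50000, 0.01);  // roughly $1,100
const highEstimate = egressCostUsd(2.2, 50000, 0.02); // roughly $2,200
```

The unrounded result is slightly above the $1,000-$2,000 range quoted above, because 2.2 GB at $0.01-$0.02 per GB is closer to $0.022-$0.044 per install.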

Download Management in React Native

Use react-native-background-downloader for resumable downloads. It handles download resumption after app backgrounding, network interruption recovery, and progress tracking. Store the downloaded model in the app's documents directory (iOS) or internal storage (Android), not the cache directory. Cache directories can be purged by the OS under storage pressure, and you do not want to re-download a 2 GB file because iOS cleared the cache.

Implement a clear download state machine in your app: no model downloaded, downloading (with progress), download complete, model loaded, error states. Show download progress with a progress bar. Tell users how much storage the download requires before they start. Check available storage with react-native-device-info's getFreeDiskStorage before initiating a download and surface a clear error if storage is insufficient.
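One way to model those states is a discriminated union with a pure transition function. The state and event names below are assumptions chosen for this sketch, not an API from any download library. You would dispatch these events from whatever downloader callbacks you use.

```typescript
// Hypothetical state and event names; adapt to your own app.
type DownloadState =
  | { kind: "notDownloaded" }
  | { kind: "downloading"; progress: number } // progress in the range 0..1
  | { kind: "downloaded" }
  | { kind: "loaded" }
  | { kind: "error"; message: string };

type DownloadEvent =
  | { type: "START" }
  | { type: "PROGRESS"; progress: number }
  | { type: "COMPLETE" }
  | { type: "LOADED" }
  | { type: "FAIL"; message: string }
  | { type: "RETRY" };

// Pure transition function: events that make no sense in the current
// state are ignored, so stale callbacks cannot corrupt the UI state.
function transition(state: DownloadState, event: DownloadEvent): DownloadState {
  switch (event.type) {
    case "START":
      return state.kind === "notDownloaded"
        ? { kind: "downloading", progress: 0 }
        : state;
    case "PROGRESS":
      return state.kind === "downloading"
        ? { kind: "downloading", progress: Math.min(1, Math.max(0, event.progress)) }
        : state;
    case "COMPLETE":
      return state.kind === "downloading" ? { kind: "downloaded" } : state;
    case "LOADED":
      return state.kind === "downloaded" ? { kind: "loaded" } : state;
    case "FAIL":
      return { kind: "error", message: event.message };
    case "RETRY":
      return state.kind === "error" ? { kind: "notDownloaded" } : state;
  }
}
```

Because the function is pure, it drops straight into a useReducer call and is easy to unit test without a device.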

Model Updates and Versioning

Version your model files with a hash or semantic version in the filename and track the installed version in AsyncStorage or Expo SecureStore. When you deploy a new model version, compare the stored version against your app's expected version and prompt the user to download the update. Do not force-delete the old model until the new one is fully downloaded and verified. A SHA-256 hash check after download confirms file integrity before you attempt to load it into the inference runtime.
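The update rules above reduce to two small decisions: whether to download, and whether it is safe to swap. The sketch below assumes a hypothetical ModelManifest shape and assumes the SHA-256 digest of the downloaded file has already been computed elsewhere (for example by a native file-hashing module); it only shows the decision logic.

```typescript
// Hypothetical manifest describing the model version the app expects.
interface ModelManifest {
  version: string;
  sha256: string; // expected hex digest of the model file
}

// Download only when the installed version differs from the expected one.
function needsDownload(installedVersion: string | null, expected: ModelManifest): boolean {
  return installedVersion !== expected.version;
}

// Hex digests are compared case-insensitively.
function isVerified(actualSha256Hex: string, expected: ModelManifest): boolean {
  return actualSha256Hex.trim().toLowerCase() === expected.sha256.trim().toLowerCase();
}

// Swap to the new model only after verification succeeds. The old model
// is flagged for deletion only once the new one is confirmed intact.
function finalizeUpdate(
  installedVersion: string | null,
  actualSha256Hex: string,
  expected: ModelManifest
): { activeVersion: string | null; deleteOld: boolean } {
  if (!isVerified(actualSha256Hex, expected)) {
    return { activeVersion: installedVersion, deleteOld: false };
  }
  return {
    activeVersion: expected.version,
    deleteOld: installedVersion !== null && installedVersion !== expected.version,
  };
}
```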

Storage Budgets

Phi-3 Mini INT4 requires 2.2 GB. The tokenizer adds another 2-3 MB. On devices with 64 GB storage, that is a material ask. Consider offering the feature as opt-in and making the storage requirement explicit in your UI. For apps where offline AI is a core feature rather than an enhancement, position it as a premium feature and frame the storage requirement as part of the premium experience.

Battery and Memory Optimization

On-device LLM inference is computationally expensive. If you handle battery and memory carelessly, you will get one-star reviews that say "this app destroys my battery." Here is how to avoid that.

Memory Management

A loaded Phi-3 Mini model consumes roughly 2.5-3 GB of device RAM when the KV cache is allocated for a full context window. iPhone 16 Pro has 8 GB RAM. Budget for your app's other memory usage and make sure you are not loading the model until the user actually needs it. react-native-executorch provides explicit load and unload methods. Call unload when the user navigates away from the AI feature or when the app goes to the background. On iOS, failing to unload a large model after backgrounding increases your risk of being killed by the OS for excessive memory use.

The KV cache size is configurable in react-native-executorch and llama.rn. Smaller KV cache means shorter maximum context but lower memory overhead. For autocomplete features where the context is short (under 512 tokens), set the KV cache accordingly rather than allocating for 4,096 tokens you will never use.

Battery Impact

A single 200-token generation on iPhone 16 Pro draws roughly 0.3-0.5% battery capacity. For a feature that runs once per user session, that is negligible. For a feature that runs continuously (real-time translation, live transcription), it adds up quickly. Benchmark battery drain for your specific use case before shipping. Use iOS's Instruments Energy Log or Android's Battery Historian to measure impact per inference call.

Use the device's thermal state APIs to adapt your behavior. Apple provides ProcessInfo.thermalState (accessible via a native module), and Android provides PowerManager.getThermalHeadroom. When the device enters a serious or critical thermal state, pause inference, show a "cooling down" UI, and resume when the device recovers. This prevents the runaway battery drain pattern where continuous inference pushes the device into thermal throttling, which causes inference to slow down, which means the generation takes longer, which means more total energy consumed.
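On the JavaScript side, the pause-and-resume behaviour can be a small gate that reacts to thermal updates. The four state names below mirror Apple's ProcessInfo.ThermalState cases; how your native module maps Android readings onto them is an assumption left to you. ThermalGate is a hypothetical class written for this sketch.

```typescript
// State names mirror Apple's ProcessInfo.ThermalState cases.
type ThermalState = "nominal" | "fair" | "serious" | "critical";
type GateEvent = "pause" | "resume" | null;

// Hypothetical helper: emits "pause" once when the device gets hot and
// "resume" once when it recovers, so the UI is not re-notified on
// every thermal update.
class ThermalGate {
  private paused = false;

  update(state: ThermalState): GateEvent {
    const shouldPause = state === "serious" || state === "critical";
    if (shouldPause && !this.paused) {
      this.paused = true;
      return "pause"; // stop inference, show the "cooling down" UI
    }
    if (!shouldPause && this.paused) {
      this.paused = false;
      return "resume"; // device recovered, inference may continue
    }
    return null; // no change
  }

  get isPaused(): boolean {
    return this.paused;
  }
}
```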

Batching and Scheduling

Avoid running inference during other compute-intensive operations. If your app is also doing image processing, network requests, or rendering complex UI, schedule inference to run when the device is otherwise idle. On iOS, use background task scheduling for non-interactive generation (pre-generating suggestions, pre-computing embeddings) so the work happens when the device is charging and idle.

Context Window Management

Truncate conversation history aggressively. For a chat assistant feature, you rarely need more than the last 8-10 exchanges in the context window. Keeping a 2,000-token conversation history in context increases inference time and memory use proportionally. Implement a sliding window: keep the system prompt, the most recent N exchanges, and a summarized version of older history. This keeps quality high while controlling resource use.
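A minimal sketch of that sliding window follows. The Exchange shape, the buildContext name, and the plain "User:" / "Assistant:" line format are assumptions for illustration; a real app would use the chat template its model expects, and would produce the summary of older history with a separate generation call.

```typescript
// Hypothetical shape for one user/assistant turn.
interface Exchange {
  user: string;
  assistant: string;
}

// Keeps the system prompt, an optional summary of older turns, and the
// most recent `keepLast` exchanges. Returns the context as lines.
function buildContext(
  systemPrompt: string,
  history: Exchange[],
  keepLast: number,
  olderSummary?: string
): string[] {
  // slice(-0) would return the whole array, so handle 0 explicitly.
  const recent = keepLast > 0 ? history.slice(-keepLast) : [];
  const lines: string[] = [systemPrompt];
  const droppedSomething = history.length > recent.length;
  if (olderSummary && droppedSomething) {
    lines.push(`Summary of earlier conversation: ${olderSummary}`);
  }
  for (const exchange of recent) {
    lines.push(`User: ${exchange.user}`);
    lines.push(`Assistant: ${exchange.assistant}`);
  }
  return lines;
}
```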

Hybrid Architecture: On-Device Plus Cloud Fallback

The most robust architecture for production mobile AI is hybrid: on-device inference as the primary path, cloud inference as the fallback. This gives you the privacy and latency benefits of on-device AI for most users while ensuring everyone gets a working experience regardless of their device capabilities.

Decision Logic

Decide between on-device and cloud at runtime based on several signals. If the model is not downloaded, use cloud. If the device has less than 3 GB available RAM, use cloud. If the device is in a critical thermal state, use cloud. If the task requires capabilities that your on-device model cannot handle well (multi-step reasoning, long-document analysis, complex code generation), use cloud. Implement this as a simple decision function that your inference hook calls before routing the request.
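Those rules translate almost line for line into code. The DeviceSignals shape and the chooseRoute name are assumptions for this sketch; how you obtain each signal (download state, free RAM, thermal state, task classification) depends on your native modules.

```typescript
// Hypothetical bundle of the runtime signals described above.
interface DeviceSignals {
  modelDownloaded: boolean;
  availableRamGb: number;
  thermalState: "nominal" | "fair" | "serious" | "critical";
  taskNeedsCloud: boolean; // e.g. multi-step reasoning, long documents
}

type Route = "on-device" | "cloud";

const MIN_AVAILABLE_RAM_GB = 3; // threshold from the text above

// Any single failing check sends the request to the cloud.
function chooseRoute(signals: DeviceSignals): Route {
  if (!signals.modelDownloaded) return "cloud";
  if (signals.availableRamGb < MIN_AVAILABLE_RAM_GB) return "cloud";
  if (signals.thermalState === "critical") return "cloud";
  if (signals.taskNeedsCloud) return "cloud";
  return "on-device";
}
```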

Be transparent with users about which path is active. A small indicator in your UI ("Using on-device AI" vs "Using cloud AI") builds trust and helps users understand why performance varies. Users on older devices who always see "cloud AI" may choose to upgrade. Users who value privacy will appreciate knowing their data stays on-device.

Unified API Design

Design your inference layer with a unified interface so the rest of your app code does not care whether inference runs on-device or in the cloud. A simple abstraction: an InferenceProvider interface with a generate(prompt, options) method that returns an async iterator of tokens. Your on-device provider wraps react-native-executorch; your cloud provider calls your backend which proxies to Claude or OpenAI. The UI layer streams tokens from either provider identically.
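Here is one possible shape for that abstraction. The InferenceProvider interface follows the description above; GenerateOptions, makeFakeProvider, and collect are hypothetical names added for the sketch. The fake provider stands in for both the on-device and cloud implementations so the example runs without any native code or network.

```typescript
// Hypothetical option bag; extend with temperature, stop sequences, etc.
interface GenerateOptions {
  maxTokens?: number;
}

// The contract the UI layer depends on, regardless of where inference runs.
interface InferenceProvider {
  readonly name: string;
  generate(prompt: string, options?: GenerateOptions): AsyncIterable<string>;
}

// Stand-in provider that replays a fixed token list. A real on-device
// provider would wrap your inference library; a real cloud provider
// would stream from your backend.
function makeFakeProvider(name: string, tokens: string[]): InferenceProvider {
  return {
    name,
    async *generate(_prompt: string, options?: GenerateOptions) {
      const limit = options?.maxTokens ?? tokens.length;
      for (const token of tokens.slice(0, limit)) {
        yield token;
      }
    },
  };
}

// UI-side consumer: identical for every provider.
async function collect(
  provider: InferenceProvider,
  prompt: string,
  options?: GenerateOptions
): Promise<string> {
  let output = "";
  for await (const token of provider.generate(prompt, options)) {
    output += token;
  }
  return output;
}
```

In a real screen you would append each token to component state inside the for await loop rather than collecting the full string first.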

Smart Model Preloading

Do not wait until the user taps the AI feature to start loading the model. Preload it in the background when the app launches, using low-priority CPU scheduling, provided the device has enough free memory to hold it alongside the rest of your app. On iOS, use Task.detached(priority: .background) in your Swift module to load the model without blocking the main thread. By the time the user navigates to the AI feature, the model is already in memory and the first generation starts immediately. This eliminates the 5-10 second "loading model" wait that makes on-device AI feel slow.

Sync and Context Continuity

For chat features, conversation context lives on-device. If the user switches to a new device or reinstalls the app, that context is gone. Decide explicitly whether conversation history is ephemeral (on-device only) or persistent (synced to your backend). For privacy-sensitive use cases, ephemeral is a feature. For productivity tools where users expect continuity, sync the conversation metadata (but not the model) to your backend. On cloud fallback requests, include the relevant conversation history so the cloud model has the same context the on-device model would have had.

Practical Use Cases and Implementation Patterns

On-device LLMs are best suited to a specific set of use cases. Here are the patterns that work in production, with implementation notes for each.

Smart Autocomplete

Autocomplete is the highest-value on-device AI use case because it is latency-sensitive (once a suggestion is requested, users expect it in under 200 ms), privacy-sensitive (you are reading what users type before they submit), and runs frequently. Use LLaMA 3.2 1B or a domain-fine-tuned smaller model rather than Phi-3 Mini. You do not need strong reasoning for next-word prediction. Debounce the autocomplete trigger to fire after 300 ms of no typing, use a short context window (128-256 tokens), and generate 3-5 token completions rather than full sentences. At LLaMA 3.2 1B speeds on iPhone 16 Pro, you can generate a 5-token suggestion in under 150 ms, well within user perception thresholds.
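The latency claim is easy to sanity-check with the benchmark figures from earlier. The helper names below are hypothetical, and the overhead parameter (prefill plus bridging) is a value you would need to measure yourself; the 40 ms used in the example is an assumption, not a benchmark.

```typescript
// Time to generate `tokens` tokens at a given generation throughput.
function estimateGenerationMs(tokens: number, generationTps: number): number {
  if (generationTps <= 0) throw new Error("generationTps must be positive");
  return (tokens / generationTps) * 1000;
}

// True when generation plus fixed overhead fits inside the budget.
function fitsLatencyBudget(
  tokens: number,
  generationTps: number,
  overheadMs: number,
  budgetMs: number
): boolean {
  return estimateGenerationMs(tokens, generationTps) + overheadMs <= budgetMs;
}

// LLaMA 3.2 1B at 48 TPS generation: 5 tokens take about 104 ms.
const fiveTokenMs = estimateGenerationMs(5, 48);

// With an assumed 40 ms of overhead, this still fits a 150 ms budget.
const withinBudget = fitsLatencyBudget(5, 48, 40, 150);
```

At the 10-14 TPS measured on mid-range Android, the same 5-token suggestion takes roughly 360-500 ms, which is why shorter completions or a cloud fallback make sense there.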

Offline Chat Assistant

For domain-specific chat assistants (customer support bots, onboarding helpers, in-app guides), on-device AI eliminates per-request API costs and works without connectivity. Fine-tune Phi-3 Mini or Gemma 2B on your domain-specific data before quantizing and deploying. Even a small fine-tuning dataset (500-1,000 high-quality examples) dramatically improves response quality for narrow domains. Host the fine-tuned model on your CDN and version it with your app updates. Update the model when your product changes significantly enough that responses would become stale.

Real-Time Translation

Real-time translation is a compelling use case for field workers, travelers, and global communication apps. Gemma 2B handles translation well for major language pairs. For real-time spoken translation, pipeline it: speech recognition via Apple's SFSpeechRecognizer or Android's SpeechRecognizer (both run on-device), pass the transcript to the LLM for translation, then text-to-speech for output. The combined latency for this pipeline on iPhone 16 Pro is around 1.5-2 seconds end-to-end, which is fast enough for turn-by-turn conversation.

Document Summarization

Summarizing documents the user has loaded locally (PDFs, notes, emails) is privacy-sensitive work where on-device AI shines. Users are understandably reluctant to send personal documents to a cloud API. With Phi-3 Mini's 4,096-token context window (in ExecuTorch's implementation), you can handle documents up to roughly 3,000 words in a single pass. For longer documents, chunk and hierarchically summarize: split into sections, summarize each section on-device, then summarize the summaries.
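The chunk-then-summarize approach can be sketched as follows. Everything here is illustrative: chunkByWords splits on whitespace rather than on real section boundaries, the 3,000-word default comes from the estimate above, and the summarize callback is a placeholder for whatever generation call your inference layer exposes.

```typescript
// Placeholder for your model call: takes text, resolves to a summary.
type Summarize = (text: string) => Promise<string>;

function splitWords(text: string): string[] {
  return text.split(/\s+/).filter((w) => w.length > 0);
}

// Splits text into chunks of at most `maxWords` words.
function chunkByWords(text: string, maxWords: number): string[] {
  if (maxWords < 1) throw new Error("maxWords must be at least 1");
  const words = splitWords(text);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}

// Summarizes each chunk, then summarizes the combined summaries,
// repeating until the text fits in a single pass.
async function summarizeHierarchically(
  text: string,
  summarize: Summarize,
  maxWords: number = 3000
): Promise<string> {
  const chunks = chunkByWords(text, maxWords);
  if (chunks.length <= 1) return summarize(text);

  const partials: string[] = [];
  for (const chunk of chunks) {
    // Sequential on purpose: one on-device model, one generation at a time.
    partials.push(await summarize(chunk));
  }
  const combined = partials.join("\n");

  // Guard against a summarizer that does not shrink its input,
  // which would otherwise recurse forever.
  if (splitWords(combined).length >= splitWords(text).length) {
    return summarize(combined);
  }
  return summarizeHierarchically(combined, summarize, maxWords);
}
```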

Embedding and Semantic Search

You do not need a full generative LLM for semantic search. Use a smaller embedding model (all-MiniLM-L6-v2 via react-native-executorch weighs under 100 MB) to generate embeddings for your local content, then do vector similarity search in-memory or with a local SQLite-based vector store. This pattern powers local-first note apps, offline document search, and personalized recommendation features with zero cloud dependency and sub-100 ms search latency.
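The in-memory variant of that search is a few lines of arithmetic once you have the embeddings. The sketch below assumes embeddings are plain number arrays of equal length produced by your embedding model; EmbeddedDoc and topK are hypothetical names. A brute-force scan like this is a reasonable starting point for small local collections.

```typescript
// Hypothetical record pairing a document id with its embedding.
interface EmbeddedDoc {
  id: string;
  embedding: number[];
}

// Cosine similarity: dot product divided by the product of the norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // zero vectors match nothing
}

// Brute-force search: score every document, return the k best.
function topK(
  query: number[],
  docs: EmbeddedDoc[],
  k: number
): { id: string; score: number }[] {
  return docs
    .map((doc) => ({ id: doc.id, score: cosineSimilarity(query, doc.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```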

The React Native on-device AI ecosystem is moving fast. The tooling that exists today would have been unthinkable two years ago, and the next two years will bring further improvements: smaller and more capable models, better NPU utilization, and richer React Native libraries that abstract away the native complexity. Now is the right time to build these features into your product. The competitive advantage of shipping them before your competitors do is real and measurable.

We build AI-powered React Native apps with on-device intelligence. Book a free strategy call to discuss your mobile AI project.
