
How to Build a Vision Pro Spatial Computing App From Scratch

The Vision Pro app ecosystem is still wide open, with only around 3,000 apps in the store. Here is how to build a spatial computing app that actually ships.

Nate Laquis

Founder & CEO

The visionOS Landscape in 2026: Wide Open Territory

Apple Vision Pro launched in early 2024, and two years later the ecosystem is still remarkably thin. As of early 2026, there are roughly 3,000 apps built specifically for visionOS. Compare that to the 1.8 million iOS apps or even the 50,000+ apps on Meta Quest, and you start to see the opportunity. The spatial computing gold rush has not happened yet. That is either a warning sign or the biggest whitespace opportunity in consumer tech, depending on how you read it.

We read it as opportunity. Apple's installed base is growing steadily, Vision Pro 2 rumors are heating up, and enterprise adoption is accelerating in sectors like healthcare, architecture, and industrial training. The companies building spatial apps right now are establishing category dominance before the mainstream wave hits. If you remember how early iPhone app developers captured entire verticals before competition existed, the same playbook applies here.

The barrier to entry is real, though. visionOS development requires Swift expertise, 3D content pipelines, and an understanding of spatial interaction paradigms that most mobile teams have never touched. The device itself costs $3,499, Xcode and Reality Composer Pro have steep learning curves, and testing in the simulator only gets you about 60% of the way to production confidence. This guide covers every step of building a Vision Pro app from zero to App Store submission, including the pitfalls we have encountered shipping spatial apps for clients at Kanopy.


Development Tools: Your visionOS Toolkit

Building for Vision Pro requires a specific set of Apple-provided tools, and getting comfortable with all of them before writing your first line of spatial code will save you weeks of frustration. Here is what you need and why each piece matters.

Xcode 16+ with visionOS SDK

Xcode is your primary IDE, and Apple has added visionOS-specific project templates, a spatial computing simulator, and debugging tools for 3D scenes. You need a Mac with Apple Silicon (M1 or later) running macOS 15+. The visionOS SDK ships with Xcode, so there is no separate download. Create a new project, select the visionOS template, and you get a working app with a window, a volume, or an immersive space depending on which starter you choose.

Reality Composer Pro

This is Apple's visual editor for 3D scenes, materials, and particle effects. Think of it as Interface Builder for spatial content. You assemble USDZ assets, apply physically-based materials, set up lighting, and define spatial audio sources. The output is a RealityKit scene that your Swift code can load and manipulate at runtime. Reality Composer Pro also handles shader graph editing, which is critical for custom visual effects that run on the M2 chip's GPU.
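
To give a feel for the runtime side, here is a minimal sketch of loading a Reality Composer Pro scene from Swift, assuming the default RealityKitContent package that the visionOS app template generates ("Scene" is the template's default root name and may differ in your project):

```swift
import SwiftUI
import RealityKit
import RealityKitContent  // the Swift package Reality Composer Pro generates

struct AssemblyView: View {
    var body: some View {
        RealityView { content in
            // "Scene" is the default root entity name in Apple's template.
            if let scene = try? await Entity(named: "Scene", in: realityKitContentBundle) {
                content.add(scene)
            }
        }
    }
}
```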

RealityKit

RealityKit is the rendering and simulation framework for visionOS. It handles entity-component architecture, physics, spatial audio, skeletal animation, and hand/eye interaction. Unlike SceneKit (which Apple has deprecated for new spatial work), RealityKit is built specifically for the mixed reality context. It understands room boundaries, surface detection, and object occlusion out of the box. Your 3D entities exist in the real world, not just in a virtual scene.
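
As a minimal sketch of that entity-component architecture, here is a hypothetical component that marks entities to spin, and a system that advances them once per frame:

```swift
import RealityKit

// A custom component marking entities that should spin in place.
struct SpinComponent: Component {
    var radiansPerSecond: Float = .pi
}

// A system that advances every spinning entity each frame.
struct SpinSystem: System {
    static let query = EntityQuery(where: .has(SpinComponent.self))

    init(scene: RealityKit.Scene) {}

    func update(context: SceneUpdateContext) {
        for entity in context.entities(matching: Self.query, updatingSystemWhen: .rendering) {
            guard let spin = entity.components[SpinComponent.self] else { continue }
            let angle = spin.radiansPerSecond * Float(context.deltaTime)
            entity.transform.rotation *= simd_quatf(angle: angle, axis: [0, 1, 0])
        }
    }
}

// Register once at launch, then tag any entity:
//   SpinComponent.registerComponent()
//   SpinSystem.registerSystem()
//   myEntity.components.set(SpinComponent(radiansPerSecond: .pi / 2))
```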

SwiftUI for Spatial Computing

SwiftUI on visionOS extends the familiar declarative UI framework with spatial modifiers. You can place windows in 3D space, add depth effects, create volumetric content, and transition between 2D and 3D contexts. The key additions are WindowGroup (for flat UI), WindowGroup with the volumetric window style (for bounded 3D content), and ImmersiveSpace (for full environment experiences). If your team already knows SwiftUI for iOS, the spatial extensions feel natural.
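
Here is a sketch of how those three scene types sit together in an app entry point; the view names and scene IDs are placeholders:

```swift
import SwiftUI

@main
struct SpatialApp: App {
    @State private var immersionStyle: ImmersiveStyle = .mixed

    var body: some Scene {
        // A flat window: standard SwiftUI on glass.
        WindowGroup(id: "main") {
            ContentView()
        }

        // A bounded 3D container, sized in real-world units.
        WindowGroup(id: "viewer") {
            ModelViewer()
        }
        .windowStyle(.volumetric)
        .defaultSize(width: 0.6, height: 0.6, depth: 0.6, in: .meters)

        // A full environment the user enters explicitly.
        ImmersiveSpace(id: "immersive") {
            ImmersiveView()
        }
        .immersionStyle(selection: $immersionStyle, in: .mixed, .progressive, .full)
    }
}

// Bare protocol alias used above for brevity.
typealias ImmersiveStyle = any ImmersionStyle
```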

ARKit on visionOS

ARKit provides world understanding: plane detection, mesh reconstruction, scene geometry, image tracking, and skeletal hand tracking. On Vision Pro, ARKit feeds data to RealityKit automatically for occlusion and physics, but you can also access raw data for custom interactions. The hand tracking API gives you 27 joint positions per hand at 90Hz, which is more than enough for gesture recognition systems.
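
A minimal sketch of the raw hand tracking loop, assuming it runs while an ImmersiveSpace is open (ARKit on Vision Pro only delivers data inside an immersive space, and only after the user grants hand tracking permission):

```swift
import ARKit

// Minimal hand tracking loop for visionOS.
final class HandTrackingModel {
    private let session = ARKitSession()
    private let provider = HandTrackingProvider()

    func start() async {
        do {
            try await session.run([provider])
            for await update in provider.anchorUpdates {
                let anchor = update.anchor
                guard anchor.isTracked, let skeleton = anchor.handSkeleton else { continue }
                // Joint transforms are relative to the hand anchor; multiply
                // by the anchor's transform to get world-space positions.
                let tip = skeleton.joint(.indexFingerTip)
                let world = anchor.originFromAnchorTransform * tip.anchorFromJointTransform
                print("\(anchor.chirality) index tip at \(world.columns.3)")
            }
        } catch {
            print("Hand tracking unavailable: \(error)")
        }
    }
}
```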

App Architecture: Windows, Volumes, and Immersive Spaces

visionOS apps are not monolithic 3D experiences. Apple designed a spectrum of spatial presence, and understanding the three app types is fundamental to making the right architectural decisions for your product.

Windows

A window on visionOS looks like a floating iPad screen in your physical space. It renders standard SwiftUI views with a glass-like background material. Windows are the simplest entry point for existing iOS apps because your UIKit or SwiftUI code largely works unchanged. Users can resize, reposition, and stack windows. If your app is primarily 2D content with some spatial flourishes, windows are your foundation. Think: productivity tools, reading apps, dashboards, video players.

Volumes

A volume is a bounded 3D container that floats in the user's space. It can display RealityKit entities, 3D models, animations, and interactive objects within a defined bounding box. Volumes are perfect for product configurators, molecular visualizers, architectural models, or any experience where users inspect a 3D object from multiple angles. The key constraint is that volumes stay within their bounds. Users can reposition them but content does not spill into the room.

Immersive Spaces

An immersive space breaks free of boundaries and places content anywhere in the user's physical environment, or replaces the environment entirely. Apple offers three immersion styles: mixed (content overlaid on passthrough), progressive (a portal that expands as the user engages), and full (completely replaces the real world). Immersive spaces are where Vision Pro truly differentiates from a flat screen. Training simulations, virtual showrooms, collaborative design environments, and games all belong here.

Most production apps combine these modes. A real estate app might start with a window showing listings, transition to a volume for a 3D model preview, and then open an immersive space for a full walkthrough. The transitions between modes are managed through SwiftUI's environment and scene lifecycle, and Apple provides smooth system-level animations between them.
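
A sketch of those transitions using the standard SwiftUI environment actions, reusing the placeholder scene IDs from the entry-point example above:

```swift
import SwiftUI

struct ListingDetailView: View {
    @Environment(\.openWindow) private var openWindow
    @Environment(\.openImmersiveSpace) private var openImmersiveSpace
    @Environment(\.dismissImmersiveSpace) private var dismissImmersiveSpace

    var body: some View {
        VStack(spacing: 16) {
            // Window -> volume: open the bounded 3D preview.
            Button("Preview 3D Model") { openWindow(id: "viewer") }

            // Volume -> immersive space: enter the full walkthrough.
            Button("Start Walkthrough") {
                Task {
                    switch await openImmersiveSpace(id: "immersive") {
                    case .opened: break
                    default: print("Immersive space failed to open")
                    }
                }
            }

            Button("Exit Walkthrough") {
                Task { await dismissImmersiveSpace() }
            }
        }
    }
}
```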

One architectural decision you need to make early: will your app use a single scene or multiple scenes? Multi-scene apps can present windows and volumes simultaneously, which is powerful for productivity workflows but adds complexity to state management. For most first-time visionOS projects, we recommend starting with a single-scene architecture and adding scenes only when the UX clearly demands it.


Hand Tracking and Eye Tracking: Interaction Without Controllers

Vision Pro has no controllers. Every interaction happens through hands and eyes, which is both liberating and deeply challenging from a UX perspective. Your users pinch, tap, drag, rotate, and zoom using natural hand gestures, and the system tracks their gaze to determine intent. Getting interaction design right is the difference between a spatial app that feels magical and one that feels frustrating.

Eye Tracking as Intent

The eye tracking system on Vision Pro is privacy-preserving by design. Your app never gets raw gaze coordinates. Instead, the system uses eye position internally to determine which UI element the user is looking at, rendering hover effects and routing input to that element. From your code's perspective, you handle the same gesture recognizers you would on iOS: TapGesture, DragGesture, RotateGesture, MagnifyGesture. The difference is that targeting happens via gaze rather than finger position on glass.
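
A minimal sketch of entity-targeted gestures in a RealityView; the entity needs an InputTargetComponent and collision shapes before it can receive input:

```swift
import SwiftUI
import RealityKit

struct InteractiveModel: View {
    var body: some View {
        RealityView { content in
            let box = ModelEntity(
                mesh: .generateBox(size: 0.2),
                materials: [SimpleMaterial(color: .cyan, isMetallic: false)]
            )
            // Required before an entity can receive gestures.
            box.components.set(InputTargetComponent())
            box.generateCollisionShapes(recursive: true)
            content.add(box)
        }
        .gesture(
            TapGesture()
                .targetedToAnyEntity()
                .onEnded { value in
                    // Fires on the entity the user gazed at and pinched.
                    value.entity.transform.scale *= 1.1
                }
        )
    }
}
```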

This means your interactive elements need generous hit targets. Apple recommends a minimum of 60 points for tap targets in spatial UI, compared to 44 points on iOS. Eye tracking has inherent imprecision (roughly 1-2 degrees of angular error), so cramming buttons together will frustrate users. Space your controls liberally and use hover states to confirm what the user is targeting before they commit to a tap.
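
As an illustrative sketch, here is a hypothetical custom control sized to that guidance. System controls like Button get hover effects automatically; custom tappable views need an explicit .hoverEffect() so users see gaze confirmation:

```swift
import SwiftUI

struct ColorSwatch: View {
    let color: Color
    let onSelect: () -> Void

    var body: some View {
        Circle()
            .fill(color)
            .frame(width: 60, height: 60)  // minimum spatial tap target
            .padding(8)                     // extra spacing from neighbors
            .contentShape(.circle)
            .hoverEffect(.highlight)        // gaze confirmation before the pinch
            .onTapGesture(perform: onSelect)
    }
}
```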

Hand Tracking for Direct Manipulation

For immersive experiences, you can access ARKit's hand tracking API directly. This gives you per-joint positions for all 27 joints on each hand, updated at 90Hz. You can build custom gestures: a thumbs-up for confirmation, a palm-out for stop, a point for selection, or any hand pose your experience requires. The hand tracking works up to about 1.5 meters from the user's face, which defines your comfortable interaction zone.

Custom gesture recognition typically involves comparing joint positions against known poses using simple distance thresholds or, for more complex gestures, a Core ML classifier trained on hand pose data. We have found that a training set of 200-300 samples per gesture is sufficient for reliable recognition when combined with temporal smoothing.
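
Here is a sketch of the distance-threshold approach, assuming HandAnchor updates from the tracking loop shown earlier; the 2 cm threshold is a starting point to tune, and in production you would also smooth the result over several frames:

```swift
import ARKit
import simd

// Distance-threshold pinch check between thumb and index fingertips.
func isPinching(_ anchor: HandAnchor, threshold: Float = 0.02) -> Bool {
    guard let skeleton = anchor.handSkeleton else { return false }
    let thumb = skeleton.joint(.thumbTip)
    let index = skeleton.joint(.indexFingerTip)
    guard thumb.isTracked, index.isTracked else { return false }
    // Both joints are expressed in the anchor's space, so they compare directly.
    let t = thumb.anchorFromJointTransform.columns.3
    let i = index.anchorFromJointTransform.columns.3
    return simd_distance(SIMD3(t.x, t.y, t.z), SIMD3(i.x, i.y, i.z)) < threshold
}
```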

Interaction Design Principles

After shipping several Vision Pro apps, here are the interaction patterns we have found work best:

  • Indirect manipulation for UI: Let users look at buttons and pinch to tap. Do not make them reach out and physically press floating buttons. It causes arm fatigue within minutes.
  • Direct manipulation for objects: When users interact with 3D objects in a volume or immersive space, let them grab and rotate with natural hand movements. This feels intuitive and satisfying.
  • Progressive disclosure: Do not present 20 controls at once. Use gaze-triggered panels that appear when users look at specific regions of your interface.
  • Audio feedback: Spatial audio cues confirming interactions compensate for the lack of haptic feedback. A subtle click sound at the correct 3D position reinforces that an action registered (see the sketch after this list).
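
A sketch of that positional click feedback using RealityKit's audio APIs; "click.wav" is a placeholder for a short sound asset bundled with your app:

```swift
import RealityKit

// Positional click feedback: attach spatial audio to the interacted
// entity and play a short sound from its location.
func playClick(on entity: Entity) {
    entity.components.set(SpatialAudioComponent(gain: -10))
    if let click = try? AudioFileResource.load(named: "click.wav") {
        _ = entity.playAudio(click)
    }
}
```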

3D Content Pipeline: From Blender to Vision Pro

Unless your app is purely 2D windows, you need 3D assets. The content pipeline for visionOS centers on USDZ (Universal Scene Description, zipped), Apple's preferred format for 3D content. Getting assets from creation tools into your app efficiently is one of the biggest practical challenges teams face.

The USDZ Format

USDZ is a single-file archive containing geometry, materials, textures, animations, and scene hierarchy in Apple's subset of Pixar's USD format. It supports physically-based rendering (PBR) materials, skeletal animation, blend shapes, and spatial audio attachment points. The format is well-suited for Vision Pro because RealityKit can load and render USDZ files with zero conversion overhead at runtime.

Blender to Reality Composer Pro Workflow

Most teams we work with use Blender for 3D asset creation because it is free, powerful, and has an enormous community. The pipeline looks like this:

  • Model in Blender: Create your geometry, UV unwrap, and apply PBR materials using Blender's Principled BSDF shader (which maps cleanly to USDZ's material model).
  • Export as USDZ: Blender's built-in USD exporter handles geometry, materials, and animations. Keep texture sizes reasonable: 2048x2048 max for most assets, 4096x4096 only for hero elements that users will inspect closely.
  • Import into Reality Composer Pro: Open your USDZ in Reality Composer Pro to verify materials render correctly under visionOS lighting. Adjust material properties if needed, as Blender's lighting model differs slightly from RealityKit's.
  • Add behaviors and physics: Reality Composer Pro lets you attach collision shapes, physics bodies, and interaction behaviors without writing code. For simple interactions (tap to animate, drag to rotate), this visual approach is faster than coding.
  • Reference in Xcode: Add the Reality Composer Pro project to your Xcode workspace. RealityKit loads scenes by name at runtime, and you get compile-time validation that your asset references are correct.

Performance Budgets

Vision Pro renders at 23 million pixels per frame (two 4K displays at 90Hz). That sounds powerful, but the M2 chip has thermal constraints inside a headset form factor. Apple recommends staying under 150,000 triangles per frame for smooth performance, with no more than 50 draw calls. For context, a detailed character model might be 30,000 triangles, so you have room for a moderately complex scene but not a AAA game environment.

Texture memory is your other constraint. The M2 has 16GB unified memory shared between the system, your app, and background processes. Budget 2-3GB for your app's total texture footprint. Use texture atlasing and mipmapping aggressively, and consider Apple's texture compression formats (ASTC) to reduce memory usage by 4-8x compared to uncompressed RGBA.


On-Device AI with the M5 Neural Engine

The Vision Pro 2 (expected late 2026) will ship with Apple's M5 chip, which features a Neural Engine roughly 50% faster than the M4 generation. But even on the current M2-based Vision Pro, on-device machine learning is a core capability that separates compelling spatial apps from static 3D viewers.

Core ML on visionOS

Core ML runs inference directly on the Neural Engine, GPU, or CPU depending on model characteristics and available resources. For spatial computing, the most impactful ML use cases are:

  • Object recognition: Identify real-world objects through the passthrough cameras and overlay contextual information. A maintenance app can recognize machinery components and display repair procedures spatially anchored to the actual part.
  • Custom gesture classification: Train a model on hand pose sequences to recognize domain-specific gestures. Medical training apps use this for surgical gesture assessment.
  • Spatial audio enhancement: ML models can enhance spatial audio positioning and noise cancellation for collaborative environments where multiple users are speaking.
  • Real-time style transfer: Apply artistic or informational overlays to the passthrough view. Architecture apps use this to show how a renovation would look overlaid on the current room.

Create ML and Model Optimization

Apple's Create ML tool lets you train classification and object detection models directly on Mac without deep ML expertise. For more complex models, you can convert PyTorch or TensorFlow models to Core ML format using coremltools. The key optimization step is quantization: converting 32-bit float weights to 16-bit or even 8-bit integers, which reduces model size by 2-4x and inference time proportionally, with minimal accuracy loss for most tasks.

On the M5 Neural Engine, we expect to see models with 1-2 billion parameters running at interactive speeds (under 50ms per inference). This opens the door for on-device LLM assistants, real-time scene understanding, and generative content creation without round-tripping to cloud servers. The privacy implications are significant: enterprise customers in healthcare and defense are specifically seeking spatial apps that process sensitive data entirely on-device.

Combining ARKit and Core ML

The most powerful pattern we have seen in production is feeding ARKit's scene understanding data into custom ML models. ARKit gives you room geometry, detected planes, object positions, and hand poses. A Core ML model can interpret this composite scene data to infer user intent, predict next actions, or classify the environment type. One client's industrial training app uses this pattern to detect when a trainee's hand position relative to equipment is unsafe and provides real-time spatial warnings anchored to the danger zone.
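
A sketch of the glue code for that pattern: flattening per-joint hand positions into a feature vector and running a Core ML classifier over it. The feature and output names ("poseFeatures", "label") are placeholders that depend on how your model was trained:

```swift
import ARKit
import CoreML

// Flatten per-joint hand positions into a feature vector and classify.
func classifyPose(_ anchor: HandAnchor, with model: MLModel) throws -> String? {
    guard let skeleton = anchor.handSkeleton else { return nil }
    let joints = skeleton.allJoints
    let features = try MLMultiArray(shape: [NSNumber(value: joints.count * 3)], dataType: .float32)
    for (i, joint) in joints.enumerated() {
        let p = joint.anchorFromJointTransform.columns.3
        features[i * 3 + 0] = NSNumber(value: p.x)
        features[i * 3 + 1] = NSNumber(value: p.y)
        features[i * 3 + 2] = NSNumber(value: p.z)
    }
    let input = try MLDictionaryFeatureProvider(dictionary: ["poseFeatures": features])
    let output = try model.prediction(from: input)
    return output.featureValue(for: "label")?.stringValue
}
```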

Testing, Debugging, and Distribution

Testing spatial apps is fundamentally different from testing mobile apps. You cannot simply tap through screens on a simulator and call it done. Spatial interactions, depth perception, comfort, and real-world environment variability all require physical testing that no simulator can replicate.

The Simulator vs. Reality Gap

Xcode's visionOS Simulator lets you build and run apps without a physical device. It renders your windows, volumes, and immersive spaces in a virtual room, and you can simulate hand gestures with mouse clicks. This is fine for layout work, state management debugging, and basic interaction flow validation. However, the simulator cannot test:

  • Eye tracking precision: The simulator uses mouse hover as a proxy for gaze, but real eye tracking has different characteristics (saccades, drift, calibration variation between users).
  • Depth perception and comfort: Content that looks fine on a 2D screen can cause eye strain or feel uncomfortably close in a headset. Minimum comfortable distance for content is about 1 meter from the user.
  • Hand tracking reliability: Real hands vary in size, skin tone, lighting conditions, and occlusion. The simulator assumes perfect tracking.
  • Thermal performance: The headset throttles CPU/GPU under sustained load. Your app might run perfectly in the simulator and stutter on device after 10 minutes.
  • Passthrough quality: Mixed reality experiences depend on the passthrough cameras. Color matching, latency, and edge detection all affect how convincingly your virtual content integrates with the real world.

Our recommendation: spend roughly 60% of development time in the simulator (layout, logic, asset iteration) and reserve the remaining 40% for on-device testing. Budget $3,499 for at least one Vision Pro unit per development team.

Debugging Tools

Xcode provides spatial-specific debugging overlays: wireframe rendering, collision shape visualization, draw call counts, and GPU/CPU frame time graphs. Reality Composer Pro has a "Validate" feature that checks your scenes for common issues: oversized textures, excessive polygon counts, and missing collision shapes. Instruments on macOS can profile visionOS apps for memory leaks, thread contention, and thermal state, though you need a USB-C connection to the headset for real-time profiling.

Distribution Paths

Getting your app to users follows familiar Apple patterns with a few spatial-specific considerations:

  • App Store: Standard review process. Apple is strict about comfort guidelines: your app must not cause motion sickness, must respect the user's physical space boundaries, and must provide clear entry/exit from immersive experiences.
  • TestFlight: Works identically to iOS. Up to 10,000 external testers. This is your best path for beta validation with real users in real environments.
  • Enterprise deployment: Apple Business Manager supports visionOS MDM. Companies can distribute internal apps to managed devices without App Store review. This is the primary channel for industrial training and healthcare applications.
  • Unity apps: If you built with Unity's PolySpatial framework, distribution follows the same paths. Apple treats Unity-based visionOS apps identically in review.

For a deeper breakdown of costs and timelines, see our guide on how much it costs to build a Vision Pro app. If you are also evaluating Meta Quest as a platform, our React Native Meta Quest guide covers the alternative ecosystem.

Cost, Timeline, and Getting Started

Let us talk real numbers. Building a polished Vision Pro app that ships to the App Store is not cheap, but it is more accessible than most teams expect if you scope correctly and choose the right architecture from day one.

Realistic Cost Ranges

  • Simple windowed app (existing iOS port): $15K to $40K. If you already have a SwiftUI iOS app, adapting it for visionOS windows with some spatial enhancements takes 3-6 weeks.
  • Volume-based product viewer or configurator: $50K to $90K. This includes 3D asset creation, interaction design, and performance optimization. Timeline is 8-12 weeks.
  • Full immersive experience: $100K to $200K+. Complex immersive apps with custom hand interactions, on-device ML, spatial audio, and multi-user features take 12-20 weeks and require a team with 3D pipeline expertise.

These ranges assume a team that already has Swift and RealityKit experience. If your team is learning visionOS from scratch, add 30-40% for ramp-up time and the inevitable architectural pivots that come from discovering platform constraints the hard way.

Team Composition

A typical Vision Pro project at Kanopy involves:

  • 1 senior Swift/visionOS engineer (full-time)
  • 1 3D artist/technical artist (half-time, front-loaded)
  • 1 UX designer with spatial computing experience (quarter-time throughout)
  • 1 project lead for scope management and Apple review preparation

The 3D artist role is often underestimated. Even if your app is not a game, you need someone who understands USDZ optimization, PBR materials, texture budgets, and the visual language of spatial computing. Developers who try to handle 3D assets themselves typically produce results that look amateurish on a device where Apple's own apps set an extremely high visual bar.

Where to Start Right Now

If you are serious about building for Vision Pro, here is the sequence we recommend:

  • Week 1: Download Xcode, run Apple's sample visionOS projects in the simulator, and complete Apple's "Develop for visionOS" learning path on their developer site.
  • Week 2: Build a simple volume-based app that loads a USDZ model and supports basic rotation/zoom gestures. This exercises the full pipeline without overwhelming complexity.
  • Week 3: Get hands on a physical device. Test your prototype, experience the comfort issues firsthand, and calibrate your intuition about depth, scale, and interaction zones.
  • Week 4: Define your real product scope based on what you have learned. Cut features that require capabilities you are not confident about, and build a timeline that accounts for the testing overhead spatial apps require.

The spatial computing market is still forming. Companies that ship polished Vision Pro apps in 2026 will own their categories before competition arrives. The tools are mature enough, the platform is stable, and Apple is actively promoting spatial apps with featured placement. For more context on budgeting across AR and VR platforms, check our breakdown of AR/VR app development costs.

If your team wants to move fast without the ramp-up pain, we have shipped spatial apps for enterprise and consumer clients on both Vision Pro and Meta Quest. Book a free strategy call and we will map out your architecture, timeline, and budget in 30 minutes.
