Why this category exists in the first place
Three years ago, shipping a custom AI model to production meant one of two painful paths. Either you rented a bare GPU box from a cloud provider, wrote your own queueing layer, built a container orchestration system, figured out autoscaling on your own, and prayed the instance did not fall over at 3 a.m. Or you handed your weights to a hyperscaler managed service, paid a markup that made your CFO wince, and accepted that debugging would involve opening support tickets with a vendor that barely knew what a diffusion model was.
Neither option worked for startups. The first required a platform engineer who also happened to be a CUDA specialist. The second was priced for enterprises with procurement departments. A new category emerged to fill the gap: serverless GPU platforms designed specifically for machine learning workloads. Modal, Replicate, and Baseten are the three names that keep coming up in founder conversations, and they represent three genuinely different philosophies about what model deployment should feel like.
Modal treats GPU inference like a compute primitive. You write Python, decorate a function, and Modal handles the rest. Replicate treats models like packages. You containerize your model with Cog, push it to a registry that looks a bit like Docker Hub, and anyone can run it with a single API call. Baseten treats deployment like product engineering. You package with Truss, you configure scaling rules, and the platform optimizes cold starts and throughput for you. All three bill for active GPU time rather than flat monthly instances, all three support the usual accelerators, and all three can host the same underlying PyTorch weights. The differences show up in how you build, how you scale, and how much you pay when traffic spikes.
This guide compares them across pricing, cold starts, autoscaling, developer experience, and workload fit. If you are still deciding whether to deploy your own model at all, start with our piece on self-hosted LLMs versus API providers. If you are already deploying and the bill is climbing faster than your user count, our cost management guide is the better starting point.
Modal: Python native serverless GPUs
Modal's pitch is that you should never write a Dockerfile again. You describe your environment in Python, decorate a function with a GPU requirement, and push. The platform provisions a container, installs dependencies, downloads weights, and executes the function. When traffic stops, the container is torn down and billing halts. When traffic resumes, Modal spins up a new one.
The core primitive is the @app.function decorator. You can attach a GPU type with gpu="A10G" or gpu="A100-40GB", mount a persistent volume for weights, and pin a specific CUDA base image. The same function can be called as a regular Python function locally, invoked as a web endpoint, scheduled as a cron job, or triggered from a queue. That flexibility is unusual in this category. Replicate wants every model wrapped in a Cog predictor, and Baseten wants every deployment packaged as a Truss. Modal is closer to Cloudflare Workers or AWS Lambda, except the function happens to have an A100 attached.
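Here is a minimal sketch of that pattern, assuming a recent version of the Modal SDK. The model logic is a placeholder, and details like the App constructor and decorator arguments have shifted across SDK versions, so treat this as the shape rather than copy paste code. The same decorated function can also be exposed as a web endpoint or attached to a schedule.

```python
import modal

# The container environment is described in Python instead of a Dockerfile.
image = modal.Image.debian_slim().pip_install("torch", "transformers")
weights = modal.Volume.from_name("model-weights", create_if_missing=True)

app = modal.App("inference-demo", image=image)


@app.function(gpu="A10G", volumes={"/weights": weights}, timeout=600)
def generate(prompt: str) -> str:
    # Placeholder: load the model from /weights and run a forward pass.
    return f"generated text for: {prompt}"


@app.local_entrypoint()
def main():
    # The same function, invoked remotely from your laptop: `modal run this_file.py`
    print(generate.remote("a red bicycle in the rain"))
```

Deploying with `modal deploy` turns the same code into a persistent app that scales to zero when idle.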
Pricing is straightforward. At the time of writing, an A10G runs about $1.97 per hour, billed per second of actual execution. An A100 40GB is roughly $4.45 per hour. H100s are available for teams that need them, at a noticeable premium. There is no flat overhead fee, no minimum monthly spend, and no charge when your containers are idle and torn down. You pay for the seconds a GPU is actively running your code.
Where Modal shines is batch workloads, fine-tuning jobs, and inference pipelines that involve more than just a single forward pass. If your product does speech transcription followed by diarization followed by summarization, Modal lets you express that as three chained Python functions, each with its own GPU or CPU requirement, and the platform handles the fan out. The ergonomics for LLM fine-tunes are also strong, because you can mount a training dataset as a volume, run a multi hour job, and have results written back automatically when it completes.
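A simplified sketch of that pipeline shape, reusing the app object from the snippet above, looks like the following. The function bodies are placeholders, not a real transcription or summarization stack.

```python
@app.function(gpu="A10G")
def transcribe(audio_chunk_path: str) -> str:
    # Placeholder: run an ASR model such as Whisper over one audio chunk.
    return "transcript for " + audio_chunk_path


@app.function()  # CPU only: this stage declares no GPU requirement
def summarize(transcripts: list[str]) -> str:
    # Placeholder: call a summarization model over the joined transcripts.
    return " ".join(transcripts)[:500]


@app.local_entrypoint()
def pipeline():
    chunks = [f"chunk_{i}.wav" for i in range(50)]
    # .map() fans the chunks out across containers; Modal scales them up and
    # tears them down when the batch finishes, so you pay only for busy seconds.
    transcripts = list(transcribe.map(chunks))
    print(summarize.remote(transcripts))
```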
The trade off is that Modal is a platform, not a marketplace. There is no public catalog of pre built models. If you want Whisper or Stable Diffusion XL, you deploy it yourself. That is fine for teams with an ML engineer, but it is a stumbling block for teams that just want to call an API.
Replicate: the model marketplace
Replicate took the opposite approach. Rather than expose raw compute, Replicate built an opinionated packaging format called Cog and an opinionated hosting layer on top of it. Every model on Replicate is a Cog container with a predict.py file, a schema for inputs and outputs, and a declared hardware requirement. Once pushed, the model becomes a versioned, callable endpoint that any API consumer can use by pasting a snippet from the documentation.
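A Cog model is, at minimum, a predictor class plus a cog.yaml that declares the Python version, packages, and GPU requirement. This is a hedged sketch of the predictor shape; the model here is a placeholder and the exact Input options vary by Cog version.

```python
# predict.py
from cog import BasePredictor, Input


class Predictor(BasePredictor):
    def setup(self):
        # Runs once per container start: load weights here so predict() stays fast.
        self.model = None  # placeholder for a real model load

    def predict(
        self,
        prompt: str = Input(description="Text prompt for the model"),
        max_tokens: int = Input(description="Maximum tokens to generate", default=256),
    ) -> str:
        # Cog derives the public input/output schema from this signature.
        return f"output for: {prompt} (up to {max_tokens} tokens)"
```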
This model first abstraction is why Replicate feels closer to a SaaS than a cloud. If you want to run Stable Diffusion, you pick a community model, call replicate.run() from Python or Node, and pay per second of execution. The community has published thousands of models across image generation, speech, video, code, and LLMs. For founders prototyping an AI feature, Replicate is often the fastest path from idea to working demo.
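Calling a hosted model from the Python client looks roughly like this. The model slug and input keys are illustrative; every model's page lists its real schema, and authentication comes from the REPLICATE_API_TOKEN environment variable.

```python
import replicate

# The slug pins a specific published version: owner/name:version-hash.
output = replicate.run(
    "stability-ai/sdxl:<version-hash>",
    input={"prompt": "an astronaut riding a horse, studio lighting"},
)
print(output)
```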
Billing is per second, with rates that vary by GPU tier. An A100 40GB is around $0.00115 per second, which works out to roughly $4.14 per hour of active compute. Smaller GPUs like T4s are cheaper, larger ones like H100s are pricier. Replicate does not charge for idle time, and there is no overhead fee. If your model runs for 4 seconds on an A100, you pay for 4 seconds on an A100.
The downside of Replicate's abstraction is that you give up control. Cold starts are not always predictable, because the platform manages container lifecycle on shared infrastructure. If your model is popular or always warm, latency is excellent. If your model is niche and invoked once an hour, you may wait 20 seconds or more for the first response after an idle period. Replicate offers dedicated deployments for models that need guaranteed warm capacity, but that moves the pricing model closer to Baseten's.
Cog itself is polarizing. Some teams love that it imposes structure and makes models portable. Others find it restrictive, particularly when they need custom routing, non standard request shapes, or stateful behavior. If your use case fits inside a single predict function that takes JSON in and returns JSON or a file, Cog is excellent. If your use case is a streaming LLM with token by token output, complex session state, or a multi stage pipeline, Cog starts to feel like a constraint.
Baseten: dedicated deployments with cold start optimization
Baseten sits in a different part of the design space. Rather than compete with Modal on raw serverless compute or Replicate on marketplace breadth, Baseten focuses on operationalizing production model inference for teams that already know what they are deploying. Every model is wrapped in Truss, Baseten's open source packaging format, and every deployment lives on dedicated infrastructure that you configure for minimum and maximum replica counts.
Truss feels closer to a production Dockerfile than Cog does. You describe your environment, your Python dependencies, your system packages, and your model server logic in a single config.yaml plus a small Python class. The format supports custom servers, so if you want to run vLLM, TensorRT-LLM, or a homegrown inference stack, Truss will wrap it. This matters for LLM workloads where engine choice dramatically changes throughput and latency.
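The Python half of a minimal Truss looks roughly like this, with config.yaml alongside it declaring requirements and the accelerator. The interface below reflects recent Truss documentation and the model logic is a placeholder.

```python
# model/model.py inside a Truss directory; config.yaml next to it declares
# Python dependencies, system packages, and the GPU (e.g. an A100).
class Model:
    def __init__(self, **kwargs):
        self._model = None

    def load(self):
        # Called once when a replica starts: load weights and warm caches here.
        self._model = object()  # placeholder for a real model load

    def predict(self, model_input: dict) -> dict:
        # Called per request with the parsed JSON body.
        prompt = model_input.get("prompt", "")
        return {"output": f"generated text for: {prompt}"}
```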
Pricing is where Baseten diverges from the per second billing of Modal and Replicate. You still pay for GPU hours at competitive market rates, but Baseten also charges a platform overhead that lands roughly between $0.25 and $2.50 per GPU hour depending on tier and commitment. In exchange, you get dedicated replicas, aggressive cold start optimization, autoscaling policies you actually control, model caching across instances, and a monitoring layer that rivals what you would build internally.
The cold start work is a real differentiator. Baseten invests heavily in techniques like weight pre loading, container image layering, and GPU memory snapshotting to bring LLM cold starts down from tens of seconds to a few seconds. For a product where a user hits a button and expects a response in under two seconds, this optimization is the difference between shipping and not shipping.
The trade off is cost predictability. Because you configure replica floors, you can end up paying for idle GPUs during quiet hours. Baseten gives you scale to zero as an option, but teams optimizing for p99 latency rarely use it. If your traffic is steady and your latency SLA is tight, Baseten is often the cheapest real world option. If your traffic is spiky and you can tolerate occasional cold starts, Modal or Replicate will usually come out ahead on raw bill.
Pricing breakdown with real math
Abstract per hour rates do not tell you what you will actually pay. Let us walk through three concrete workloads with numbers that reflect current list pricing.
Workload one: a Stable Diffusion XL image generator serving 50,000 requests per day. Assume 3 seconds of A100 40GB time per request. That is 150,000 GPU seconds daily, or about 41.7 GPU hours. On Modal at $4.45 per hour, your daily bill is roughly $185. On Replicate at $0.00115 per second, it is $172.50. On Baseten, you would configure two or three warm replicas plus burst capacity, which puts effective cost closer to $210 to $240 but with tighter latency. For image generation where users wait for results, Replicate or Modal usually wins on pure cost.
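The arithmetic behind those numbers is simple enough to put in a small helper and rerun as your assumptions change. The rates below are the list prices quoted earlier and will drift.

```python
def daily_gpu_cost(requests_per_day: int, seconds_per_request: float,
                   hourly_rate: float) -> float:
    """Cost of pure per-second GPU billing for one day of traffic."""
    gpu_hours = requests_per_day * seconds_per_request / 3600
    return gpu_hours * hourly_rate


# Workload one: 50,000 SDXL requests/day at 3 s each on an A100 40GB.
print(daily_gpu_cost(50_000, 3, 4.45))           # Modal list rate   -> ~185
print(daily_gpu_cost(50_000, 3, 0.00115 * 3600)) # Replicate rate    -> ~172.5
```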
Workload two: a chat LLM serving 5 requests per second during business hours, p95 latency target of 2 seconds. Cold starts matter enormously here, so per second billing can be deceptive. On Modal with default settings, you will see cold start spikes that blow the SLA. You can pay for keep warm containers, which brings the math closer to Baseten. On Replicate, you use a dedicated deployment, which also looks like Baseten pricing. On Baseten with two warm A100 replicas plus autoscaling, you are looking at roughly $6,500 per month for the GPUs plus $800 to $1,200 in platform overhead. Once you factor in latency SLAs, the three platforms converge.
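For the dedicated replica side of that comparison, the same kind of back of the envelope math applies, using the A100 rate quoted earlier as an approximation for Baseten's GPU pricing.

```python
def monthly_dedicated_cost(replicas: int, hourly_rate: float,
                           hours_per_day: float = 24, days: int = 30) -> float:
    """Monthly GPU cost of keeping replicas warm around the clock."""
    return replicas * hourly_rate * hours_per_day * days


# Two always-on A100 40GB replicas -> ~6,408, the "roughly $6,500" above,
# before the platform overhead fee is added.
print(monthly_dedicated_cost(2, 4.45))
```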
Workload three: a speech to text batch job processing 10,000 hours of audio per week. This is Modal's sweet spot. You can fan out across 50 A10G instances, each processing a chunk, and pay only for the compute seconds consumed. At $1.97 per hour across the batch, total weekly compute is about $600 to $900 depending on model efficiency. Replicate can do this too, but Cog is less natural for orchestrated batch work. Baseten can do it, but the overhead fee is harder to justify for a workload with no latency SLA.
The pattern is consistent. Modal wins on batch and bursty workloads where cold starts are acceptable. Replicate wins on occasional inference of community models and on shipping speed. Baseten wins when latency SLAs are strict and traffic is steady enough to justify dedicated capacity. If you are running these numbers for the first time, our cost management guide has a template for building the spreadsheet end to end.
Cold starts and latency benchmarks
Cold start performance is the single biggest variable between these platforms in production, and it is also the one that founders underestimate the most. Benchmarks published by third parties and confirmed by our own tests paint a consistent picture.
On Modal, a typical 7 billion parameter LLM takes between 8 and 15 seconds to cold start from scratch, depending on weight size and whether you are using mounted volumes or baked image layers. With volume backed weights and an A100, you can often get to first token in 4 to 6 seconds. For stateless smaller models, cold starts of 2 to 4 seconds are realistic. Modal also supports container pooling, which lets you pre warm a small fleet.
Replicate cold starts are harder to characterize because they vary with model popularity. A popular community model that is always warm responds in under a second. A cold version of the same model can take 15 to 30 seconds to boot. For your own models on Replicate with dedicated deployments, cold starts behave similarly to Baseten and Modal warm paths, which is to say a few seconds.
Baseten has invested the most engineering effort here and it shows. Their documented cold start numbers for a 13 billion parameter LLM on a warm replica swap sit around 2 to 5 seconds, and they continue to push on techniques like pre loaded weight caches and snapshot restore. If your product requires sub two second time to first token from a cold state, Baseten is the most likely platform to deliver it without custom engineering on your side.
Other platforms worth knowing about in this conversation include RunPod, which offers bare metal GPU rentals and a serverless product with competitive raw GPU prices but less polished developer ergonomics. Beam Cloud targets the same serverless niche as Modal with aggressive pricing. AWS SageMaker remains the default for enterprises already deep in AWS, despite its reputation for complexity. Together AI and Fireworks AI are specialized LLM inference providers that host popular open weights at per token pricing, which can be a better fit if you are not running a custom fine tune.
Autoscaling and reliability in practice
Autoscaling behavior is where per second billing interacts with real traffic patterns, and where platforms differ most in the details. Modal scales containers up and down aggressively, often tearing down instances within minutes of idleness. This is great for cost but creates cold start exposure if your traffic is bursty. You can pin minimum containers, but that partially defeats the serverless economics.
Replicate scales based on internal signals that users do not directly control. For marketplace models, this is a feature, because you get whatever warmth the community creates. For your own deployments, it is a constraint. Dedicated deployments let you set min and max replicas, which gives you predictable behavior at the cost of per second billing purity.
Baseten exposes the richest autoscaling controls of the three. You define min and max replicas, scaling triggers based on queue depth or concurrency, scale down delays, and cooldown windows. For a product with tight latency SLAs and variable traffic, this control is genuinely valuable. You can set a floor of two replicas during peak hours, scale to eight under load, and dial back to one overnight.
Reliability is comparable across all three for standard inference workloads. Modal and Baseten publish SLAs. Replicate is more informal about uptime but has a solid track record for marketplace models. All three run on a mix of hyperscaler GPU capacity and dedicated regions. None of them will save you from a bad model, a memory leak in your serving code, or a surprise in your upstream provider's GPU availability.
One operational point that applies to all three: treat your deployment like you would any production service. Log everything, track p50 and p99 latency separately, monitor GPU utilization, and alert on queue depth. If you are new to running inference at production scale, a lot of the same principles from our on device versus cloud AI analysis carry over, especially the parts about latency budgets and user perceived performance.
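A minimal version of that latency tracking is just a rolling window of request timings with separate p50 and p99 readouts. The sketch below is plain Python and deliberately framework agnostic; in production you would export these numbers to whatever metrics stack you already run.

```python
import time
from collections import deque


class LatencyTracker:
    def __init__(self, window: int = 1000):
        self._samples = deque(maxlen=window)  # keep only recent requests

    def observe(self, seconds: float) -> None:
        self._samples.append(seconds)

    def percentile(self, p: float) -> float:
        # p50 and p99 diverge sharply when cold starts hit; track both.
        data = sorted(self._samples)
        if not data:
            return 0.0
        idx = min(len(data) - 1, round(p / 100 * (len(data) - 1)))
        return data[idx]


tracker = LatencyTracker()
start = time.perf_counter()
# ... call your model endpoint here ...
tracker.observe(time.perf_counter() - start)
print(tracker.percentile(50), tracker.percentile(99))
```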
Decision matrix: which platform for which workload
Rather than crown a single winner, here is how we guide clients through the decision based on workload shape.
You are a startup prototyping an AI feature and just need something working by Friday. Use Replicate. Pick a community model that is close to what you need, paste the API snippet, ship. You can migrate to Modal or Baseten later when volume justifies it.
You are running a custom fine tuned LLM and latency is critical. Baseten. The cold start engineering is worth the overhead fee, and the autoscaling controls will save you from a 3 a.m. incident. Budget for dedicated capacity.
You are building a multi step ML pipeline with mixed CPU and GPU work. Modal. The Python native model makes pipelines easy to express, and per second billing on mixed workloads is where Modal's cost advantage is biggest.
You are running batch inference on millions of items. Modal. Fan out is the platform's natural shape, and you will not pay for idle capacity between jobs.
You are hosting an open weights model for external API customers. Replicate. The marketplace gives you distribution, and per second billing aligns your costs with customer usage.
You are an enterprise already living inside AWS. Evaluate SageMaker honestly before choosing a specialist. You may pay more per hour, but IAM, VPC, and audit integration will save you months of compliance work.
You need the cheapest raw GPUs and you have engineers who like building platforms. RunPod or Beam Cloud. You will pay less per hour and get less out of the box.
You are calling a popular open model and do not need to customize it. Together AI or Fireworks AI at per token pricing will usually beat self hosting on cost and latency until your volume is very large.
The honest meta point is that these platforms are converging. Modal is adding more managed features, Replicate is offering more dedicated deployment options, Baseten is adding burst capacity that looks like serverless. Within 18 months, the distinctions will blur further. For now, pick the one whose default shape matches your workload, keep your packaging portable so you can migrate if you need to, and revisit the decision when your bill crosses $10,000 a month.
If you want a second opinion on which platform fits your specific model, latency SLA, and traffic shape, we review deployment architectures as part of our AI engineering engagements. Book a free strategy call and we will tell you honestly whether you are on the right platform, whether you are overpaying, and what the next twelve months of infrastructure should look like.