
How to Set Up Application Monitoring and Alerting for Your Product

Most teams discover their app is broken when a user complains. Proper application monitoring means you know before they do. Here is how to build a monitoring stack that actually works.

Nate Laquis, Founder & CEO

Why Most Products Are Flying Blind

The typical startup monitoring setup is a Slack bot that fires when the server is completely unreachable. That is not monitoring. That is a post-mortem notification. By the time your uptime checker reports a 503, users have already experienced errors, support tickets are coming in, and the incident has been live for minutes or hours.

Application monitoring done correctly tells you four things: when errors occur and what caused them, whether your infrastructure is healthy or degrading, what your users are actually experiencing, and whether your system is trending toward a problem before it becomes one. The goal is to be the first person who knows something is wrong, not the last.

Modern monitoring is built on three pillars: metrics (numerical measurements over time), logs (text records of events), and traces (records of how a request moved through your system). Most teams need all three, but you can start with two and add the third as you scale. This guide covers how to build a complete stack from scratch, what to prioritize at each growth stage, and what you should expect to pay.

Application monitoring dashboard showing real-time metrics, error rates, and system health indicators

Pillar One: Error Tracking with Sentry

Sentry is the starting point for almost every serious application. It captures unhandled exceptions, tracks error frequency and affected users, and links errors to the exact line of code and deployment that introduced them. Setup takes under an hour and pays for itself the first time it catches a bug before a user files a ticket.

What Sentry Actually Does

When an exception occurs in your application, Sentry captures the full stack trace, the request context (URL, headers, user identity if configured), the breadcrumbs leading up to the error (recent console logs, network requests, UI events), and the release version. It then groups duplicate errors together so you see "this error happened 400 times affecting 32 users" rather than 400 separate notifications.

Sentry supports every major language and framework: JavaScript, TypeScript, Python, Ruby, Go, Java, Swift, Kotlin, and more. For a Next.js application, installation is adding a Sentry package, running a wizard that generates a config file, and deploying. Total setup time is about 30 minutes including source map configuration.
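As a concrete sketch, the wizard-generated client config looks roughly like this. The environment variable names here are illustrative choices, not anything Sentry mandates:

```typescript
// sentry.client.config.ts — a minimal sketch, assuming env vars of your
// own naming. The DSN comes from your Sentry project settings.
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  // Tag every event with the deployment that produced it so Sentry can
  // attribute new errors to a specific release (e.g. a git commit SHA).
  release: process.env.NEXT_PUBLIC_RELEASE,
  environment: process.env.NODE_ENV,
  // Sample a fraction of performance transactions to stay within quota.
  tracesSampleRate: 0.1,
});
```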

Source Maps and Releases

If you deploy minified JavaScript without source maps, Sentry shows you minified stack traces that are nearly useless. Upload source maps as part of your CI/CD pipeline using the Sentry CLI or the official webpack/Vite plugins. Pair this with release tagging (each deployment gets a unique release identifier) and Sentry can tell you exactly which deployment introduced a new error and how it compares to the previous release in terms of error volume.

Sentry Pricing

The free tier covers 5,000 errors per month and 10,000 performance transactions, which is sufficient for an early-stage product. The Team plan starts at $26/month for 50,000 errors and unlocks features like release tracking, custom dashboards, and extended data retention. Business plans start at $80/month and add custom alert rules and higher quotas. For most startups, Team is the right tier through Series A.

Pillar Two: Infrastructure Monitoring with Datadog and Grafana Cloud

Error tracking tells you when your code breaks. Infrastructure monitoring tells you when your servers, databases, and services are sick. These are different problems with different tools.

Server infrastructure monitoring with network diagrams, CPU usage graphs, and database performance metrics

Datadog

Datadog is the industry standard for infrastructure monitoring at scale. Install the Datadog agent on your servers (or use their Kubernetes operator) and it immediately starts collecting CPU, memory, disk, and network metrics. Add integrations for PostgreSQL, Redis, Nginx, or any other service you run and you get deep metrics with no additional configuration. Datadog's dashboards are excellent out of the box, and their APM (Application Performance Monitoring) product traces requests across services.

The primary downside is cost. Datadog charges per host per month: $15/host for infrastructure only, $31/host when you add APM. For a 10-server production environment with APM, that is $310/month before any log ingestion, synthetics, or other add-ons. Log ingestion adds $0.10 per GB ingested plus $0.05 per GB indexed per month, which adds up fast. A medium-size application generating 50 GB of logs per day (about 1,500 GB per month) will spend roughly $225/month on logs alone at those rates.

Grafana Cloud

Grafana Cloud is the open-source-friendly alternative. It combines Grafana (dashboards), Prometheus (metrics), Loki (logs), and Tempo (traces) into a managed hosted platform. The free tier is generous: 10,000 active metrics series, 50 GB of logs, 50 GB of traces, and 500 GB of profiles per month. The Pro tier is $8/month per user plus usage-based pricing for data volume. For most startups, Grafana Cloud is significantly cheaper than Datadog for equivalent functionality.

The tradeoff is setup complexity. Datadog's integrations are mostly one-click. Grafana Cloud requires you to configure Prometheus exporters, write PromQL queries for custom dashboards, and manage your own alerting rules in Alertmanager or Grafana's alerting UI. The tooling is powerful but assumes more operator knowledge. Teams comfortable with Kubernetes and configuration-as-code will prefer Grafana Cloud. Teams that want to minimize operational overhead will prefer Datadog.

Key Infrastructure Metrics to Track

  • CPU utilization: Alert at 80% sustained for more than 5 minutes. Spikes are normal; sustained high CPU indicates a scaling or code efficiency problem.
  • Memory utilization: Alert at 85%. Unlike CPU, memory pressure often precedes OOM kills and application crashes rather than just slowdowns.
  • Database connection pool utilization: Alert when your pool is more than 70% full. Connection pool exhaustion causes cascading failures that look like application errors.
  • Disk I/O and space: Alert at 80% disk utilization and when I/O wait exceeds 20% sustained. Disk full is one of the most common and most preventable production incidents.
  • Application error rate: The percentage of HTTP requests returning 5xx errors. Alert when it exceeds 1% over a 5-minute window.
  • p95 response time: The response time that 95% of requests complete within. Alert when it degrades more than 50% from your baseline.
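The two request-level metrics in that list can be computed directly from raw request samples. A minimal sketch (the `Sample` shape is an assumption for illustration, not any particular tool's schema):

```typescript
// Sketch: computing error rate and p95 latency over a window of
// request samples, as defined in the list above.
interface Sample { status: number; durationMs: number; }

// Fraction of requests that returned a 5xx response.
function errorRate(samples: Sample[]): number {
  if (samples.length === 0) return 0;
  const errors = samples.filter((s) => s.status >= 500).length;
  return errors / samples.length;
}

// The response time that 95% of requests complete within.
function p95(samples: Sample[]): number {
  const sorted = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  if (sorted.length === 0) return 0;
  const idx = Math.min(sorted.length - 1, Math.ceil(0.95 * sorted.length) - 1);
  return sorted[idx];
}
```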

Pillar Three: Log Aggregation with Loki and BetterStack

Logs are the most detailed record of what your application is doing. When an error occurs and you need to understand why, logs give you the context that metrics and traces cannot. The challenge is that logs are voluminous, and grepping through them across multiple servers over SSH is not a viable production workflow.

Grafana Loki

Loki is Grafana's log aggregation system. Unlike Elasticsearch (used by the ELK stack), Loki indexes only log metadata (labels like service name, environment, and host) rather than the full log content. This makes it dramatically cheaper to operate at scale. Log content is stored compressed and queried using LogQL, a query language similar to PromQL. If you are already using Grafana Cloud for metrics, adding Loki costs nothing within the free tier limits and gives you a unified interface for metrics and logs.

To ship logs to Loki, install Promtail (the Loki log shipper) on your servers, or use the Loki Docker driver if you are containerized. For Kubernetes, the Grafana Loki stack Helm chart sets up collection across the entire cluster in one command. Structured logging (JSON format) makes Loki queries significantly more powerful, since you can filter by specific fields rather than regex-matching unstructured text.

BetterStack (Formerly Logtail)

BetterStack is the easier alternative. It offers managed log ingestion with a polished UI, full-text search, SQL-like query interface, and built-in alerting. The free tier includes 1 GB per month with 3 days of retention. The Starter plan is $25/month for 5 GB and 7 days retention. The Team plan is $50/month for 10 GB and 30 days retention, which is sufficient for most early-stage products.

BetterStack also includes an uptime monitoring product (covered in the next section), making it a good choice for teams that want to consolidate vendors. The BetterStack integration library covers most common languages and frameworks, and their structured logging format recommendations are solid starting points for any new project.

What to Log

Log at the right level of granularity. Too little and you cannot diagnose problems. Too much and storage costs balloon while signal drowns in noise. Every HTTP request should log: timestamp, method, path, status code, response time, user ID (if authenticated), and request ID (for tracing). Errors should log the full exception with stack trace, request context, and any relevant application state. Avoid logging sensitive data: passwords, payment card numbers, API keys, or personal health information. Structure every log line as JSON from day one. Retrofitting structured logging onto an existing application is painful.
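A minimal sketch of a structured request-log helper with a denylist for sensitive fields. The field names and denylist entries are a suggested shape, not a standard:

```typescript
// Sketch: emit one JSON log line per request, refusing to serialize
// fields that must never be logged. Extend SENSITIVE_KEYS for your app.
const SENSITIVE_KEYS = new Set(["password", "cardNumber", "apiKey"]);

function requestLog(fields: Record<string, unknown>): string {
  const safe: Record<string, unknown> = { ts: new Date().toISOString() };
  for (const [key, value] of Object.entries(fields)) {
    if (!SENSITIVE_KEYS.has(key)) safe[key] = value;
  }
  return JSON.stringify(safe);
}
```

A call like `requestLog({ method: "GET", path: "/api/orders", status: 200, durationMs: 42, userId: "u_123", requestId: "req_abc" })` produces one queryable JSON line with every field Loki or BetterStack can filter on.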

Uptime Monitoring: BetterStack, Pingdom, and UptimeRobot

Uptime monitoring is the simplest layer of your monitoring stack and the one you should set up first. It answers one question: is your application reachable from the public internet right now? This is different from your infrastructure metrics, which tell you about server health from inside your network.

How Uptime Monitors Work

Uptime monitoring services run HTTP checks against your application from multiple geographic locations every 1 to 5 minutes. If your endpoint returns a non-2xx response or fails to respond within a timeout, the service triggers an alert. Good uptime monitors check from at least 3 locations to distinguish regional outages from global ones, and they wait for 2 consecutive failures before alerting to reduce false positives.
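That logic, with a majority-of-regions rule and a two-consecutive-failures debounce, can be sketched as follows. The probe function is injected so the sketch stays tool-agnostic; region names are illustrative:

```typescript
// Sketch: alert only after 2 consecutive rounds in which a majority of
// regions fail, distinguishing a regional blip from a global outage.
type Probe = (region: string) => boolean; // true = endpoint healthy

class UptimeMonitor {
  private consecutiveFailures = 0;
  private regions: string[];
  private probe: Probe;

  constructor(regions: string[], probe: Probe) {
    this.regions = regions;
    this.probe = probe;
  }

  // Run one round of checks; returns true when an alert should fire.
  check(): boolean {
    const failing = this.regions.filter((r) => !this.probe(r)).length;
    const globallyDown = failing > this.regions.length / 2;
    this.consecutiveFailures = globallyDown ? this.consecutiveFailures + 1 : 0;
    return this.consecutiveFailures >= 2;
  }
}
```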

Tool Comparison

UptimeRobot is the entry-level option. The free tier allows 50 monitors with 5-minute check intervals. The Pro plan is $7/month per user and drops to 1-minute checks. It covers the basics well and the price is hard to beat. Suitable for side projects and early-stage products.

BetterStack's uptime product (formerly Better Uptime) starts free for 10 monitors with 3-minute intervals and scales to $20/month for 1-minute checks, unlimited monitors, and on-call scheduling. The advantage is native integration with BetterStack's log management, so you can correlate downtime events with log entries in a single UI. Their status page product is also included, which customers expect when there is an outage.

Pingdom is the enterprise option, starting at $10/month for basic plans. Its differentiator is real user monitoring (RUM), which collects performance data from actual users' browsers and reports page load times by geography and device type. Pingdom is worth the premium when you have enough traffic to make RUM data statistically meaningful, typically 10,000 or more monthly active users.

What to Monitor Beyond the Homepage

Most teams only monitor their root URL. You should also monitor: your authentication endpoint (login failures affect all users), your primary API health check endpoint, your Stripe webhook receiver (payment failures are invisible without it), your background job queue (a stopped queue is silent but critical), and any third-party status pages you depend on. Monitoring your homepage while your payment processing is broken means you are responding to the wrong incident.

Uptime monitoring dashboard with status checks, response time graphs, and incident history for web applications

Alerting Without Alert Fatigue

The most common monitoring failure is not a lack of alerts. It is too many alerts. When every alert is urgent, no alert is urgent. Engineers start ignoring Slack notifications, on-call rotations become miserable, and the team responds slower to real incidents than they did before monitoring was set up. Alert fatigue is a culture and systems problem, not a volume problem.

Severity Levels

Define explicit severity levels and stick to them. A common three-tier system:

  • P1: production down or major functionality broken for all users. Requires immediate response at any hour.
  • P2: significant degradation affecting some users or a critical-path feature. Requires acknowledgment within 30 minutes during business hours.
  • P3: non-critical issue or degraded performance. Addressed during the next business day.

Every alert must be assigned a severity level before it is deployed. "Alert if error rate is over 1%" is incomplete. "P1 alert if error rate is over 5% for 5 consecutive minutes; P2 alert if error rate is over 1% for 15 consecutive minutes" is actionable and calibrated.
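That calibrated rule can be expressed directly in code. A sketch using the exact thresholds from the example above (your own numbers will differ):

```typescript
// Sketch: severity as a function of both magnitude and duration, per the
// calibrated rule in the text. Returns null when no alert should fire.
type Severity = "P1" | "P2" | null;

function classifyErrorRate(ratePercent: number, sustainedMinutes: number): Severity {
  if (ratePercent > 5 && sustainedMinutes >= 5) return "P1";
  if (ratePercent > 1 && sustainedMinutes >= 15) return "P2";
  return null; // below threshold, or a spike too brief to page on
}
```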

Noise Reduction Techniques

Use time-based conditions rather than point-in-time checks. A single spike to 90% CPU is not an incident. Ninety percent CPU sustained for 10 minutes might be. Set alert conditions to require the threshold to be exceeded for at least 3 to 5 consecutive data points before firing. This eliminates most false positives from transient spikes.
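The consecutive-data-points condition is only a few lines of logic. A sketch:

```typescript
// Sketch: fire only when the last n data points all exceed the
// threshold, so a single transient spike never pages anyone.
function sustainedBreach(points: number[], threshold: number, n: number): boolean {
  if (points.length < n) return false;
  return points.slice(-n).every((p) => p > threshold);
}
```

In Prometheus-style alerting the same idea is expressed with a duration clause on the alert rule rather than code.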

Group related alerts. If your database goes down, you will receive alerts from every service that depends on it. Configure your alerting system to suppress downstream alerts when a root-cause alert is already active. Datadog and PagerDuty both support event correlation and alert grouping. Without this, a single database failure generates 20 simultaneous pages that all need to be acknowledged separately.
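A sketch of dependency-based suppression. The service names and dependency map are illustrative; in practice, tools like PagerDuty and Datadog express this as configuration rather than code:

```typescript
// Sketch: suppress a service's alert when an upstream dependency it
// declares is already alerting, so only the root cause pages.
const DEPENDS_ON: Record<string, string[]> = {
  api: ["postgres"],
  worker: ["postgres", "redis"],
};

function shouldPage(service: string, activeAlerts: Set<string>): boolean {
  const deps = DEPENDS_ON[service] ?? [];
  // Page only if no declared dependency is already in an alerting state.
  return !deps.some((dep) => activeAlerts.has(dep));
}
```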

Maintain a runbook for every alert. The runbook does not need to be long: one paragraph explaining what the alert means, the most likely causes, and the first 3 steps to investigate. Attach the runbook link directly to the alert notification. An engineer woken at 3 AM should be able to understand the alert and start investigating without needing to think.

On-Call Rotation with PagerDuty and Opsgenie

PagerDuty and Opsgenie (now Atlassian Incident Management) both handle on-call scheduling, escalation policies, and multi-channel notifications. PagerDuty's free tier covers a single user. The Professional plan is $21/user/month and supports unlimited teams and escalation policies. Opsgenie's free tier covers 5 users, making it better for small teams. The Standard plan is $9/user/month.

For a team of 2 to 5 engineers, Opsgenie's free or Standard tier is sufficient. For larger teams with complex escalation requirements (dedicated SREs, multi-team on-call coverage), PagerDuty's more mature tooling is worth the premium. Both integrate with Datadog, Grafana, Sentry, and most other monitoring tools via webhooks.

Rotate on-call responsibility weekly and track alert volume per person over time. If one engineer is receiving 30 pages per week and another is receiving 5, your alert configuration is poorly calibrated. The goal is fewer than 5 actionable pages per engineer per on-call week. More than that and engineers burn out, which means your monitoring program is making your team less reliable, not more.

Recommended Monitoring Stacks by Growth Stage

The right monitoring stack depends on your user count, revenue, and team size. Over-investing in monitoring infrastructure at pre-launch wastes time and money. Under-investing at scale creates incidents that cost far more than the tooling would have.

Pre-Launch to 1,000 Users: Lean and Functional

At this stage, your goal is basic visibility with minimal setup time. Install Sentry on the free tier for error tracking. Add UptimeRobot's free tier for uptime checks on your main URLs. Use your cloud provider's native metrics (AWS CloudWatch, Google Cloud Monitoring, or Vercel/Railway analytics) for infrastructure. Total cost: $0 to $26/month. Time to set up: 2 to 4 hours.

This stack will not catch everything, but it will catch the errors and outages that affect users. At this scale, you are still moving fast enough that deep observability would slow you down more than the incidents cost you.

1,000 to 10,000 Users: Add Log Aggregation and Better Alerting

Upgrade Sentry to Team for release tracking and better alert controls ($26/month). Add BetterStack for logs ($25/month) and uptime monitoring with 1-minute intervals ($20/month). Set up Grafana Cloud free tier for infrastructure metrics and connect it to your cloud provider via the Prometheus integration. Configure basic alerting rules for error rate, response time, and disk usage. Add PagerDuty or Opsgenie for on-call routing. Total cost: $70 to $150/month. Time to set up: 1 to 2 days.

10,000 to 100,000 Users: Full Observability Stack

At this scale, incidents are expensive. A 1-hour outage with 50,000 MAU is a real brand and revenue event. Invest in a comprehensive stack. Upgrade to Datadog with APM ($31/host/month for 5 to 10 hosts) for infrastructure metrics and distributed tracing. Keep Sentry Team or upgrade to Business depending on error volume. Use Datadog Log Management or Loki on Grafana Cloud for log aggregation. Add Pingdom for RUM data. Implement formal on-call rotation with weekly rotations and escalation policies. Total cost: $500 to $1,500/month depending on host count and log volume. Time to implement fully: 1 to 2 weeks.

Beyond 100,000 Users: Dedicated Reliability Engineering

At this scale, monitoring configuration is a full-time responsibility. You need SLOs (Service Level Objectives) defined for each critical service, error budget tracking, automated runbooks for common incidents, and capacity planning dashboards. Datadog's full platform typically runs $3,000 to $10,000/month for larger deployments. Many teams at this scale hire a dedicated SRE or platform engineer whose primary job is keeping the observability stack accurate and actionable.
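Error budgets are simple arithmetic: a 99.9% availability SLO over a 30-day window leaves a budget of 0.1% of that window, about 43.2 minutes of allowed downtime. A sketch:

```typescript
// Sketch: how many minutes of downtime an SLO permits per window.
// e.g. errorBudgetMinutes(99.9, 30) ≈ 43.2
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - sloPercent / 100);
}
```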

Implementation Costs and Next Steps

The cost of setting up monitoring is modest relative to the cost of the incidents it prevents. Here is a realistic budget breakdown for a professional implementation.

  • Monitoring setup for MVP or early-stage product (4 to 8 hours, $500 to $1,500): Sentry installation with source maps and release tracking, UptimeRobot or BetterStack for uptime checks on key endpoints, basic Slack alerting, and documentation of alert thresholds. Ongoing cost: $0 to $50/month.
  • Full observability stack for growth-stage product (2 to 5 days, $3,000 to $8,000): Datadog or Grafana Cloud setup, Sentry with custom dashboards, structured log aggregation, on-call rotation with PagerDuty or Opsgenie, runbooks for common alerts, and a formal severity classification system. Ongoing cost: $200 to $800/month.
  • Enterprise monitoring and SRE setup (2 to 4 weeks, $15,000 to $40,000): SLO definition and error budget tracking, distributed tracing across microservices, automated incident response playbooks, chaos engineering integration, and executive-level reliability dashboards. Ongoing cost: $1,000 to $5,000/month in tooling.

The True Cost of Skipping Monitoring

A single production outage that lasts 4 hours for a SaaS product with 5,000 users typically generates 50 to 200 support tickets, a churn increase of 2 to 5% in the following month, and 8 to 16 engineering hours for investigation and remediation. At a $150/hour blended engineering cost, that is $1,200 to $2,400 in engineering time alone, before accounting for lost revenue and customer trust. A $3,000 monitoring setup that catches that incident before users notice it pays for itself on the first incident it prevents.

The teams that ship reliably are not the ones with the most engineers. They are the ones that know what is happening in their systems at all times. If you are launching a product or struggling with an existing monitoring setup, we can help you build something that works. Book a free strategy call to talk through your observability requirements.


Tags: application monitoring, Sentry, Datadog, alerting, observability
