How to Create a Production Dashboard in Looker
Creating a good production dashboard in Looker is about turning a flood of system data into a clear, actionable story of your application's health. The goal is a go-to view that tells your engineering, DevOps, and product teams what's working, what's breaking, and why. This guide will walk you through the essential steps to plan, model, and build a production monitoring dashboard that people will actually use.
What is a Production Dashboard and Why Do You Need One?
A production dashboard is a centralized, real-time control panel for your live application or system. It visualizes key performance indicators (KPIs) related to stability, performance, and usage. Instead of SSH-ing into servers or trailing logs in a command line when something feels slow, you get an immediate visual summary.
A well-built monitoring dashboard helps you:
Detect Issues Proactively: Spot trends like a creeping error rate or rising memory usage before they cause a full-blown outage.
Reduce Mean Time to Resolution (MTTR): When an incident does occur, a dashboard provides the critical context needed to diagnose the root cause faster.
Track Service Level Objectives (SLOs): Measure and report on reliability goals like uptime and latency, ensuring you’re meeting promises to your users.
Understand User Impact: Correlate system performance with user activity. Does a spike in server latency lead to fewer completed checkouts?
Before You Build: Planning Your Production Dashboard
Jumping straight into building visuals is tempting, but a bit of forethought will make your dashboard infinitely more useful. A great dashboard starts with great questions.
Define Your Key Metrics
Start by identifying the handful of metrics that truly define the health of your system. Avoid the urge to track everything; focus instead on signals that are directly tied to performance and stability.
Here are some fundamental metrics to consider:
Latency: How long do requests take to process? Track the average, but more importantly, track the long tail with p95 or p99 (95th or 99th percentile) latency. This shows the experience for your slowest 5% or 1% of requests.
Error Rate: What percentage of requests are failing? It’s crucial to distinguish between different error types (e.g., server-side HTTP 5xx errors vs. client-side 4xx errors).
Throughput: How much traffic is your system handling? Usually measured in requests per second (RPS) or transactions per minute.
Uptime / Availability: What percentage of the time is your service operational? This is often the highest-level SLO.
Resource Utilization: How are your servers holding up? Keep an eye on CPU, memory, and disk space usage to anticipate scaling needs.
Identify Your Data Sources
Your monitoring data probably lives in several places. The goal is to get it into a SQL database that Looker can connect to. Common sources include:
Application Logs: Data from services like Logstash, Fluentd, or directly from your application.
Cloud Monitoring Tools: Metrics from AWS CloudWatch, Google Cloud Monitoring, or Azure Monitor.
Application Performance Management (APM): Data from tools like Datadog, New Relic, or Dynatrace.
Database Tables: Direct performance metrics from your production database.
Typically, you’ll use an ETL/ELT pipeline (like Fivetran or Stitch) to consolidate this data into a data warehouse like BigQuery, Snowflake, or Redshift, where Looker can then access it.
Know Your Audience
An on-call Site Reliability Engineer (SRE) and a Product Manager look for different things. Tailor sections of your dashboard to their needs.
Engineers & SREs: Need granular system-level data. Latency percentiles, specific error codes, CPU utilization by host, and recent failing log entries are vital.
Product Managers: Care more about user impact. They’ll want to see active users, feature adoption rates alongside error spikes, and geographic performance.
Modeling Your Data with LookML
With planning complete, it’s time to build the data model in LookML. This semantic layer is where Looker’s power truly shines, allowing you to define your business logic once and reuse it everywhere.
Create the View
First, create a new view file (e.g., production_logs.view.lkml) from your primary table in the data warehouse. This view will serve as the foundation for all your metrics.
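As a starting point, a view file might look like the following minimal sketch. The table name `analytics.production_logs` is an assumption for illustration; substitute your actual warehouse schema and table.

```lookml
# production_logs.view.lkml
# Hypothetical base view over a consolidated request-log table.
view: production_logs {
  sql_table_name: analytics.production_logs ;;  # assumed schema.table — adjust to your warehouse
}
```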
Define Your Dimensions
Dimensions are the “group by” fields - the attributes of your data. Define dimensions for all the important contextual fields in your logs.
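Inside that view, dimensions map directly onto your log columns. A sketch, assuming columns like `status_code`, `service_name`, `latency_ms`, and a `created_at` timestamp exist in your table:

```lookml
  # Raw attributes from the log table (column names are assumptions).
  dimension: status_code {
    type: number
    sql: ${TABLE}.status_code ;;
  }

  dimension: service_name {
    type: string
    sql: ${TABLE}.service_name ;;
  }

  dimension: latency_ms {
    type: number
    sql: ${TABLE}.latency_ms ;;
  }

  # Derived flag: HTTP 5xx responses count as server errors.
  dimension: is_server_error {
    type: yesno
    sql: ${status_code} >= 500 ;;
  }

  # Time dimension group for trending and filtering by timeframe.
  dimension_group: created {
    type: time
    timeframes: [minute, hour, date, week]
    sql: ${TABLE}.created_at ;;
  }
```

The `is_server_error` flag is worth defining once here: every error-rate measure and chart can then reuse the same definition instead of re-implementing the 5xx check.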
Define Your Measures
Measures are your aggregated calculations (or KPIs). This is where you translate your key metrics from the planning stage into reusable fields. You'll define measures for request counts, average latency, and of course, your error rate.
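Building on the dimensions above, the key measures from the planning stage might be sketched like this (names are illustrative; the `percentile` measure type requires a dialect that supports percentile functions, such as BigQuery or Snowflake):

```lookml
  measure: total_requests {
    type: count
  }

  # Counts only rows flagged as 5xx by the is_server_error dimension.
  measure: server_error_count {
    type: count
    filters: [is_server_error: "Yes"]
  }

  # Error rate as a percentage of all requests; NULLIF guards against
  # division by zero when a timeframe has no traffic.
  measure: server_error_rate {
    type: number
    sql: 1.0 * ${server_error_count} / NULLIF(${total_requests}, 0) ;;
    value_format_name: percent_2
  }

  measure: average_latency_ms {
    type: average
    sql: ${latency_ms} ;;
  }

  # Tail latency: the 95th-percentile request duration.
  measure: p95_latency_ms {
    type: percentile
    percentile: 95
    sql: ${latency_ms} ;;
  }
```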
Building Your Production Dashboard Visuals
With your LookML model ready, creating the visualizations becomes a simple process of dragging and dropping fields in a Looker Explore.
From your Explore, start building out individual tiles and saving them to a new dashboard titled "Production Health Summary" or similar.
1. Top-Line KPIs
Start with the most critical numbers at the very top. Use Single Value visualizations for these. Every visitor should immediately see:
Server Error Rate (Last Hour)
p95 Latency (Last Hour)
Requests Per Minute
Availability (Last 7 Days)
2. Time Series Trends
Next, show how these metrics are trending over time. These are best visualized as Line Charts. Create tiles for:
Requests & Errors Over Time: Plot total_requests and server_error_count on the same chart (using dual axes if the scales are different). This immediately shows if error spikes correlate with traffic spikes. Use a 6 or 12-hour default timeframe.
Latency Trends Over Time: Plot average_latency_ms and p95_latency_ms on another line chart. Watching the p95 metric is crucial, as it's a better indicator of user-perceived slowness.
3. Error Breakdowns
It's not enough to know the error rate; you need to know what's causing the errors. Use a Bar Chart or Table to show:
Errors by Service: If you run a microservices architecture, show server_error_count grouped by the service_name dimension.
Errors by Status Code: Group server_error_count by status_code to see if a specific type of error (like a 502 Bad Gateway) is dominating.
4. Resource Utilization
Add Gauge visualizations to show a snapshot of current resource health, like CPU Utilization % and Memory Utilization %. Set color thresholds (green, yellow, red) to make it obvious when levels are critical.
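If your team prefers to version-control dashboards alongside the model, the tiles above can also be defined as a LookML dashboard. A minimal sketch of one tile, assuming a model named `production` and the `production_logs` Explore and fields from earlier (all names are assumptions):

```lookml
# production_health.dashboard.lookml — hypothetical LookML dashboard file
- dashboard: production_health_summary
  title: Production Health Summary
  layout: newspaper
  elements:
  - name: server_error_rate_last_hour
    title: Server Error Rate (Last Hour)
    model: production            # assumed model name
    explore: production_logs
    type: single_value
    fields: [production_logs.server_error_rate]
    filters:
      production_logs.created_time: "1 hour"   # relative timeframe filter
```

Defining dashboards in LookML makes them reviewable in pull requests, which suits a production-monitoring asset that multiple teams depend on.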
Making the Dashboard Actionable
A pretty dashboard is nice, but an actionable one is a game-changer. Integrate features that turn insights into action.
Add Interactive Filters
Add dashboard-level filters at the top. The most important one is Timeframe. Other useful filters could be service_name or data_center_region. This allows users to drill down from a global view into a specific problem area on their own, without needing to edit the tiles.
Set Up Alerts
Don't rely on someone having the dashboard open 24/7. Use Looker's built-in alerting to turn it into an active monitoring powerhouse. Create an alert on your Server Error Rate tile, set a threshold (e.g., if the value is above 2%), and schedule it to send a notification directly to a Slack channel or email list. This moves your team from a reactive to a proactive state.
Use Text Tiles for Context
The biggest mistake teams make is assuming everyone understands what they’re looking at. Add Text Tiles with simple Markdown to:
Define metrics: Add a tile explaining, "p95 Latency represents the experience for 95% of our users. We aim to keep this below 500ms."
Provide links to documentation: Include links to the team's incident response runbook, a relevant Confluence page, or the on-call schedule.
Explain who to contact: Add notes like, "For issues with the checkout service, contact the #ecomm-eng team on Slack."
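Since Looker Text Tiles render Markdown, one tile can cover all three points above. A sample (the links are placeholders to fill in for your team):

```markdown
### About this dashboard
**p95 Latency** represents the experience for 95% of our users. Target: **below 500 ms**.

- Incident response runbook: [link to your runbook]
- On-call schedule: [link to your rotation]
- Checkout service issues: contact **#ecomm-eng** on Slack
```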
Final Thoughts
Building a valuable production dashboard in Looker is about more than just charts; it's about translating complex system observability signals into clear, shared understanding. By carefully planning your metrics, building a robust LookML model, and adding actionable features like alerts and filters, you create a powerful tool that helps your team build more reliable software.
That process - connecting scattered data sources and modeling them into clear insights - is a challenge many teams face, especially in marketing and sales. Instead of complex model-building, we created Graphed to help teams instantly unify data from sources like Google Analytics, Shopify, and Salesforce and build dashboards using simple conversational language. If you ever find yourself struggling to pull together cross-platform performance reports, it’s a much faster way to get answers without the steep learning curve.