
Senior Software Engineer at Nvidia
High cardinality in observability data, driven by the proliferation of unique metric labels, log attributes, and trace identifiers, is a growing concern in data-rich, AI-driven environments. In a recent survey, over 70% of DevOps teams flagged cardinality as a top operational challenge, citing increased costs and degraded performance. This session dives into the hidden impact of cardinality explosion, where uncontrolled growth in time series data can increase storage requirements by 5–10x and inflate costs by hundreds of thousands of dollars annually. Real-world case studies, including one Kubernetes environment where just three new labels triggered a major spike in query latency, illustrate the practical risks.

You’ll learn actionable strategies that work: strategic label design that reduced time series counts by 90% at a major e-commerce platform; pre-aggregation and dynamic sampling that cut storage needs by 70% without losing insight; and lifecycle management tactics such as tiered storage and selective retention that enable cost reductions of 50% or more. We’ll also explore tooling, such as Prometheus relabeling and cardinality limiters, that proactively prevents metric bloat. The session closes with a look at proactive monitoring using meta-metrics and anomaly detection to spot issues before they spiral.

Whether you’re scaling observability in a growing AI stack or wrangling complexity in distributed systems, you’ll leave with a robust playbook for managing cardinality without sacrificing data depth or system reliability.
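As a taste of the relabeling technique mentioned above, here is a minimal Prometheus scrape-config sketch that strips a high-cardinality label before ingestion. The job name, target, and label/metric names are illustrative assumptions, not from a real deployment:

```yaml
scrape_configs:
  - job_name: "app"              # hypothetical job name
    static_configs:
      - targets: ["app:9090"]    # hypothetical target
    metric_relabel_configs:
      # Drop the unbounded per-request "request_id" label, which would
      # otherwise create a new time series for every request.
      - action: labeldrop
        regex: request_id
      # Drop entire metrics whose names match a known-noisy pattern.
      - source_labels: [__name__]
        regex: "debug_.*"
        action: drop
```

Because `metric_relabel_configs` runs after scraping but before storage, the dropped label and metrics never become stored series, which is what makes relabeling a proactive guard against cardinality growth rather than an after-the-fact cleanup.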