Sunday, March 29, 2026

Cisco IT’s network observability transformation

From data overload to enhanced digital resilience. Cisco IT unified telemetry data across its massive network, enabling automation to handle 99.998% of alerts and achieving zero major incidents, empowering engineers to proactively manage network health at scale.

The data problem: overload, limited insight, and silos

Cisco IT manages a vast, complex environment with hundreds of thousands of assets – including computers, switches, access points, home devices, and a wide array of applications and services – as well as external systems like internet service and cloud providers. Each of these assets generates telemetry, which makes it a challenge to effectively monitor and make sense of high volumes of diverse data across our environment.

In our previous network operations model, we outsourced a function responsible for network observability monitoring, second-level support for triage, and technical expertise. This outsourced function relied on traditional monitoring methods involving manual processes and siloed dashboards.

As a result, we lacked control to tailor how telemetry was processed, routed, and actioned, which left us with generic metrics. For example, while we could see that the network was operational, we had limited visibility into critical areas like user experience and application performance.

Recognizing this data problem, we decided to bring the outsourced network operations function in-house. This gave us full control to design and implement a modernized network observability strategy, enabling us to better leverage our wealth of telemetry and ultimately strengthen Cisco’s digital resilience.

However, this shift wasn’t just about changing team responsibilities. It also meant losing our existing network observability system and requiring our smaller team to manage the massive volume of telemetry data.

To add to the pressure, due to contractual obligations, we were given just 40 days to make this transition and build an entirely new network observability system.

Inside the blueprint: Building a modern observability system

The task at hand wasn’t just to replace and mirror the outsourced network operations and legacy observability system, but to build something better. We wanted to build a system that could handle massive volumes of data, deliver deeper, actionable, and proactive insights, and enable a leaner team to be more productive.

To achieve this, we designed a network observability model focused on three key areas:

  1. Collect: Gathers telemetry and metrics from thousands of devices, applications, and platforms – across both owned, internal environments and unowned, external environments.
  2. Monitor: Uses tools and algorithms to process and analyze the collected data, helping to identify patterns, anomalies, and potential issues across the network.
  3. Act: Initiates human or automated responses when identified problems meet predefined rule criteria, enabling timely remediation.

Network observability model: collect, monitor, act

Figure 1: Cisco IT’s network observability model
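The three stages can be pictured as a small pipeline. The sketch below is an illustration only and not Cisco IT's implementation: the `Event` and `Rule` types, their field names, the action labels, and the simple threshold check are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Event:
    """One telemetry sample; field names are illustrative assumptions."""
    source: str   # a device, application, or external service
    metric: str   # e.g. "latency_ms"
    value: float


@dataclass(frozen=True)
class Rule:
    """A predefined rule: respond when `metric` exceeds `threshold`."""
    metric: str
    threshold: float
    action: str   # hypothetical labels: "automate" or "page_engineer"


def collect(feeds: Iterable[Iterable[Event]]) -> list[Event]:
    """Collect: merge telemetry from owned and unowned feeds into one stream."""
    return [event for feed in feeds for event in feed]


def monitor(events: Iterable[Event], rules: Iterable[Rule]) -> list[tuple[Event, Rule]]:
    """Monitor: pair each event with every rule whose criteria it meets."""
    rules = list(rules)
    return [(e, r) for e in events for r in rules
            if e.metric == r.metric and e.value > r.threshold]


def act(matches: Iterable[tuple[Event, Rule]]) -> list[str]:
    """Act: turn each match into a human or automated response."""
    return [f"{r.action}:{e.source}:{e.metric}" for e, r in matches]


# Example run with made-up data.
owned = [Event("switch-01", "latency_ms", 12.0), Event("ap-17", "latency_ms", 250.0)]
external = [Event("isp-link", "packet_loss_pct", 4.5)]
rules = [Rule("latency_ms", 100.0, "automate"),
         Rule("packet_loss_pct", 2.0, "page_engineer")]

responses = act(monitor(collect([owned, external]), rules))
print(responses)
```

A real system would replace the threshold comparison with whatever analysis the monitoring tools provide; the point here is only the separation between gathering data, evaluating it against rules, and triggering a response.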

While this system is run by a centralized networking team, data and rule creation are democratized – allowing engineers and service owners across IT to define and customize their own alert rules via GitOps. This ensures the system adapts to unique and evolving business needs.
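To make the GitOps idea concrete: alert rules can live as small data files in a repository, with an automated check validating each change before it is merged. The sketch below assumes a hypothetical JSON rule schema (`name`, `owner`, `metric`, `threshold`, `action`); the article does not describe Cisco IT's actual rule format or tooling.

```python
import json

# Hypothetical schema for a rule file kept in a Git repository.
REQUIRED_FIELDS = {
    "name": str,
    "owner": str,
    "metric": str,
    "threshold": (int, float),
    "action": str,
}
ALLOWED_ACTIONS = {"automate", "notify"}  # assumed action labels


def validate_rule(rule: dict) -> list[str]:
    """Return a list of problems; an empty list means the rule is acceptable."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        value = rule.get(field)
        if value is None:
            errors.append(f"missing field: {field}")
        elif isinstance(value, bool) or not isinstance(value, expected):
            # bool is excluded explicitly because it is a subclass of int.
            errors.append(f"wrong type for field: {field}")
    action = rule.get("action")
    if isinstance(action, str) and action not in ALLOWED_ACTIONS:
        errors.append(f"unknown action: {action}")
    return errors


# A rule as a service owner might submit it in a pull request.
good_rule = json.loads("""
{"name": "wan-latency", "owner": "wan-team", "metric": "latency_ms",
 "threshold": 100, "action": "automate"}
""")

# A rule that a pre-merge check should reject.
bad_rule = json.loads("""
{"name": "wan-loss", "metric": "packet_loss_pct",
 "threshold": "high", "action": "reboot"}
""")

print(validate_rule(good_rule))
print(validate_rule(bad_rule))
```

Running a check like this on every proposed change is one way a central team could keep a shared rule set consistent while still letting many teams contribute.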

To operate this network management strategy, we use a combination of Cisco solutions:

  • Cisco’s network management solutions, including Catalyst Center, SD-WAN Manager, Meraki Dashboard, and Nexus Dashboard, collect and monitor detailed telemetry, performance metrics, and security status data on their respective assets. This provides comprehensive visibility and assurance, in addition to their other core capabilities for managing network devices.
  • ThousandEyes provides real-time, end-to-end visibility into network and application performance. It also extends this visibility into external, unowned environments such as the public internet and cloud services. These granular insights feed into the observability system, giving us a complete view of user experience and connectivity – no matter where employees are working.
  • Splunk Cloud Platform acts as a unified operations dashboard – aggregating and visualizing telemetry data from the above solutions that was previously siloed. It provides real-time monitoring, so engineers can quickly focus on the most critical alerts.

Together, Splunk and ThousandEyes allow us in Cisco IT to proactively monitor, analyze, and act on millions of events daily.

Cisco IT network observability system tools

Figure 2: Cisco IT’s observability system tools and integrations

Automation is a critical component of our network observability strategy. By feeding telemetry data and incident outcomes into our Large Language Models (LLMs) and automation systems, we can efficiently process and prioritize millions of daily alerts to reduce engineer workload and speed up response times, improving end-user experience.

The payoff: Enhanced resilience, efficiency, and beyond

From the beginning, we recognized that this initiative would involve significant upfront work. However, the results have far exceeded our initial expectations.

Since deploying this new observability strategy and system:

  • 0 major incidents have occurred, down from 3-4 per quarter previously.
  • 10x more telemetry data is being monitored, enabling broader and deeper insights into network health, application performance, and user experience at a new level of detail.
  • 4x greater visibility, with daily alert volume increasing from hundreds of thousands to 4 million, resulting in earlier detection and proactive resolution of potential issues before they escalate.
  • Automation now handles 99.998% of the 4 million alerts generated daily, minimizing the need for manual intervention and enabling faster identification and resolution of issues through real-time, automated triage and response workflows.
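To put the automation figure in perspective, the two numbers above can be combined in a quick calculation. This assumes the daily volume is exactly 4 million and that every alert not handled by automation reaches an engineer; the resulting count is derived here and is not stated in the article.

```python
from fractions import Fraction

daily_alerts = 4_000_000
# Exact rational arithmetic avoids floating-point rounding in the percentage.
automated_share = Fraction("99.998") / 100

automated = daily_alerts * automated_share
remaining = daily_alerts - automated

print(int(automated), int(remaining))
```

Under those assumptions, about 80 alerts per day would be left for manual attention out of 4 million.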

Perhaps most importantly, this effort laid a foundation that enables us to continuously scale our AI-driven automation and extend AIOps capabilities across the broader Cisco IT environment.

Lessons learned: Strategies that made the difference

Modernizing our observability strategy and system was a fast-paced journey, filled with valuable lessons. Here are some key takeaways and strategies to help other teams looking to do the same:

  • Collaborative ownership: Bring in subject matter experts from across the organization, share data widely, and build a democratized culture where everyone has a stake in observability and operational success.
  • Collect telemetry from everywhere: Comprehensive monitoring starts with capturing data across your entire environment.
  • Data normalization and enrichment: Unifying diverse data sources is critical for holistic visibility. Invest in a high-quality, well-maintained CMDB to keep your inventory and data accurate. Use your CMDB to enrich alerts with business context, ownership, and criticality.
  • Rule experimentation: Encourage democratized teams to develop and refine alerting and automation rules to keep alert volumes manageable and relevant.
  • AI-driven automation: Feed enriched data into automation and LLMs to streamline remediation and take steps toward self-healing operations.
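As a rough illustration of the normalization and enrichment takeaway, the sketch below maps alerts from differently shaped sources onto one schema and then attaches CMDB context. The field names, the in-memory `CMDB` dictionary, and the fallback values are all assumptions for the example; a real CMDB would be queried through its own interface.

```python
# Stand-in for a CMDB lookup; keys and fields are made up for this example.
CMDB = {
    "switch-01": {"owner": "campus-network", "service": "office-lan", "criticality": "high"},
    "ap-17": {"owner": "wireless-team", "service": "guest-wifi", "criticality": "low"},
}

UNKNOWN_CONTEXT = {"owner": "unassigned", "service": "unknown", "criticality": "unknown"}


def normalize(raw: dict) -> dict:
    """Map source-specific field names onto one common alert schema."""
    return {
        "asset": (raw.get("hostname") or raw.get("device") or "").lower(),
        "message": raw.get("msg") or raw.get("description") or "",
    }


def enrich(alert: dict, cmdb: dict) -> dict:
    """Attach ownership, service, and criticality from the CMDB record, if any."""
    context = cmdb.get(alert["asset"], UNKNOWN_CONTEXT)
    return {**alert, **context}


# Two sources reporting with different field names.
from_tool_a = {"hostname": "SWITCH-01", "msg": "link down"}
from_tool_b = {"device": "printer-9", "description": "offline"}

enriched_a = enrich(normalize(from_tool_a), CMDB)
enriched_b = enrich(normalize(from_tool_b), CMDB)
print(enriched_a)
print(enriched_b)
```

Assets missing from the inventory surface as "unassigned" here, which is one simple way to make gaps in CMDB coverage visible rather than silently dropping context.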

We’re thrilled and proud of the work and results that our teams have achieved, but our journey doesn’t end here. We’ll continue to iterate, improve, and advance our AI-driven automation capabilities.
