Algorithmic Service Operations for IT Incidents
- Series E+
BigPanda Inc. is a leader in Algorithmic Service Operations for enterprise IT. Our machine learning platform intelligently automates and scales Service Operations to meet the complex demands of the modern datacenter. We turn IT noise from fragmented clouds, teams, applications and monitoring tools into actionable insights to speed the resolution of IT incidents. Many of the world's largest enterprises rely on BigPanda to power their Service Operations.
Board Members and Advisors
AI Technology Stack
The BigPanda Machine Learning Engine processes your company’s alert stream to identify patterns that may improve correlation. Those patterns appear as suggestions in the Correlation Patterns settings menu, where an admin can review and activate them.

At its core, BigPanda’s Algorithmic Correlation relies on pattern recognition. A pre-configured list of patterns is matched against the alert stream to identify alert clusters in real time. Our approach looks at information in four dimensions to classify alerts into Incidents: time, topology (e.g., datacenter, rack, cluster), context (e.g., criticality, team, customer impact) and alert type (e.g., network, storage, application).

A single pattern describes the general properties of a cluster: timespan, common alert attributes, and a filter. Below are examples of different patterns.

- Connectivity alerts: Alerts triggered by devices attached to a single network in a 15-minute timespan
- Load-related alerts: Alerts triggered by multiple servers supporting a single database in a 2-hour timespan
- Common application alerts: Alerts triggered by tools like Splunk and AppDynamics in a 30-minute timespan

In general, correlation patterns are created by Administrators and the BigPanda Customer Success team. The Machine Learning Engine supplements them by autonomously creating its own set of patterns. These patterns are entered into the system as suggestions and can optionally be activated. The suggested patterns are designed to provide effective correlation on their own. They can also be modified as desired to complement an existing set of in-use patterns. The end result is broader correlation reach, with more signal and higher-quality Incidents.

BigPanda’s Machine Learning Engine generates correlation patterns automatically based on historical user data.
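To make the pattern structure concrete (a timespan, common alert attributes, and a filter), here is a minimal Python sketch of how one pattern could group alerts into Incidents. It is an illustration under stated assumptions, not BigPanda's implementation: the `CorrelationPattern` class, the `correlate` function, the alert field names (`ts`, `type`, `network`, `host`) and the rule that the window is measured from an Incident's first alert are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical alert shape: a dict with a "ts" timestamp in seconds plus tag fields.
Alert = Dict[str, Any]


@dataclass
class CorrelationPattern:
    name: str
    window_seconds: int                    # timespan of a cluster
    common_attributes: Tuple[str, ...]     # attributes clustered alerts must share
    alert_filter: Callable[[Alert], bool]  # which alerts the pattern applies to


def correlate(alerts: List[Alert], pattern: CorrelationPattern) -> List[List[Alert]]:
    """Group alerts that pass the filter, share the common attribute values,
    and arrive within the window measured from the incident's first alert."""
    incidents: List[List[Alert]] = []
    open_by_key: Dict[tuple, List[Alert]] = {}
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        if not pattern.alert_filter(alert):
            continue
        if any(attr not in alert for attr in pattern.common_attributes):
            continue
        key = tuple(alert[attr] for attr in pattern.common_attributes)
        current = open_by_key.get(key)
        if current is not None and alert["ts"] - current[0]["ts"] <= pattern.window_seconds:
            current.append(alert)
        else:
            # No open incident for this key, or the window has expired: start a new one.
            current = [alert]
            open_by_key[key] = current
            incidents.append(current)
    return incidents


# The "Connectivity alerts" example: one network, 15-minute timespan.
connectivity = CorrelationPattern(
    name="Connectivity alerts",
    window_seconds=15 * 60,
    common_attributes=("network",),
    alert_filter=lambda a: a.get("type") == "connectivity",
)

alerts = [
    {"ts": 0, "type": "connectivity", "network": "net-a", "host": "sw1"},
    {"ts": 300, "type": "connectivity", "network": "net-a", "host": "sw2"},
    {"ts": 400, "type": "connectivity", "network": "net-b", "host": "sw9"},
    {"ts": 2000, "type": "connectivity", "network": "net-a", "host": "sw3"},
    {"ts": 500, "type": "storage", "network": "net-a", "host": "nas1"},
]
incidents = correlate(alerts, connectivity)
```

With this sample data the two early `net-a` alerts form one Incident, the `net-b` alert forms its own, the late `net-a` alert falls outside the 15-minute window and opens a new Incident, and the storage alert is excluded by the filter.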
Once a monitoring tool is integrated, the review process begins, and the first automatically generated pattern appears in the Correlation Patterns settings menu in a few days. How quickly the first pattern is generated depends on the richness and size of the available data. Over time, as more data flows through the system, additional patterns are recommended at an increased and variable rate.

Once the Machine Learning Engine suggests a pattern, Administrators can activate it, reject it, or customize it manually in the editor. The Real-Time Preview in the patterns editor and the Unified Search capabilities are instrumental in assessing the utility of these patterns and how they can be modified, if necessary, to produce the correlation results the business requires.

BigPanda’s Machine Learning Engine is unsupervised and does not require training. It works by clustering the alert stream into well-defined Incidents, and it runs autonomously in the background as soon as relevant data is present. Unlike supervised machine learning, it does not require human interaction or consistent input for its upkeep and efficacy.

A risk of employing unsupervised machine learning is the lack of disclosure and insight into its actions. Loss of control and unpredictability can occur when machine learning is granted authority to implement changes on its own. Our unsupervised approach maintains transparency and consistency in forming Incidents because it does not enact changes itself. The suggestion model for generated patterns gives Administrators full discretion over, and control of, changes in a Production environment.
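The suggestion model described above can be sketched as a small state machine in which a generated pattern has no effect until an Administrator acts on it. This is a minimal illustration, not BigPanda's API: the `PatternSuggestion` class, the `Status` values and the `patterns_in_use` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Status(Enum):
    SUGGESTED = "suggested"
    ACTIVE = "active"
    REJECTED = "rejected"


@dataclass
class PatternSuggestion:
    name: str
    window_seconds: int
    status: Status = Status.SUGGESTED  # generated patterns start as suggestions only

    def _require_suggested(self) -> None:
        if self.status is not Status.SUGGESTED:
            raise ValueError(f"{self.name!r} was already {self.status.value}")

    def customize(self, window_seconds: int) -> None:
        """Administrator edits the suggestion before deciding on it."""
        self._require_suggested()
        self.window_seconds = window_seconds

    def activate(self) -> None:
        """Administrator puts the pattern into use."""
        self._require_suggested()
        self.status = Status.ACTIVE

    def reject(self) -> None:
        """Administrator discards the suggestion."""
        self._require_suggested()
        self.status = Status.REJECTED


def patterns_in_use(suggestions: List[PatternSuggestion]) -> List[PatternSuggestion]:
    # Only explicitly activated patterns take part in correlation.
    return [s for s in suggestions if s.status is Status.ACTIVE]


suggestions = [
    PatternSuggestion("Connectivity alerts", 15 * 60),
    PatternSuggestion("Load-related alerts", 2 * 60 * 60),
]
in_use_before_review = patterns_in_use(suggestions)  # nothing is active yet

suggestions[0].customize(window_seconds=10 * 60)
suggestions[0].activate()
suggestions[1].reject()
in_use_after_review = patterns_in_use(suggestions)
```

The design choice this illustrates is that the engine only ever produces objects in the `SUGGESTED` state, and every transition out of that state is an explicit Administrator action.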
Deep Learning, Machine Learning, Natural Language Processing
Online Learning, Semi-Supervised Learning, Supervised Learning, Unsupervised Learning
Algorithms and Techniques
Neural Networks, Time Series Analysis
Amazon Web Services
Frameworks, Libraries and Tools
AWS SageMaker, Docker, Kafka, Kubernetes, MLflow, NumPy
Python, Scala, SQL