If you have played Age of Empires, you know how important it is to explore the map: it lets you manage resources proactively and plan the moves that keep your civilization strong. In the same way, ML observability is essential when working on ML services.
What is Observability?
I would define it as the tooling, or a solution, that lets you proactively debug a system. In essence, it means exploring properties and patterns that were not determined in advance.
Observability and monitoring might feel like the same thing, but there is a subtle difference:
- Monitoring: You are looking at properties that were defined in advance
- Observability: You are exploring properties that were not defined in advance, with monitoring as one of the means to that end
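To make the distinction concrete, here is a minimal sketch in Python. The event fields (`latency_ms`, `region`, `model_version`, `error`), the sample data, and the threshold are all made up for illustration. The first function checks one predefined property, while the second lets you slice the same events by whichever field you decide to investigate later.

```python
from collections import defaultdict

# Hypothetical request events; the field names are illustrative only.
EVENTS = [
    {"latency_ms": 120, "region": "eu", "model_version": "v1", "error": False},
    {"latency_ms": 950, "region": "us", "model_version": "v2", "error": True},
    {"latency_ms": 130, "region": "eu", "model_version": "v2", "error": False},
    {"latency_ms": 990, "region": "us", "model_version": "v2", "error": True},
]


def monitor_error_rate(events, threshold=0.25):
    """Monitoring: check one property that was defined in advance."""
    rate = sum(e["error"] for e in events) / len(events)
    return rate, rate > threshold


def explore(events, field):
    """Observability: group the same events by any field, chosen after the fact."""
    groups = defaultdict(list)
    for event in events:
        groups[event[field]].append(event)
    # Error rate per distinct value of the chosen field.
    return {
        key: sum(e["error"] for e in group) / len(group)
        for key, group in groups.items()
    }


rate, alert = monitor_error_rate(EVENTS)
by_region = explore(EVENTS, "region")
by_version = explore(EVENTS, "model_version")
```

The alert only tells you that something is wrong. Grouping by `region` or `model_version` is what helps you find out where.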
Definition
ML observability is the ability to monitor and understand the performance, behavior, and outputs of machine learning models in real time, allowing teams to:
- Proactively identify potential issues and anomalies
- Facilitate timely interventions
- Mitigate risks
The Three Pillars of ML Observability
System Monitoring
System monitoring tracks the performance of the infrastructure on which ML services are deployed. It involves metrics such as:
- CPU & memory usage
- Network traffic
- Disk space
- Service performance
It's necessary for ensuring the reliability, availability, and efficiency of the infrastructure.
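As a rough sketch of what collecting such metrics can look like, the Python snippet below uses only the standard library. The metric names, the `THRESHOLDS` values, and the helper functions are assumptions for illustration. A real deployment would more likely rely on a dedicated metrics agent.

```python
import os
import shutil


def collect_system_metrics(path="."):
    """Gather a few host-level metrics using only the standard library."""
    disk = shutil.disk_usage(path)
    metrics = {"disk_used_pct": 100.0 * disk.used / disk.total}
    # os.getloadavg is only available on Unix-like systems and may fail.
    if hasattr(os, "getloadavg"):
        try:
            load_1m, _, _ = os.getloadavg()
            metrics["load_per_cpu"] = load_1m / (os.cpu_count() or 1)
        except OSError:
            pass
    return metrics


# Hypothetical alert thresholds; tune them for your own infrastructure.
THRESHOLDS = {"disk_used_pct": 90.0, "load_per_cpu": 1.5}


def find_breaches(metrics, thresholds=THRESHOLDS):
    """Return the metrics whose values exceed their configured threshold."""
    return {
        name: value
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    }


breaches = find_breaches(collect_system_metrics())
```

Memory usage and network traffic are left out here because the standard library has no portable way to read them.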
Inference Monitoring
Inference monitoring focuses on evaluating and auditing the real-time performance of deployed ML models in production. It involves:
- Tracking incoming requests to models
- Tracking the accuracy of the model's predictions
It helps identify issues and improves overall trust in the system, in line with responsible AI.
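One way to picture this is a thin wrapper around the model call. The sketch below is an assumption-laden illustration: `InferenceMonitor`, its method names, and the idea that ground-truth labels arrive later keyed by a `request_id` are all hypothetical, and `model` is any callable.

```python
import time
from collections import deque


class InferenceMonitor:
    """Tracks request volume, latency, and rolling accuracy for one model."""

    def __init__(self, window=1000):
        self.request_count = 0
        self.latencies_ms = deque(maxlen=window)
        self.outcomes = deque(maxlen=window)  # True when a prediction was correct
        # Predictions waiting for ground truth; unbounded in this sketch.
        self.pending = {}

    def predict(self, model, request_id, features):
        """Call the model, recording the request and how long it took."""
        start = time.perf_counter()
        prediction = model(features)
        self.latencies_ms.append((time.perf_counter() - start) * 1000.0)
        self.request_count += 1
        self.pending[request_id] = prediction
        return prediction

    def record_label(self, request_id, label):
        """Compare a stored prediction with the ground truth once it arrives."""
        if request_id not in self.pending:
            return
        prediction = self.pending.pop(request_id)
        self.outcomes.append(prediction == label)

    def rolling_accuracy(self):
        """Accuracy over the most recent labelled predictions, or None."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)


monitor = InferenceMonitor()
is_positive = lambda x: x > 0  # stand-in for a real model
monitor.predict(is_positive, "req-1", 3)
monitor.record_label("req-1", True)
```

The bounded `deque` keeps the accuracy figure focused on recent traffic, which is what matters for real-time auditing.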
Model Monitoring
Model monitoring tracks the long-term accuracy of ML models. It involves monitoring key metrics such as:
- Accuracy
- Precision
- Recall
- F1-score
Beyond these metrics, it also involves detecting and diagnosing any drift.
This helps ensure the continued performance and reliability of ML models over time.
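The metrics above, plus a simple drift check, can be sketched in plain Python as follows. This assumes binary labels where 1 is the positive class. The Population Stability Index (PSI) is used here as one possible drift statistic, not the only option, and the bin count and epsilon are arbitrary choices.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def population_stability_index(reference, current, bins=10):
    """PSI between a reference sample and a current sample of one feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


training_feature = list(range(100))
live_feature = [v + 50 for v in training_feature]  # deliberately shifted
drift_score = population_stability_index(training_feature, live_feature)
```

A score of zero means the two samples fall into the bins identically, and larger values indicate a bigger shift. The cut-off that should trigger an alert depends on the use case.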
Just like in Age of Empires, successful ML systems require constant exploration and strategic monitoring to maintain a strong, reliable "civilization" of models.