🔍 ML Observability | ur_dighe 🏴‍☠️

If you have played Age of Empires, you know how important it is to explore the map: it lets you manage resources proactively and plan the moves that keep your civilization strong. ML observability plays a similar role when you work on ML services.

🤔 What is Observability?

I would define it as tooling or a solution that lets you proactively debug a system. In essence, it means exploring properties and patterns that were not defined in advance.

It might feel that observability and monitoring are the same thing, but there is a subtle difference:

Monitoring: You watch properties that were defined in advance
Observability: You explore properties that were not defined in advance, and use monitoring along the way to understand the system better

📝 Definition

ML observability is the ability to monitor and understand the performance, behavior, and outputs of machine learning models in real time, which allows teams to:

Proactively identify potential issues and anomalies
Facilitate timely interventions
Mitigate risks

🏛️ The Three Pillars of ML Observability

🖥️ System Monitoring

System monitoring tracks the performance of the infrastructure where ML services are deployed. It involves metrics such as:

CPU & memory usage
Network traffic
Disk space
Service performance

It’s necessary for ensuring reliability, availability, and efficiency of infrastructure.
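As a rough illustration of the metrics listed above, here is a minimal sketch that takes a snapshot of a few host-level signals and times a single call as a stand-in for service performance. It uses only the Python standard library, so memory and network figures are left out; in practice those usually come from a metrics agent or a library such as psutil. The function names `collect_system_metrics` and `timed_call` are made up for this example.

```python
import os
import shutil
import time


def collect_system_metrics(path="."):
    """Take one snapshot of basic host metrics using only the standard library."""
    disk = shutil.disk_usage(path)
    metrics = {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "disk_used_fraction": disk.used / disk.total,
    }
    # Load average is a coarse CPU-pressure signal and is only available
    # on Unix-like systems, so it is skipped where it cannot be read.
    if hasattr(os, "getloadavg"):
        try:
            metrics["load_avg_1m"] = os.getloadavg()[0]
        except OSError:
            pass
    return metrics


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, latency_in_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    print(collect_system_metrics())
    # sum() stands in for a call to a deployed model endpoint.
    _, latency = timed_call(sum, range(1_000_000))
    print(f"latency: {latency:.4f}s")
```

A real setup would export these values to a time-series store on a schedule rather than printing them, and would alert when they cross a threshold.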

🔍 Inference Monitoring

Inference monitoring focuses on evaluating and auditing the real-time performance of deployed ML models in production. It involves:

Tracking incoming requests to models
Checking the accuracy of model predictions

It helps surface issues early and builds trust in the model's predictions, which supports responsible AI.
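One way to picture the two activities above is a small request log that stores every prediction and later attaches the true label once it is known. This is only a sketch under simple assumptions: records live in memory, and the class `InferenceLogger` and its methods are hypothetical names for this example, not part of any library.

```python
import time
import uuid


class InferenceLogger:
    """Keeps one record per prediction so it can be audited later."""

    def __init__(self):
        self.records = {}

    def log_prediction(self, features, prediction):
        """Store an incoming request and the model's answer; return its id."""
        request_id = str(uuid.uuid4())
        self.records[request_id] = {
            "timestamp": time.time(),
            "features": features,
            "prediction": prediction,
            "label": None,  # ground truth usually arrives later
        }
        return request_id

    def log_label(self, request_id, label):
        """Attach the true outcome to an earlier prediction."""
        self.records[request_id]["label"] = label

    def accuracy(self):
        """Accuracy over the records that already have a label, else None."""
        labelled = [r for r in self.records.values() if r["label"] is not None]
        if not labelled:
            return None
        correct = sum(r["prediction"] == r["label"] for r in labelled)
        return correct / len(labelled)


if __name__ == "__main__":
    logger = InferenceLogger()
    first = logger.log_prediction({"length": 120}, "spam")
    second = logger.log_prediction({"length": 15}, "ham")
    logger.log_label(first, "spam")
    logger.log_label(second, "spam")
    print(logger.accuracy())
```

Keeping the request id alongside each record is what makes the later audit possible: a label that arrives hours or days afterwards can still be matched to the exact input the model saw.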

📊 Model Monitoring

Model monitoring tracks the long-term accuracy of ML models. It involves monitoring key metrics such as:

Accuracy
Precision
Recall
F1-score

It also covers detecting and diagnosing any drift.

This helps ensure the continued performance and reliability of ML models over time.
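To make the metrics above concrete, here is a minimal sketch that computes accuracy, precision, recall, and F1-score for a binary classifier, plus one possible drift score. The post does not name a drift method, so the Population Stability Index (PSI) is used here purely as an example; the bin count and the function names are assumptions for illustration.

```python
import math


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between a reference sample and a current sample of one feature.

    Bin edges come from the reference range; values outside it are
    clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


if __name__ == "__main__":
    print(binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
    training = list(range(100))
    production = [v + 50 for v in training]
    print(population_stability_index(training, production))
```

Running the metrics on a rolling window of labelled production data, and the drift score on incoming features against the training sample, gives a simple view of whether the model is degrading and a hint at why.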

Just like in Age of Empires, successful ML systems require constant exploration and strategic monitoring to maintain a strong, reliable “civilization” of models.