🔍 ML Observability

If you have played Age of Empires, you know how important it is to explore the map: it lets you manage resources proactively and plan the moves that keep your civilization strong. In the same way, ML observability is essential when working on ML services.

🤔 What is Observability?

I would define it as the tooling, or a solution, that allows one to proactively debug a system. At its core, it means exploring properties and patterns that were not determined in advance.

Observability and monitoring might feel like the same thing, but there is a slight difference:

  • Monitoring: You are looking at properties that were defined in advance
  • Observability: You are exploring properties that were not defined in advance, with monitoring as one way to gain better observability
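
To make the distinction concrete, here is a minimal sketch in Python. The event fields (`latency_ms`, `region`, `model_version`) and both function names are hypothetical and chosen only for illustration. The first function checks one property fixed ahead of time, while the second slices the same events by whatever field turns out to matter while debugging.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request events; the field names are illustrative only.
EVENTS = [
    {"latency_ms": 120, "region": "eu", "model_version": "v1"},
    {"latency_ms": 135, "region": "eu", "model_version": "v2"},
    {"latency_ms": 110, "region": "us", "model_version": "v1"},
    {"latency_ms": 640, "region": "us", "model_version": "v2"},
]


def latency_alert(events, threshold_ms=500):
    """Monitoring: check one property that was defined in advance."""
    return any(e["latency_ms"] > threshold_ms for e in events)


def mean_latency_by(events, field):
    """Observability: slice the same events by any field, chosen at debug time."""
    groups = defaultdict(list)
    for e in events:
        groups[e[field]].append(e["latency_ms"])
    return {key: mean(values) for key, values in groups.items()}
```

The alert only says that something is slow. Calling `mean_latency_by(EVENTS, "region")` and then `mean_latency_by(EVENTS, "model_version")` is the exploratory step that narrows down where the slowness comes from.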

📝 Definition

ML observability is the ability to monitor and understand the performance, behavior, and outputs of machine learning models in real time, which allows teams to:

  • Proactively identify potential issues and anomalies as they arise
  • Facilitate timely interventions
  • Mitigate risks

🏛️ The Three Pillars of ML Observability

🖥️ System Monitoring

System monitoring tracks the performance of the infrastructure where ML services are deployed. It involves metrics such as:

  • CPU & memory usage
  • Network traffic
  • Disk space
  • Service performance

It’s necessary for ensuring the reliability, availability, and efficiency of the infrastructure.
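
As a rough sketch of what collecting such metrics can look like, the snippet below uses only the Python standard library. The function names and the 0.9 disk threshold are assumptions made for this example. Memory and network counters are not covered here, since the standard library has no portable call for them; a third-party package such as psutil is a common choice for those.

```python
import os
import shutil
import time


def system_snapshot(path="."):
    """Collect a few host-level metrics using only the standard library."""
    disk = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    snapshot = {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "disk_used_fraction": disk.used / disk.total if disk.total else 0.0,
    }
    # os.getloadavg() exists only on Unix-like systems and may raise OSError.
    if hasattr(os, "getloadavg"):
        try:
            snapshot["load_avg_1m"] = os.getloadavg()[0]
        except OSError:
            pass
    return snapshot


def disk_alert(snapshot, max_used_fraction=0.9):
    """Flag the host when disk usage crosses a chosen (illustrative) threshold."""
    return snapshot["disk_used_fraction"] > max_used_fraction
```

In practice a snapshot like this would be taken on a schedule and shipped to a metrics store, so that trends are visible alongside the point-in-time alert.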

🔍 Inference Monitoring

Inference monitoring focuses on evaluating and auditing the real-time performance of ML models deployed in production. It involves:

  • Tracking incoming requests to models
  • Tracking the accuracy of model predictions

It helps identify issues and builds the overall trust that responsible AI depends on.
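
One way to picture the two activities above is a thin wrapper around the prediction call. This is a minimal in-memory sketch: `InferenceLog`, its methods, and `toy_model` are hypothetical names, and a real service would send these records to a log store rather than keep them in a list. Labels usually arrive later than predictions, so accuracy is computed only over the records that already have one.

```python
import time
from dataclasses import dataclass, field


@dataclass
class InferenceLog:
    """In-memory record of requests made to a model."""
    records: list = field(default_factory=list)

    def observe(self, predict_fn, features):
        """Run the model, then record the request, prediction, and latency."""
        start = time.perf_counter()
        prediction = predict_fn(features)
        latency_ms = (time.perf_counter() - start) * 1000
        self.records.append({
            "features": features,
            "prediction": prediction,
            "latency_ms": latency_ms,
            "label": None,  # ground truth is attached later, if it arrives
        })
        return prediction

    def add_label(self, index, label):
        """Attach ground truth to a logged request once it becomes available."""
        self.records[index]["label"] = label

    def accuracy(self):
        """Accuracy over the records that already have a label, else None."""
        labelled = [r for r in self.records if r["label"] is not None]
        if not labelled:
            return None
        correct = sum(r["prediction"] == r["label"] for r in labelled)
        return correct / len(labelled)


def toy_model(features):
    # Hypothetical stand-in for a deployed model.
    return int(sum(features) > 1.0)
```

Because every request is stored with its inputs, the same log can also be used for auditing individual predictions after the fact.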

📊 Model Monitoring

Model monitoring tracks the long-term quality of ML models. It involves tracking key metrics and watching for drift:

  • Accuracy
  • Precision
  • Recall
  • F1-score
  • Drift detection and diagnosis

This helps ensure that ML models keep performing reliably over time.
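
The metrics listed above can be computed from logged labels and predictions in a few lines. The sketch below assumes binary classification with `1` as the positive class. For drift it uses the Population Stability Index (PSI), which is one common choice and not something this post prescribes; the function names, bin count, and `eps` floor are illustrative assumptions.

```python
import math


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def population_stability_index(reference, current, bins=5, eps=1e-6):
    """PSI for one numeric feature, using equal-width bins on the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

A PSI of zero means the binned distributions match, and larger values mean the current data has moved further from the reference. Comparing a training-time sample against a recent production window, feature by feature, is a simple way to start diagnosing where drift comes from.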

Just like in Age of Empires, successful ML systems require constant exploration and strategic monitoring to maintain a strong, reliable “civilization” of models.