Monitoring NVIDIA GPU Usage in Kubernetes With Prometheus

If you’ve followed the growth of ML/AI development in recent years, you’re likely aware that GPUs are used to accelerate the intensive calculations required for tasks like deep learning. Using GPUs with Kubernetes lets you extend the scalability of K8s to ML applications.

However, Kubernetes cannot natively schedule GPU resources, so this approach requires a third-party device plugin. There is also no built-in way to report utilization, per-device request statistics, or other metrics. That information is an important input when analyzing GPU efficiency and cost, and GPUs can be a significant expenditure.
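
As a concrete illustration, here is a minimal pod spec that requests a GPU, assuming a device plugin such as NVIDIA's k8s-device-plugin has already been deployed to the cluster. The plugin advertises each GPU to the kubelet as the extended resource `nvidia.com/gpu`, which containers can then request under `resources.limits`. The pod name and image tag below are illustrative, not prescribed by any particular setup.

```yaml
# Minimal sketch: a pod that requests one GPU.
# Assumes the NVIDIA device plugin is already running, exposing GPUs
# to the scheduler as the extended resource "nvidia.com/gpu".
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example                                # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-test
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative CUDA base image
      command: ["nvidia-smi"]                      # prints GPU info, then exits
      resources:
        limits:
          nvidia.com/gpu: 1                        # request one GPU via the extended resource
```

If the pod schedules successfully, `kubectl logs gpu-example` should show `nvidia-smi` output for the assigned device. Note that extended resources like `nvidia.com/gpu` cannot be overcommitted, so the request and limit must be equal (specifying only the limit, as above, is the usual pattern).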