
@yuvipanda yuvipanda commented Dec 5, 2025

While working on 2i2c-org/infrastructure#7166, I realized
that for per-user alerts to be useful, it's critical that we have information about
per-user limits. Currently limits can be overridden per user, and in the future we may have
groups too.

My initial suggestion
to improve the experience was to run dirsize-exporter in the same pod (so the volume is mounted as XFS, not NFS),
but that won't give us per-user limits. We should still do that, since dirsize-exporter offers
additional info (the total number of files and the oldest file in a dir, both helpful)
that isn't present in the XFS info.

This PR adds:

  • Infrastructure for Prometheus metrics in the reconciler script
  • Two metrics per directory: total_size_bytes (total used size) and hard_limit_bytes
    (hard limit).
  • The default namespace for the metric names is dirsize, so that existing graphs in Grafana
    will work regardless of the source of the metrics. This is important since not everyone
    using JupyterHub will be using jupyterhub-home-nfs
  • Switch to using a Kubernetes service to tell Prometheus what to scrape, rather than
    using pod annotations. This allows us to scrape multiple ports on the same pod, so
    we can continue using node-exporter for disk metrics while getting usage metrics
    out of jupyterhub-home-nfs, and in the future get other metrics out of
    dirsize-exporter
  • Fix a bug that caused storageClassName to not be set on the PVC. This was necessary
    for local testing.
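
As a rough illustration of what the two metrics look like on the wire, here is a minimal, dependency-free sketch that renders them in the Prometheus text exposition format. It is not the code in this PR (which may well use a client library instead): the `DirUsage` type, the `render_metrics` function and the `directory` label name are all assumptions made for the example. Only the metric names and the `dirsize` default namespace come from the description above.

```python
from typing import Dict, NamedTuple


class DirUsage(NamedTuple):
    """Hypothetical container for one directory's quota information."""
    used_bytes: int        # total size currently used in the directory
    hard_limit_bytes: int  # hard quota limit for the directory


def _escape_label(value: str) -> str:
    # The text format requires backslash, double quote and newline
    # to be escaped inside label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(usage: Dict[str, DirUsage], namespace: str = "dirsize") -> str:
    """Render both gauges for every directory as Prometheus text format."""
    metrics = (
        ("total_size_bytes", "Total used size of the directory in bytes",
         lambda u: u.used_bytes),
        ("hard_limit_bytes", "Hard quota limit of the directory in bytes",
         lambda u: u.hard_limit_bytes),
    )
    lines = []
    for name, help_text, getter in metrics:
        full_name = f"{namespace}_{name}"
        lines.append(f"# HELP {full_name} {help_text}")
        lines.append(f"# TYPE {full_name} gauge")
        # Sort so the output is stable between scrapes.
        for directory in sorted(usage):
            label = _escape_label(directory)
            value = getter(usage[directory])
            lines.append(f'{full_name}{{directory="{label}"}} {value}')
    return "\n".join(lines) + "\n"
```

Because the namespace is a parameter that defaults to `dirsize`, a dashboard that divides `dirsize_total_size_bytes` by `dirsize_hard_limit_bytes` would not need to know which exporter produced the series.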

An alternative I considered is to add support for XFS quotas in prometheus-dirsize-exporter.
However, since home-nfs puts the projid and projects files in non-standard locations,
that would be a little more complex. Plus we may not have info about which dirs to report
on and which to skip (we only want to report on the paths we manage). So I kept it in here.
Happy to reconsider too.
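
To make the "paths we manage" point concrete, here is a small sketch of joining the two XFS project files. It assumes the conventional formats described in xfs_quota(8): `name:id` lines in projid and `id:path` lines in projects. The function names are hypothetical, and the functions take file contents rather than paths because the locations home-nfs uses are not given here. Skipping blank and `#` lines is a defensive assumption.

```python
from typing import Dict, List


def _entries(text: str):
    # Yield (left, right) pairs from 'left:right' lines, skipping blanks
    # and '#' comment lines (assumed, as a defensive measure).
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        left, _, right = line.partition(":")
        yield left, right


def parse_projid(text: str) -> Dict[str, int]:
    """Map project name -> project id from 'name:id' lines."""
    return {name: int(projid) for name, projid in _entries(text)}


def parse_projects(text: str) -> Dict[int, List[str]]:
    """Map project id -> directories from 'id:path' lines."""
    mapping: Dict[int, List[str]] = {}
    for projid, path in _entries(text):
        # One project id may cover more than one directory tree.
        mapping.setdefault(int(projid), []).append(path)
    return mapping


def managed_paths(projid_text: str, projects_text: str) -> Dict[str, List[str]]:
    """Return name -> directories, only for projects named in projid."""
    ids = parse_projid(projid_text)
    paths = parse_projects(projects_text)
    return {name: paths[pid] for name, pid in ids.items() if pid in paths}
```

An external exporter would need both files (and their locations) to do this join, while the reconciler already knows which directories it manages.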

We should add an option to https://github.com/2i2c-org/prometheus-dirsize-exporter to
allow disabling the dirsize total_size_bytes metric so we don't have duplicate metrics.

Validated that this works and that Prometheus does pick up the metrics.

TODO

  • Make the namespace of the metrics configurable
  • Expose the metrics via the helm chart
  • Document the metrics in the README

Otherwise we don't get all changes in the dir when running
tests or testing manually

This allows us to scrape multiple different sets of metrics
from the same pod, unlike pod annotations! This way, we don't
have to re-implement all metrics to come from one endpoint!

@yuvipanda yuvipanda requested review from agoose77 and sunu and removed request for agoose77 December 6, 2025 00:01
yuvipanda added a commit to 2i2c-org/prometheus-dirsize-exporter that referenced this pull request Dec 6, 2025
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this pull request Dec 6, 2025