[job_monitor] Watch and warn of jobs failing (almost) instantly by host

Philipp Schmidt requested to merge feat/monitor-broken-nodes into master

Description

Every so often we see an accumulation of failed jobs due to a fundamental problem with a Maxwell node, typically caused by GPFS issues. Such jobs fail almost instantly, so SLURM keeps scheduling new jobs onto the broken node, where they continue to fail quickly and accumulate even more failed jobs. This can have quite drastic consequences when the calibration system is under heavy load.

As discussed in Zulip, it is unfortunately tricky to detect such issues within the job. Therefore the idea is to instead look at unusually small runtimes on the job monitor side.

This MR is a first draft of that approach: it counts the number of such unusually short runtimes per host within a configurable time window. For now it just prints a warning, but the same mechanism could later be used to (temporarily) blocklist affected hosts in the webservice.
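As a rough sketch of the counting mechanism described above (all names, thresholds, and the class itself are illustrative placeholders, not the actual container type in this MR):

```python
import time
from collections import defaultdict, deque

# Hypothetical values; the MR makes the window configurable.
MIN_RUNTIME = 10.0   # seconds below which a job counts as failing "instantly"
WINDOW = 600.0       # sliding time window in seconds
THRESHOLD = 5        # short-runtime failures per host before warning


class FailureWindow:
    """Count suspiciously short job runtimes per host in a sliding window."""

    def __init__(self, window=WINDOW, threshold=THRESHOLD):
        self.window = window
        self.threshold = threshold
        self._events = defaultdict(deque)  # host -> timestamps of short runtimes

    def record(self, host, runtime, now=None):
        """Record a finished job; return True if the host looks broken."""
        if runtime >= MIN_RUNTIME:
            return False
        now = time.monotonic() if now is None else now
        events = self._events[host]
        events.append(now)
        # Drop events that have fallen out of the time window.
        while events and now - events[0] > self.window:
            events.popleft()
        return len(events) >= self.threshold


fw = FailureWindow()
for t in range(5):
    suspicious = fw.record("max-exfl001", runtime=1.2, now=float(t))
print(suspicious)  # fifth short runtime within the window trips the threshold
```

A per-host deque pruned on each insertion keeps the memory footprint bounded by the number of events inside the window, which is one plausible shape for the container type the unit test covers.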

How Has This Been Tested?

Not yet tested in place, pending discussion; there is a unit test for the container type used.

Relevant Documents (optional)

Types of changes

  • New feature (non-breaking change which adds functionality)

Reviewers

@kluyvert
