[job_monitor] Watch and warn of jobs failing (almost) instantly by host

Philipp Schmidt requested to merge feat/monitor-broken-nodes into master

Description

Every so often we see an accumulation of failed jobs due to a fundamental problem with a Maxwell node, typically caused by GPFS issues. Such jobs fail almost instantly, so SLURM keeps scheduling new jobs onto the broken node, where they continue to fail quickly and accumulate even more failed jobs. This can have quite drastic consequences when the calibration system is under heavy load.

As discussed in Zulip, it is unfortunately tricky to detect such issues within the job. Therefore the idea is to instead look at unusually small runtimes on the job monitor side.

This MR is a first draft of that approach: it counts the number of such unusually short runtimes per host within a configurable time window. For now it just prints a warning, but the same mechanism could later be used to (temporarily) blocklist affected hosts in the webservice.
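As a rough sketch of the counting mechanism described above (all names, thresholds, and the class itself are illustrative placeholders, not the actual container type in this MR):

```python
import time
from collections import defaultdict, deque

# Hypothetical values; the MR makes the window configurable.
MIN_RUNTIME = 10.0   # seconds below which a job counts as failing "instantly"
WINDOW = 600.0       # sliding time window in seconds
THRESHOLD = 5        # short-runtime failures per host before warning


class FailureWindow:
    """Count suspiciously short job runtimes per host in a sliding window."""

    def __init__(self, window=WINDOW, threshold=THRESHOLD):
        self.window = window
        self.threshold = threshold
        self._events = defaultdict(deque)  # host -> timestamps of short runtimes

    def record(self, host, runtime, now=None):
        """Record a finished job; return True if the host looks broken."""
        if runtime >= MIN_RUNTIME:
            return False
        now = time.monotonic() if now is None else now
        events = self._events[host]
        events.append(now)
        # Drop events that have fallen out of the time window.
        while events and now - events[0] > self.window:
            events.popleft()
        return len(events) >= self.threshold


fw = FailureWindow()
for t in range(5):
    suspicious = fw.record("max-exfl001", runtime=1.2, now=float(t))
print(suspicious)  # fifth short runtime within the window trips the threshold
```

A per-host deque pruned on each insertion keeps the memory footprint bounded by the number of events inside the window, which is one plausible shape for the container type the unit test covers.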

How Has This Been Tested?

Not yet tested in place, pending discussion; there is a unit test for the container type used.

Relevant Documents (optional)

Types of changes

  • New feature (non-breaking change which adds functionality)

Reviewers

@kluyvert
