Eddie is being slow/unresponsive to job submission requests

Unplanned; Complete

Overview

Affected services
  • eddie
At Risk Time
12:00 PM, 08-Jul-20 — 05:30 PM, 09-Jul-20
Description
UPDATE 15:40 - We have put a mitigation in place that seems to have stabilised the filesystem I/O, so Eddie should now be considered AT RISK. Everything should be working as expected. Apologies for the interruption.
We are experiencing unusually high load on Eddie at the moment, and as a result job submission and other q* commands are failing. We are investigating and will respond as soon as we can. Currently running jobs are unaffected, but queued jobs will not start running until service returns to normal.

Technical Information

Work summary
The eddie filesystem fs8 is experiencing unusually high load and we need to investigate the root cause.
Technical summary
Eddie is not responding to q* commands
Affecting
Servers / Hardware
eddie
Criticality
low
Impact description
Running Eddie jobs are unaffected; queued jobs will not run until service is restored, and new jobs cannot be queued.
Change Reference
Additional comments

Request, Authorise and Publish

Requested by Mike Wallis on 08-Jul-2020
Authorised by Research Services
Publish to IS Website (Service Status and Alerts page): Yes
Published by Mike Wallis to eddie-users@mlist.is.ed.ac.uk; cmvm-research-computing-group@mlist.is.ed.ac.uk


Additional Details

Risk classification
Risk assessment
Other people involved
Notes for future reference