After I added some database targets to Enterprise Manager 12c, an incident was opened for failed jobs on one of the newly added databases. We have some jobs scheduled to run every 5 minutes on that database, and of course we have an error-handling and alerting mechanism that notifies us if any of them fails, so there shouldn’t have been any failed jobs. I checked the run logs of all the scheduler jobs for failed runs, and also checked the old-style jobs for broken ones. Everything seemed OK, so I waited for Enterprise Manager to poll the failed jobs again, but the incident stayed open. Then I wondered how Enterprise Manager checks for failed jobs. I found that it uses the following query:
    SELECT SUM(broken), SUM(failed)
      FROM (SELECT DECODE(broken, 'N', 0, 1) broken,
                   DECODE(NVL(failures,0), 0, 0, 1) failed
              FROM dba_jobs
            UNION ALL
            SELECT DECODE('N', 'N', 0, 1) broken,
                   DECODE(NVL(failure_count,0), 0, 0, 1) failed
              FROM dba_scheduler_jobs )
So instead of checking whether any scheduler job failed recently, it just checks the “failure_count” of the scheduler jobs. Even though those jobs hadn’t failed for months, these counters are not reset automatically unless you disable and re-enable the job, or recreate it. So I called DBMS_SCHEDULER.DISABLE and then DBMS_SCHEDULER.ENABLE for each job that had failed at least once. This reset the failure_count of those jobs, and the incident was closed automatically.
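For reference, here is a rough sketch of how the offending jobs could be listed and reset in one go. It is not the exact script I ran. The schema name APP_OWNER is a made-up placeholder, and the owner filter is an assumption I added so that jobs owned by SYS and other Oracle-supplied schemas are left alone. Adjust it to your own environment.

```sql
-- List scheduler jobs whose failure counter is non-zero.
-- APP_OWNER is a hypothetical schema name; replace it with your own.
SELECT owner, job_name, enabled, state, failure_count, last_start_date
  FROM dba_scheduler_jobs
 WHERE NVL(failure_count, 0) > 0
   AND owner = 'APP_OWNER';

-- Disable and re-enable each of those jobs.
-- Only jobs that are currently enabled are touched, so jobs that were
-- disabled on purpose stay disabled.
BEGIN
  FOR j IN (SELECT owner, job_name
              FROM dba_scheduler_jobs
             WHERE NVL(failure_count, 0) > 0
               AND enabled = 'TRUE'
               AND owner = 'APP_OWNER')
  LOOP
    -- Quote owner and job name so mixed-case names are handled correctly.
    DBMS_SCHEDULER.DISABLE(name => '"' || j.owner || '"."' || j.job_name || '"');
    DBMS_SCHEDULER.ENABLE(name => '"' || j.owner || '"."' || j.job_name || '"');
  END LOOP;
END;
/
```

Note that DBMS_SCHEDULER.DISABLE raises an error if the job happens to be running at that moment, unless it is called with force => TRUE. With jobs that run every 5 minutes, it may be easier to simply rerun the block. Rerunning the first query afterwards should show whether the counters were actually reset on your version.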