Just an extract from the Solaris fault manager documentation.
The fault manager associates one of the following states with every FMRI:
* ok: The resource is present and in use and has no known problems.
* unknown: The resource is not present or not usable but has no known problems. This might indicate the resource has been disabled or unconfigured by an administrator.
* degraded: The resource is present and usable, but one or more problems have been diagnosed.
* faulted: The resource is present but is not usable because one or more unrecoverable problems have been diagnosed. The resource has been disabled to prevent further damage to the system and requires human intervention.
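The four states above can be modelled as a small enumeration. This is only a sketch for scripts that post-process fault manager output; the names ResourceState, is_usable, and has_diagnosed_problem are made up for illustration and are not part of any Solaris API.

```python
from enum import Enum


class ResourceState(Enum):
    """The four states listed above (hypothetical helper type)."""
    OK = "ok"
    UNKNOWN = "unknown"
    DEGRADED = "degraded"
    FAULTED = "faulted"


# States in which the resource is described as present and usable.
_USABLE = {ResourceState.OK, ResourceState.DEGRADED}

# States in which one or more problems have been diagnosed.
_DIAGNOSED = {ResourceState.DEGRADED, ResourceState.FAULTED}


def is_usable(state: str) -> bool:
    """Return True if a resource in this state can still be used."""
    return ResourceState(state) in _USABLE


def has_diagnosed_problem(state: str) -> bool:
    """Return True if at least one problem has been diagnosed."""
    return ResourceState(state) in _DIAGNOSED
```

With this mapping, a degraded resource is both usable and diagnosed, which is the one case that is easy to overlook when only checking for faulted.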
Fault Manager Command-Line Tools
The Solaris implementation of the fault manager includes several command-line tools to observe and modify the behavior of fmd(1M) and its modules. The tools an administrator will use most often are fmadm(1M), fmdump(1M), and fmstat(1M).
The fmadm(1M) utility can view, load, and unload modules and view and update the resource cache. It provides system administrators with a way to display every resource that fmd(1M) believes to be faulty. The most common fmadm(1M) subcommands (see the fmadm(1M) man page for complete details) are:
* config: Display the configuration, including the module name, version, and description of each component module.
* faulty [-ai]: Display the list of resources currently believed to be faulted. The FMRI, resource state, and UUID of the diagnosis are listed for each resource. By default, the fmadm faulty command only lists output for resources that are currently present and faulty. If the -a option is specified, all resource information cached by the fault manager is listed, including information for components no longer present in the system. If the -i option is specified, the persistent cache identifier for each resource in the fault manager is shown instead of the most recent state and UUID.
* load path: Load the specified module. The specified path must be an absolute path and refer to a module present in one of the defined directories for modules.
* unload module: Unload the specified module. The module name is the one shown in the fmadm config output. The fault manager usually loads and unloads modules automatically based on the system configuration, so this command should seldom be used.
* rotate errlog | fltlog: Schedule a rotation of the specified fault manager log file. The log files are automatically rotated by an entry in the logadm(1M) configuration file that uses this subcommand.
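As a rough illustration of the faulty and load subcommands described above, the following sketch builds the argument lists a script might pass to fmadm. The function names are hypothetical, and only the options named in the list (-a, -i, and an absolute module path) are used.

```python
import posixpath
from typing import List


def fmadm_faulty_argv(show_all: bool = False, show_ids: bool = False) -> List[str]:
    """Build the argument list for 'fmadm faulty' with optional -a / -i."""
    argv = ["fmadm", "faulty"]
    if show_all:
        argv.append("-a")  # include cached resources no longer present
    if show_ids:
        argv.append("-i")  # show persistent cache identifiers instead
    return argv


def fmadm_load_argv(path: str) -> List[str]:
    """Build the argument list for 'fmadm load', enforcing an absolute path."""
    if not posixpath.isabs(path):
        raise ValueError("fmadm load requires an absolute path: %r" % path)
    return ["fmadm", "load", path]
```

On a Solaris host the resulting list could be handed to subprocess.run; the builders themselves run anywhere, which makes them easy to check without access to fmd(1M).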
The fmdump(1M) program enables the system administrator to view any log files associated with fmd(1M) and retrieve specific details of any diagnosis. By default, the fmdump(1M) command shows the fault log; it shows the error log instead if given the -e command-line switch. The fmdump(1M) command also accepts command-line options that select only certain events (see the fmdump(1M) man page for complete details):
* -c class: Select events that match the specified class.
* -t time: Select events that occurred on or after the specified time.
* -T time: Select events that occurred on or before the specified time.
* -u uuid: Select events that match the specified UUID.
Increasingly verbose output can be obtained for any command by specifying -v or -V.
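The selection options above combine freely, so a script that calls fmdump usually ends up assembling an argument list. A minimal sketch follows, assuming the options behave as described; the function and parameter names are hypothetical, and the UUID selector is written -u, matching the fmdump -v -u invocation used in the example later in this extract. Time and class strings are passed through unchanged; see the fmdump(1M) man page for the accepted formats.

```python
from typing import List, Optional


def fmdump_argv(error_log: bool = False,
                event_class: Optional[str] = None,
                since: Optional[str] = None,
                until: Optional[str] = None,
                uuid: Optional[str] = None,
                verbosity: int = 0) -> List[str]:
    """Build an fmdump argument list from the options described above."""
    argv = ["fmdump"]
    if error_log:
        argv.append("-e")          # error log instead of the fault log
    if verbosity == 1:
        argv.append("-v")          # verbose
    elif verbosity >= 2:
        argv.append("-V")          # very verbose
    if event_class is not None:
        argv += ["-c", event_class]
    if since is not None:
        argv += ["-t", since]      # on or after this time
    if until is not None:
        argv += ["-T", until]      # on or before this time
    if uuid is not None:
        argv += ["-u", uuid]
    return argv
```

For instance, fmdump_argv(uuid="64fe6c23-12b7-ccd1-f0a7-b531941738f8", verbosity=1) yields the same command line that appears in the CPU fault example.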
The fmstat(1M) program is designed to report the statistics of the fault management system. If the -m module argument is given, fmstat(1M) reports statistics kept by the specified module. If -m is not specified, fmstat(1M) reports the following statistics for each module (see the fmstat(1M) man page for complete details):
* module: The name of the module as reported by fmadm config
* ev_recv: The number of events received by the module
* ev_acpt: The number of events accepted by the module as relevant to a diagnosis
* wait: The average number of events waiting to be examined by the module
* svc_t: The average service time, in milliseconds, for events received by the module
* %w: The percentage of time that there were events waiting to be processed
* %b: The percentage of time that the module was busy processing events
* open: The number of active cases owned by the module
* solve: The number of cases solved by the module since it was loaded
* memsz: The amount of dynamic memory currently allocated by the module
* bufsz: The amount of persistent buffer space currently allocated by the module
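Because fmstat prints one whitespace-separated row per module under the column names listed above, its output is easy to turn into dictionaries. The sketch below assumes that layout; the SAMPLE text, including the module names and every number in it, is invented for illustration and is not real fmstat output.

```python
from typing import Dict, List

# Invented sample: module names and values are placeholders, not real data.
SAMPLE = """\
module            ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
example-diagnosis       3       1  0.0  10.2  0  0    1     2  3.0K     0
example-agent           1       0  0.0   5.1  0  0    0     0   32b     0
"""


def parse_fmstat(text: str) -> List[Dict[str, str]]:
    """Map each data row to a dict keyed by the header column names."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) == len(header):  # skip rows that do not fit the header
            rows.append(dict(zip(header, fields)))
    return rows


def modules_with_open_cases(rows: List[Dict[str, str]]) -> List[str]:
    """Return the names of modules whose 'open' count is non-zero."""
    return [r["module"] for r in rows if int(r["open"]) > 0]
```

Values are kept as strings because columns such as memsz carry unit suffixes; only the open column is converted to an integer where it is needed.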
An Example of the Predictive Self-Healing Fault Manager
Once a CPU fault has occurred, the administrator might see this message on the console and logged to syslog:
SUNW-MSG-ID: SUN4U-8000-6H, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Sun Oct 17 14:15:50 PDT 2004
PLATFORM: SUNW,Sun-Blade-1000, CSN: -, HOSTNAME: myhost
DESC: The number of errors associated with this CPU has exceeded acceptable levels.
Refer to http://sun.com/msg/SUN4U-8000-6H for more information.
AUTO-RESPONSE: An attempt will be made to remove the affected CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU. Use fmdump -v -u <EVENT-ID> to identify the CPU.
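The console message follows a "KEY: value" layout, with several keys sharing a line, so a script watching syslog can split it into fields. The following is a sketch under that assumption only; parse_fault_message is a hypothetical helper, and the embedded text is an abridged copy of the message above.

```python
import re
from typing import Dict

# Abridged copy of the console message shown above.
MESSAGE = """\
SUNW-MSG-ID: SUN4U-8000-6H, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Sun Oct 17 14:15:50 PDT 2004
PLATFORM: SUNW,Sun-Blade-1000, CSN: -, HOSTNAME: myhost
DESC: The number of errors associated with this CPU has exceeded acceptable levels.
IMPACT: Performance of this system may be affected.
"""

# A key is a run of capitals and hyphens followed by a colon and a space.
_KEY = re.compile(r"\b([A-Z][A-Z-]*[A-Z]): ")


def parse_fault_message(text: str) -> Dict[str, str]:
    """Split a fault message into a dict of KEY -> value strings."""
    flat = " ".join(ln.strip() for ln in text.splitlines() if ln.strip())
    matches = list(_KEY.finditer(flat))
    fields = {}
    for i, m in enumerate(matches):
        # A value runs up to the start of the next key (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(flat)
        fields[m.group(1)] = flat[m.end():end].strip(" ,")
    return fields
```

Joining the lines first means continuation lines, such as the "Refer to ..." sentence under DESC in the full message, are folded into the value of the preceding key.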
The CPU state changes from ok to faulted, the processes using that CPU are terminated, and the CPU is taken offline. The state of the CPU can be viewed by using the psrinfo(1M) command:
0 on-line since 09/27/2004 16:57:30
1 faulted since 10/17/2004 14:15:50
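A monitoring script could pick the faulted processor out of this output with a few lines of parsing. This sketch assumes each line starts with the processor ID followed by its state, as in the two lines above; the helper names are hypothetical.

```python
from typing import Dict, List

# The two psrinfo lines shown above.
PSRINFO_OUTPUT = """\
0 on-line since 09/27/2004 16:57:30
1 faulted since 10/17/2004 14:15:50
"""


def parse_psrinfo(text: str) -> Dict[int, str]:
    """Map processor ID -> state for each line of psrinfo output."""
    states = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].isdigit():
            states[int(parts[0])] = parts[1]
    return states


def faulted_cpus(text: str) -> List[int]:
    """Return the IDs of processors reported as faulted, in ascending order."""
    return sorted(cpu for cpu, state in parse_psrinfo(text).items()
                  if state == "faulted")
```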
Run the fmdump(1M) command given in the fault message, supplying the EVENT-ID (the UUID of the diagnosis), for more information on the fault. The output shows that CPU 1 has a problem and that the component in Slot 1 needs replacing. The text Slot 1, which indicates the location of the defective part, is silk-screened on the motherboard.
fmdump -v -u 64fe6c23-12b7-ccd1-f0a7-b531941738f8
TIME UUID SUNW-MSG-ID
Oct 17 14:15:50.1630 64fe6c23-12b7-ccd1-f0a7-b531941738f8 SUN4U-8000-6H
FRU: hc:///component=Slot 1
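To pull the field-replaceable unit out of this output automatically, a script can pair each event line with the FRU lines that follow it. The sketch below assumes the layout shown above (an event line ending in UUID and message ID, then FRU: lines); parse_fmdump_v is a hypothetical helper, not part of fmdump.

```python
import re
from typing import Dict, List

# The fmdump -v output shown above.
FMDUMP_OUTPUT = """\
TIME UUID SUNW-MSG-ID
Oct 17 14:15:50.1630 64fe6c23-12b7-ccd1-f0a7-b531941738f8 SUN4U-8000-6H
FRU: hc:///component=Slot 1
"""

# An event line ends with a UUID followed by the message ID.
_EVENT = re.compile(
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+(?P<msgid>\S+)\s*$")


def parse_fmdump_v(text: str) -> List[Dict[str, object]]:
    """Return one record per event, each with its UUID, message ID and FRUs."""
    records = []
    for line in text.splitlines():
        stripped = line.strip()
        match = _EVENT.search(stripped)
        if match:
            records.append({"uuid": match.group("uuid"),
                            "msgid": match.group("msgid"),
                            "frus": []})
        elif stripped.startswith("FRU:") and records:
            # Attach the FRU to the most recent event line.
            records[-1]["frus"].append(stripped[len("FRU:"):].strip())
    return records
```

The header line carries no UUID, so it is skipped without special handling.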
Once a replacement CPU is delivered, the defective CPU from Slot 1 can be replaced and re-enabled.