Thursday, December 11, 2008

Know when your drives are failing, with smartd

from linux journal by mike diehl

"Ka-chunk... ka-chunk... ka-chunk... tick... tick... tick... Ka-chunk... ka-chunk..." That's just not a sound you ever want to hear coming from a hard drive. It's the sound of a hard drive trying to move it's read/write heads into a position that they don't seem to want to go to or its trying to read a sector that just isn't there anymore. Of course, modern hard drives have come a long way and are amazingly reliable, but if you work with computers long enough, you're bound to have one fail on you.

I know that a lot has been written about smartd and I was hesitant to pile on even more, but let's just say, for now, that I write about what I know about, and that lately, I've come to know a lot about disk drive failures...

Running smartd on your servers and workstations is just like performing backups; its something you know you should be doing, but probably aren't, or at least not regularly. I'm hoping that this article will help convince you that now's the time to start.

SMART, or Self-Monitoring, Analysis, and Reporting Technology is a capability that's built into almost all modern IDE and SCSI disk drives that allows the drive controler to minitor it's own state of health. What that means is that a SMART-capable drive can give you a clue that its about to fail. All you have to do is listen for these clues, and that's where smartd comes in. Smartd is part of the smartmontools package and runs as a daemon on your system. Periodically, smartd will poll your installed hard drives and, essentally, "ask them" how their doing. Smartmontools supports ATA/ATAPI/SATA-3 to -8 disks and SCSI, so smartd should be usable across a wide range of drives.

Performing a basic smartd configuration is almost rediculously simple to do. Most of us are using Linux distributions that have some sort of package management feature, so installing smartmontools should be fairly simple since the suite has been packaged for all of the major distributions. Once installed, configuration is as easy as it gets. The tool, by default, will discover and scan all of the drives in your system, and the default configuration file seems reasonable. Most of the time, I don't even bother to make changes to the /etc/smartd.conf file. You only need to change the configuration file if you need smartd to handle ill-behaved hardware, or if you need it to perform specific, non-default, functions. Finally, you have to arrange for your system to start smartd as part of the system start-up process. This part depends on which distribution you use. On my Gentoo boxes, I use:

rc-update add smartd default and I'm done.

Now that the suite is installed, and the daemon is running, we should find the occational log entry in our syslog. Here's what I see when I restart smartd:

Dec 1 16:50:32 localhost smartd[6219]: smartd received signal 15: Terminated
Dec 1 16:50:32 localhost smartd[6219]: smartd is exiting (exit status 0)
Dec 1 16:50:33 localhost smartd[2549]: smartd version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Dec 1 16:50:33 localhost smartd[2549]: Home page is http://smartmontools.sourceforge.net/
Dec 1 16:50:33 localhost smartd[2549]: Opened configuration file /etc/smartd.conf
Dec 1 16:50:33 localhost smartd[2549]: Drive: DEVICESCAN, implied '-a' Directive on line 23 of file /etc/smartd.conf
Dec 1 16:50:33 localhost smartd[2549]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Dec 1 16:50:33 localhost smartd[2549]: Problem creating device name scan list
Dec 1 16:50:33 localhost smartd[2549]: Device: /dev/hda, opened
Dec 1 16:50:33 localhost smartd[2549]: Device: /dev/hda, found in smartd database.
Dec 1 16:50:33 localhost smartd[2549]: Device: /dev/hda, is SMART capable. Adding to "monitor" list.
Dec 1 16:50:33 localhost smartd[2549]: Device: /dev/hdc, opened
Dec 1 16:50:33 localhost smartd[2549]: Device: /dev/hdc, packet devices [this device CD/DVD] not SMART capable
Dec 1 16:50:33 localhost smartd[2549]: Monitoring 1 ATA and 0 SCSI devices
Dec 1 16:50:33 localhost smartd[2573]: smartd has fork()ed into background mode. New PID=2573.
Dec 1 16:50:33 localhost smartd[2573]: file /var/run/smartd.pid written containing PID 2573

This is what a successful smartd start looks like. As you can see, my workstation only has one hard drive that is SMART-capable and a CD/DVD drive that isn't. As long as the daemon is running, we'll see log entries indicating if the health status of my drive changes.

But we don't always read our logs. That's OK, the smartmontools suite has a command-line tool that you can use interactively to find out how healthy your drives are. For example, we can use smartctl to find out what type of drive we have:

# smartctl -i /dev/hda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar SE family
Device Model: WDC WD3200JB-00KFA0
Serial Number: WD-WCAMR3566562
Firmware Version: 08.05J08
User Capacity: 320,072,933,376 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Dec 1 16:57:28 2008 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Of course, most of this isn't really interesting and only serves to expose just how old my system really is. (In the future, I'll strip out the heading information for further output) But we can also use smartctl to ask drive about it's general state of health:

# smartctl -H /dev/hda
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

At this point, though, we need to discuss what it means for the "overall-health self-assessment" to result in "PASSED." In practice, I'm finding, it doesn't really mean that much. What its telling you is that the IDE drive controller has detected problems. It does NOT mean that other problems don't exist, just that the IDE controller hasn't seen them yet. I got a passing result on a drive I knew to be bad. Still, this is a valuable check, because if the IDE controller has found a problem, you REALLY need to know about it.

A more thorough test result may be obtained by with the smartctl -t short /dev/hda command. As you might imagine, an even more thorough test result can be had by changing "short" to "long" in the command above. However, this command doesn't immediately return any results. It simply tells you to come back late and ask for the results:

# smartctl -t short /dev/hda
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Dec 1 17:15:50 2008

Use smartctl -X to abort test.

Well, after the test is complete, we can use the smartctl -l selftest /dev/hda command to see the results:

# smartctl -l selftest /dev/hda
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 14942 -

Here is where you see that the drive in my workstation is doing just fine. But in the spirit of spreading the misery, let's take a look at a few drives I have that aren't doing so well.

Here is an ominous message that I found in the log file of my MythTV server:

Nov 27 04:12:51 media smartd[6884]: Device: /dev/hda, SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252

The neat thing about it all is that we use this server every day as a family and we've not noticed anything unusual. Still, the drive says it's about to fail. Guess what? We'll be replacing it soon. I'll probably duplicate the system and replace the drive in one operation and be back up in 30 minutes.

Here's another example. Two of the drives in my home fileserver are failing. I've backed them up and am waiting for the replacement drives to arrive. In the mean time, here is the result of the short selftest on one of them:

# smartctl -l selftest /dev/hdd
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 23678 200910
# 2 Extended offline Completed: read failure 90% 23676 200910
# 3 Short offline Completed: read failure 90% 23676 200910

As it happends, these drive problems were announced via syslog some time before I ever noticed a problem. Take a look:

Nov 27 03:34:10 dominion smartd[30161]: Monitoring 4 ATA and 0 SCSI devices
Nov 27 03:34:11 dominion smartd[30161]: Device: /dev/hdc, 463 Currently unreadable (pending) sectors
Nov 27 03:34:11 dominion smartd[30161]: Device: /dev/hdc, 1210 Offline uncorrectable sectors
Nov 27 03:34:11 dominion smartd[30161]: Device: /dev/hdd, 1430 Currently unreadable (pending) sectors
Nov 27 03:34:11 dominion smartd[30161]: Device: /dev/hdd, 1429 Offline uncorrectable sectors

I've been watching the number of uncorrectable and unreadable sectors increase over the course of a month or so. And that's the point of this article. I knew my drives were failing long before they actually failed. A drive that has one or two bad sectors can be easily recovered or repaired. But this drive isn't going to last long. Because I was using smartd, and watching my logs, I was able to get my data backed up, consider my replacement options, and plan the replacement. The last thing I, or you, need, is to wake up one morning and find that an important server died during the night without warning.... Been there. Done that. Never want to do it again.