Unofficial empeg BBS

#255402 - 03/05/2005 19:02 disk problems & fsck output
Thunder
new poster

Registered: 20/02/2002
Posts: 14
I've been tracking down intermittent disk problems, both when syncing and in the form of 'No Hard Disk Found' errors. I believe I solved the latter by following the guidelines on re-crimping the IDE cables; I haven't had any of those errors in about two weeks.

The problem I have now is that the disk check fails. I followed the fsck instructions listed here, and on the second disk (/dev/hdc4) I get a bunch of errors. I'm running the 2.00 developer release. Does the following output suggest that my hard disk is bad?

empeg:/empeg/bin# fsck -fay /dev/hdc4
Parallelizing fsck version 1.19 (13-Jul-2000)
e2fsck 1.19, 13-Jul-2000 for EXT2 FS 0.5b, 95/08/09
ext2fs_check_if_mount: No such file or directory while determining whether /dev/hdc4 is mounted.
Pass 1: Checking inodes, blocks, and sizes
hdb: irq timeout: status=0xd0 { Busy }
ide0: reset: success
hdb: irq timeout: status=0xd0 { Busy }
ide0: reset: success
hdb: irq timeout: status=0xd0 { Busy }
end_request: I/O error, dev 03:44 (hdb), sector 13369389
hdb: status timeout: status=0xd0 { Busy }
hdb: drive not ready for command
ide0: reset: success
hdb: irq timeout: status=0xd0 { Busy }
ide0: reset: success
hdb: irq timeout: status=0xd0 { Busy }
ide0: reset: success
hdb: irq timeout: status=0xd0 { Busy }
end_request: I/O error, dev 03:44 (hdb), sector 13369389
hda: lost interrupt
hda: lost interrupt
hda: lost interrupt
hda: lost interrupt
hda: lost interrupt
hda: lost interrupt
hda: lost interrupt
hda: lost interrupt
hda: lost interrupt
.
.
.

and it just kept going like this forever (till I pulled the plug).
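
A side note on reading the kernel lines above: `dev 03:44 (hdb)` is a block device number. Here is a minimal sketch of decoding it, assuming the pair is printed as hexadecimal major:minor and that the classic Linux IDE numbering applies (major 3 for the first channel, 64 minors per drive). The function name is made up for illustration.

```python
def decode_ide_dev(dev: str) -> str:
    """Turn a kernel 'MM:mm' device pair into a name such as 'hdb4'.

    Assumes hex major:minor and the classic IDE numbering: major 3 is
    the first channel (hda/hdb), major 22 the second (hdc/hdd), with
    64 minor numbers reserved per drive.
    """
    major_hex, minor_hex = dev.split(":")
    major, minor = int(major_hex, 16), int(minor_hex, 16)
    drives = {3: "ab", 22: "cd"}
    if major not in drives or minor > 127:
        raise ValueError(f"not an IDE device this sketch knows: {dev}")
    letter = drives[major][minor // 64]   # which drive on the channel
    partition = minor % 64                # 0 means the whole disk
    return f"hd{letter}{partition}" if partition else f"hd{letter}"


print(decode_ide_dev("03:44"))
```

Under those assumptions the failing reads are on partition 4 of the drive the kernel calls hdb, which is presumably the same disk that fsck was given as /dev/hdc4.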

Any thoughts? Bad Disk?
thanks,
Thunder


Edited by Thunder (03/05/2005 19:06)

#255403 - 03/05/2005 19:04 Re: disk problems & fsck output [Re: Thunder]
pgrzelak
carpal tunnel

Registered: 15/08/2000
Posts: 4859
Loc: New Jersey, USA
Well, not necessarily that the disk is bad. Either the cable or the IDE header might be, though. If you have already had to crimp the cable once, you might want to replace it.
_________________________
Paul Grzelak
200GB with 48MB RAM, Illuminated Buttons and Digital Outputs

#255404 - 03/05/2005 19:11 Re: disk problems & fsck output [Re: Thunder]
Thunder
new poster

Registered: 20/02/2002
Posts: 14
Hmm... both problems had been going on regularly, and I only just re-crimped the cables. The 'No hard disk found' errors are now gone. Could I swap the drives and see if the problem stays with the same disk? Would that rule out the cable? (The first disk, hda4, doesn't have any problems with fsck.)

thanks,
Thunder

#255405 - 03/05/2005 21:51 Re: disk problems & fsck output [Re: Thunder]
tfabris
carpal tunnel

Registered: 20/12/1999
Posts: 31571
Loc: Seattle, WA
If you're sure the cabling problem is solved, it might be time to completely wipe the disks with the builder image and start over from scratch. A long period of intermittent disk failures might cause enough corruption that FSCKs won't do the trick.
_________________________
Tony Fabris

#255406 - 03/05/2005 21:56 Re: disk problems & fsck output [Re: tfabris]
Thunder
new poster

Registered: 20/02/2002
Posts: 14
Alright, thanks for the suggestion. I can't know for sure that the cabling problem is fixed, but I was getting those errors every couple days, and I haven't had any of them for two weeks now.

Will probably give it a few more days, and try rebuilding this weekend.

Thanks for the input.
Thunder

#255407 - 04/05/2005 00:06 Re: disk problems & fsck output [Re: Thunder]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14482
Loc: Canada
In my experience (albeit somewhat limited to only about 20 units thus far), I've never found a bad cable. It has *always* been the faulty solder job on the IDE headers.

Cheers

#255408 - 04/05/2005 04:42 Re: disk problems & fsck output [Re: mlord]
Thunder
new poster

Registered: 20/02/2002
Posts: 14
Well I did inspect the IDE solder joints, and all appeared to be OK. Didn't get down and dirty with the magnifying glass or anything, but as far as I could tell they were OK.

Does the output definitely seem to be corruption of some sort, and not a faulty sector on the drive?

thanks,
Thunder

#255409 - 04/05/2005 04:48 Re: disk problems & fsck output [Re: Thunder]
tfabris
carpal tunnel

Registered: 20/12/1999
Posts: 31571
Loc: Seattle, WA
Actually, "hda: lost interrupt" usually means hardware trouble with the cable, the connectors, or the controller.

I take back what I said about doing the builder image. You're still having disk hardware trouble and doing FSCKs and builder images isn't going to improve that.
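
For anyone who wants to see at a glance which device each complaint is about, here is a rough sketch that tallies the kernel messages per device. The `LOG` string is a shortened excerpt of the output in the first post, and the `tally` helper is hypothetical; it only counts lines and does not diagnose anything.

```python
import re
from collections import Counter

# Shortened excerpt of the kernel output from the first post.
LOG = """\
hdb: irq timeout: status=0xd0 { Busy }
ide0: reset: success
end_request: I/O error, dev 03:44 (hdb), sector 13369389
hdb: status timeout: status=0xd0 { Busy }
hdb: drive not ready for command
hda: lost interrupt
hda: lost interrupt
"""


def tally(log: str) -> Counter:
    """Count (device, message) pairs in kernel IDE log lines."""
    counts = Counter()
    for line in log.splitlines():
        # Lines of interest start with a drive (hdX) or channel (ideN) name.
        m = re.match(r"(hd[a-d]|ide\d): ([^:]+)", line)
        if m:
            counts[(m.group(1), m.group(2).strip())] += 1
    return counts


for (dev, msg), n in tally(LOG).most_common():
    print(f"{n:3d}  {dev}: {msg}")
```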
_________________________
Tony Fabris

#255410 - 04/05/2005 10:59 Re: disk problems & fsck output [Re: Thunder]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14482
Loc: Canada
Definitely a faulty hardware connection somewhere.

#255411 - 04/05/2005 22:06 Re: disk problems & fsck output [Re: tfabris]
Thunder
new poster

Registered: 20/02/2002
Posts: 14
OK, updates... I swapped the drive positions on the cable (but left the master/slave jumpers alone) and re-ran the fsck tests. Same errors, same drive (hdc). This led me to believe it may be the drive and not the cable. I also re-inspected the IDE header, and all looks good.

So, I ran the builder image (on each drive individually, in turn) and after running the format twice on each disk, got a format to complete successfully, and stress test returned no errors after ~30 minutes or so on each disk.

Next, I re-installed developer kernel 2.0 and re-ran fsck. Both disks ran without error! YAY, right? Hmm... I had read about the smartctl utility and still suspected my hard disk might have a bad sector, so I installed Hijack 426 and copied the smartctl tool over.

All tests described here on HDA completed successfully; however, I get the following error output on HDC:


empeg:/drive0/var# ./smartctl -s on /dev/hdc
smartctl version 5.33 [arm-empeg-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Warning! Drive Identity Structure error: invalid SMART checksum.
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.

empeg:/drive0/var# ./smartctl -l error /dev/hdc
smartctl version 5.33 [arm-empeg-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Warning! Drive Identity Structure error: invalid SMART checksum.
=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 4

ATA Error Count: 0
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 0 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was in an unknown state.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
51 0e 03 00 ce 89 dd at LBA = 0x0d89ce00 = 227134976

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
40 20 0c 02 00 00 00 50 12d+10:16:38.403 READ VERIFY SECTOR(S)
00 00 00 00 00 00 00 00 00:00:56.579 NOP [Abort queued commands]

Error -1 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was in an unknown state.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
ce 89 f0 51 03 00 ce Error: ABRT

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
00 ce 50 20 8b 05 00 89 00:00:16.384 NOP [Reserved subcommand]
00 00 00 00 00 00 00 00 00:00:00.000 NOP [Abort queued commands]

Error -2 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was in an unknown state.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
dd 00 89 ce f0 51 03

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
02 dd 89 ce 50 20 ae 03 12:29:17.696 [RESERVED]

Error -3 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 03 dd 89 ce f0 Error: UNC at LBA = 0x00ce89dd = 13535709

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
20 00 03 dd 89 ce 50 00 6d+06:35:14.519 READ SECTOR(S)

Error -4 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was in an unknown state.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 03 ce 89 dd 0e 8a

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
dc a6 03 00 00 40 04 c9 42d+17:21:42.537 BOOT POST-BOOT [RET-4]
dc 00 00 00 00 00 04 00 42d+17:21:42.537 BOOT POST-BOOT [RET-4]

empeg:/drive0/var#
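
The LBA figures that smartctl prints next to some entries can be reproduced from the register bytes. Here is a minimal sketch, assuming standard 28-bit LBA addressing (SN, CL and CH hold the low 24 bits, and the low nibble of DH holds the top four) and 512-byte sectors. The drive capacity is the "User Capacity" value from this drive's `smartctl -a` report, and the function name is made up for illustration.

```python
def lba28(sn: int, cl: int, ch: int, dh: int) -> int:
    """Assemble a 28-bit LBA from ATA task-file registers.

    SN holds bits 0-7, CL bits 8-15, CH bits 16-23, and the low
    nibble of DH holds bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register rows copied from the error log above (ER ST SC SN CL CH DH).
error_0 = [0x51, 0x0E, 0x03, 0x00, 0xCE, 0x89, 0xDD]
error_3 = [0x40, 0x51, 0x03, 0xDD, 0x89, 0xCE, 0xF0]

# Assumption: 512-byte sectors; capacity as reported by smartctl -a.
TOTAL_SECTORS = 30_005_821_440 // 512

for name, regs in (("Error 0", error_0), ("Error -3", error_3)):
    lba = lba28(*regs[3:])  # the last four registers are SN, CL, CH, DH
    print(f"{name}: LBA {lba} (0x{lba:08x}), within drive: {lba < TOTAL_SECTORS}")
```

Both results match the LBA values smartctl printed. Under these assumptions the "Error 0" address lies past the end of a 30 GB drive, which fits with smartctl's own warning that the error count and the log pointer are inconsistent, so some of these entries may be misaligned bytes rather than real error records.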



So, any ideas?
The "Warning! Drive Identity Structure error: invalid SMART checksum." message seems odd. Was it a download problem with the executable?

What do you think now? Still a cable or other problem?

I'm going to try another hard disk, and go back through the builder/developer/fsck/smartctl tests again...

thanks for all the support guys!
Thunder

#255412 - 05/05/2005 00:55 Re: disk problems & fsck output [Re: Thunder]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14482
Loc: Canada
Those messages may be due to the disk simply not understanding/supporting the S.M.A.R.T. commands.

What does smartctl -a /dev/hd? show?


Cheers

#255413 - 05/05/2005 15:42 Re: disk problems & fsck output [Re: mlord]
Thunder
new poster

Registered: 20/02/2002
Posts: 14
Both disks are the 'stock' 30GB Fujitsu disks.
Here's the output from 'smartctl -a' for the first disk (/dev/hda)

empeg:/drive0/var# ./smartctl -a /dev/hda
smartctl version 5.33 [arm-empeg-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Warning! Drive Identity Structure error: invalid SMART checksum.
=== START OF INFORMATION SECTION ===
Device Model: FUJITSU MHL2300AT
Serial Number: 01014825
Firmware Version: 3022
User Capacity: 30,005,821,440 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 5
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sat Sep 11 16:54:27 2004 /usr/local/armtools

==> WARNING: This drive's firmware has a harmless Drive Identity Structure
checksum error bug.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 364) seconds.
Offline data collection
capabilities: (0x1b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 48) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x100b 100 099 032 Pre-fail Always - 161353
2 Throughput_Performance 0x0005 100 100 020 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 100 100 025 Pre-fail Always - 2
4 Start_Stop_Count 0x0012 065 065 016 Old_age Always - 21077
5 Reallocated_Sector_Ct 0x0033 100 100 024 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 090 020 Pre-fail Always - 665
8 Seek_Time_Performance 0x0005 100 100 019 Pre-fail Offline - 0
9 Power_On_Seconds 0x0012 097 097 020 Old_age Always - 581h+54m+49s
10 Spin_Retry_Count 0x0013 100 100 020 Pre-fail Always - 0
12 Power_Cycle_Count 0x0012 074 074 020 Old_age Always - 4048
196 Reallocated_Event_Count 0x0033 100 100 024 Pre-fail Always - 0
198 Offline_Uncorrectable 0x0010 100 100 020 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x000b 100 100 020 Pre-fail Always - 16463
203 Run_Out_Cancel 0x0002 097 097 020 Old_age Always - 416612089872

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 581 -
# 2 Short offline Completed without error 00% 580 -

Device does not support Selective Self Tests/Logging
empeg:/drive0/var#


AND output from drive 2 (the one with the errors, /dev/hdc )

empeg:/drive0/var# ./smartctl -a /dev/hdc
smartctl version 5.33 [arm-empeg-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Warning! Drive Identity Structure error: invalid SMART checksum.
=== START OF INFORMATION SECTION ===
Device Model: FUJITSU MHL2300AT
Serial Number: 01012327
Firmware Version: 3022
User Capacity: 30,005,821,440 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 5
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sat Sep 11 16:56:57 2004 /usr/local/armtools

==> WARNING: This drive's firmware has a harmless Drive Identity Structure
checksum error bug.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 1) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 364) seconds.
Offline data collection
capabilities: (0x1b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 48) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x100b 100 091 032 Pre-fail Always - 85110
2 Throughput_Performance 0x0005 100 100 020 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 100 100 025 Pre-fail Always - 2
4 Start_Stop_Count 0x0012 051 051 016 Old_age Always - 29040
5 Reallocated_Sector_Ct 0x0033 099 099 024 Pre-fail Always - 3
7 Seek_Error_Rate 0x000b 100 100 020 Pre-fail Always - 3001
8 Seek_Time_Performance 0x0005 100 100 019 Pre-fail Offline - 0
9 Power_On_Seconds 0x0012 061 061 020 Old_age Always - 5865h+54m+12s
10 Spin_Retry_Count 0x0013 100 100 020 Pre-fail Always - 0
12 Power_Cycle_Count 0x0012 074 074 020 Old_age Always - 3953
196 Reallocated_Event_Count 0x0033 099 099 024 Pre-fail Always - 3
198 Offline_Uncorrectable 0x0010 100 100 020 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x000b 100 099 020 Pre-fail Always - 3102
203 Run_Out_Cancel 0x0002 096 096 020 Old_age Always - 416610910203

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 4

ATA Error Count: 0
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 0 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was in an unknown state.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
51 0e 03 00 ce 89 dd at LBA = 0x0d89ce00 = 227134976

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
40 20 0c 02 00 00 00 50 12d+10:16:38.403 READ VERIFY SECTOR(S)
00 00 00 00 00 00 00 00 00:00:56.579 NOP [Abort queued commands]

Error -1 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was in an unknown state.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
ce 89 f0 51 03 00 ce Error: ABRT

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
00 ce 50 20 8b 05 00 89 00:00:16.384 NOP [Reserved subcommand]
00 00 00 00 00 00 00 00 00:00:00.000 NOP [Abort queued commands]

Error -2 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was in an unknown state.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
dd 00 89 ce f0 51 03

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
02 dd 89 ce 50 20 ae 03 12:29:17.696 [RESERVED]

Error -3 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 03 dd 89 ce f0 Error: UNC at LBA = 0x00ce89dd = 13535709

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
20 00 03 dd 89 ce 50 00 6d+06:35:14.519 READ SECTOR(S)

Error -4 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was in an unknown state.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 03 ce 89 dd 0e 8a

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
dc a6 03 00 00 40 04 c9 42d+17:21:42.537 BOOT POST-BOOT [RET-4]
dc 00 00 00 00 00 04 00 42d+17:21:42.537 BOOT POST-BOOT [RET-4]

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]


Device does not support Selective Self Tests/Logging
empeg:/drive0/var#
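
With two long reports it is easy to miss where the drives differ. Here is a rough sketch that pulls the raw values out of the attribute rows and prints the ones that differ. The rows are a three-line excerpt copied from each report above, the `parse_attributes` helper is hypothetical, and it assumes whitespace-separated rows with the raw value in the tenth column.

```python
def parse_attributes(text: str) -> dict:
    """Map attribute name -> raw value from smartctl attribute rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = fields[9]
    return attrs


# Excerpts copied from the two reports above.
HDA = """\
5 Reallocated_Sector_Ct 0x0033 100 100 024 Pre-fail Always - 0
9 Power_On_Seconds 0x0012 097 097 020 Old_age Always - 581h+54m+49s
196 Reallocated_Event_Count 0x0033 100 100 024 Pre-fail Always - 0
"""
HDC = """\
5 Reallocated_Sector_Ct 0x0033 099 099 024 Pre-fail Always - 3
9 Power_On_Seconds 0x0012 061 061 020 Old_age Always - 5865h+54m+12s
196 Reallocated_Event_Count 0x0033 099 099 024 Pre-fail Always - 3
"""

hda, hdc = parse_attributes(HDA), parse_attributes(HDC)
for name in hda:
    if hda[name] != hdc.get(name):
        print(f"{name}: hda={hda[name]} hdc={hdc.get(name)}")
```

In this excerpt the second drive shows three reallocated sectors and roughly ten times the powered-on hours of the first.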

#255414 - 05/05/2005 19:43 Re: disk problems & fsck output [Re: Thunder]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14482
Loc: Canada
How about:

smartctl -t long /dev/hdc
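
The long test runs in the background on the drive, so the result has to be read back later. Here is a small sketch of working out how long to wait, assuming the two-line "Extended self-test routine / recommended polling time" layout seen in the `smartctl -a` output earlier in the thread. The helper name is made up, and the sketch only prints the commands rather than running them.

```python
import re

# Excerpt of the smartctl -a report for /dev/hdc posted earlier.
REPORT = """\
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 48) minutes.
"""


def extended_test_minutes(report: str) -> int:
    """Return the drive's recommended polling time for the extended test."""
    m = re.search(
        r"Extended self-test routine\s+recommended polling time:\s*\(\s*(\d+)\)",
        report,
    )
    if m is None:
        raise ValueError("no extended self-test polling time in report")
    return int(m.group(1))


minutes = extended_test_minutes(REPORT)
print("start:  smartctl -t long /dev/hdc")
print(f"wait:   about {minutes} minutes")
print("review: smartctl -l selftest /dev/hdc")
```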

#255415 - 05/05/2005 20:36 Re: disk problems & fsck output [Re: mlord]
Thunder
new poster

Registered: 20/02/2002
Posts: 14
I think it's all a mute point! I installed another 30 GB disk in place of the one that's been acting up. Formatted it, installed the developer image, and everything has checked out fine.

I'm going to chalk this one up to a bad disk I think.

Thanks for all the help you guys.

#255416 - 05/05/2005 20:54 Re: disk problems & fsck output [Re: Thunder]
wfaulk
carpal tunnel

Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
"moot"
_________________________
Bitt Faulk

#255417 - 05/05/2005 22:26 Re: disk problems & fsck output [Re: wfaulk]
Thunder
new poster

Registered: 20/02/2002
Posts: 14
You say "potato", I say "tomato"

#255418 - 05/05/2005 22:58 Re: disk problems & fsck output [Re: Thunder]
tman
carpal tunnel

Registered: 24/12/2001
Posts: 5528
I say Solanum Tuberosum and Solanum Lycopersicum
