Why hard drives fail…


Unfortunately, I have to admit that modern storage media do not ‘live’ long. Fault-free operation of a modern hard drive for more than a year is a rarity. The reason is simple: competition in the hard drive market is fierce, and manufacturers have no time to refine their technologies thoroughly. Moreover, the requirements placed on hard drives (speed, noise, density, etc.) have recently increased considerably, which undermines manufacturers’ efforts to develop a modern and reliable storage device in a short time. As a result, many companies refuse to give their dealers a long-term warranty and have cut it to one year for some HDD families.

It should be mentioned, unofficially, that dealers’ policies are changing too, as a chain reaction. For example, faced with a huge volume of warranty returns, many wholesale dealers have tightened their acceptance requirements and now refuse warranty service for the slightest damage to a drive (often damage that does not affect operation, such as a small crack in a plastic connector or a scratch on the enclosure or board). All these negative factors ultimately hurt the so-called end users.
It would seem that if the warranty has not expired and the drive is not physically damaged, a drive failure is no cause for worry, except perhaps for the cost of computer downtime. However, the information stored on the drive is often crucial, and its loss, which is inevitable when a drive fails, is a real tragedy. This is especially so if the drive failed in the accounting department and held a two-year balance that is soon due at the tax authorities.

Hard drive data recovery is a complicated technical process whose complexity depends on the severity of the malfunction. It can involve difficult software and hardware techniques, including prompt replacement of the drive’s electronic and mechanical assemblies. I would like to emphasize that working with a hard drive at the low physical level (in technological mode) requires special equipment and knowledge that is not documented anywhere. So there is no sense in taking the drive to the service centre of some computer firm, or giving it to a person who is not a specialist in hard drives. You might think this sounds like advertising, but it has solid grounds. Quite often users bring us a drive for data recovery or simple repair that shows visible signs of an unskilled repair attempt (sometimes the signs are at the software level, but they are still signs), and sometimes the client does not hide the fact that he or she gave it to some firm before coming to us. After such unskilled work it is sometimes hard for our experts to bring the drive back to normal operating condition. For a simple repair, the damage done by unskilled repairers is not too serious, and the centre can always buy a faulty drive to use for spare parts. But when the data turns out to be unrecoverable, the user can be really upset by damage caused by his or her own negligence.

So, let’s go into more detail on how hard drives fail and how we categorize the malfunction when you lose access to your data. There are four malfunction categories.

1. So-called logical malfunction.

The drive is physically in perfect order: it recalibrates at power-on, makes no unusual sounds, is recognized in the BIOS, and low-level programs cannot find a single bad block. (To avoid confusion over terminology: here and below, ‘recalibration’ means the series of head positioning moves the drive performs at power-on to set up mechanical parameters that can vary with ambient temperature. It has nothing to do with Norton Calibrate, an old utility for checking the disk surface that is destructive because it can destroy data on the disk.)

In spite of that, the operating system cannot boot from it, and if you connect it as a ‘second’ drive or boot from a diskette, you will not see the partitions containing your data. Alternatively, the partitions are there (as drive letters only), but reading a directory produces an error message or an empty directory. Sometimes, instead of all your folders, you see a single large file with a weird name.

Do not write anything to this drive, to avoid losing data through misuse. Checking it with programs like ScanDisk or NDD will not help, because they are designed to work only with logically correct partitions, and they can write to the disk in order to ‘correct’ it if you choose that option. After such changes it is extremely difficult to recover data, and we see many such cases. Data recovery here is done in software, by low-level reconstruction of partitions and file structures based on thorough knowledge of those structures. In any case, diagnostics in our service centre is free, so if the data is really important to you, contact our specialist.

It is worth mentioning some pitfalls. Some modern drives have a special mode enabled in which a logical sector that was read with an error (for example, because of a surface defect) is automatically reassigned to a working sector, similarly to a remap. You thus lose the data from that sector without even noticing. That is why you should avoid running additional tests on the drive, even read-only ones, although this recommendation may seem paranoid to some users. On the positive side, the effect is very rare and appears in a limited number of HDD models, for example in some IBMs. The purpose of this article is to explain the main principles of working with drives and to warn users about possible failures, for their own benefit. How to act is of course up to you.

Another unpleasant pitfall is the automatic S.M.A.R.T. tests built into modern drives. They start on their own (if SMART is on, and on most drives it usually is) and check the quality of the disk surface; sectors can be remapped in the process, and the data in those sectors can be lost. Overall, we do not recommend keeping SMART on, because in our experience it brings no benefit and increases the chance of a failure or crash. This follows from studies of firmware fragments of various HDD models carried out by our specialists.

And finally, let us answer the obvious question: “So how are you not afraid to test the drive and read information from it, given that sectors may be remapped during reads and writes?” The fact is that, when required, we can turn remapping off using special technological commands. On top of that, data recovery usually involves slowly copying the data to another working drive, provided by the client or, if that is not possible, by the centre’s specialists on an agreed basis (as a rule, free of charge).

2. Electronics malfunction

If the disk controller board has visible damage (burn holes in chips, chipped-off components, etc.), no further diagnostics is needed: the electronics have to be repaired. Do not power on such a drive, to avoid further damage or fire (overheated chips quite often cause a small open flame).

If there is no visible damage, the malfunction has to be identified by its symptoms.

The first and most common one: at power-on nothing happens at all. The drive is silent and the spindle motor does not even spin up, or it tries to spin up but cannot reach the required speed. Here lies another pitfall: a similar symptom can be caused by a seized motor, or by heads that have fallen onto the disk and stuck to it (this happens with practically all modern drives, because the heads are polished so perfectly that a diffusion effect occurs). From time to time we come across tips in reviews on what to do in such situations, namely to turn the disk by hand in the direction of spindle rotation. We do not recommend following this advice. With a seized motor it will not help, because the disk can hardly be turned even with pliers. With stuck heads it can damage the head suspension, and the slightest deformation of the suspension makes normal reading impossible.

Furthermore, when the heads are torn free from the surface there is no air cushion, so intense friction damages the head slider and leaves four visible dots on the disk. This is genuine mechanical surface damage that may later lead to a head crash.

Second malfunction: the disk spins up normally but the heads do not unpark, and you hear a typical clicking sound. This happens rarely, because the head positioning control (servo system) and the three-phase driver for the spindle motor sit on one chip, and if it fails, both fail together. Sometimes the electronics are not to blame at all, and the heads do not unpark because the actuator’s voice coil has broken off. Fixing such a malfunction requires opening the drive in a clean room and other high-tech operations.

Sometimes it is the other way round: the motor does not spin up and the drive makes a sharp vibrating or ticking sound. This is the drive’s emergency system detecting that the disk does not spin up because of stuck heads, so it tries to unpark the heads by pulsing high current through the actuator coil. Most probably it will not succeed, because (see above) heads on modern drives stick firmly. In this case the underlying fault is simple: the output circuit (switches) of the spindle motor driver has burnt out. It is better not to let the drive rattle like this, because it mechanically damages the head suspension and more. Sometimes the same thing happens even with a working controller board, when turns in the motor windings are short-circuited; this is usually accompanied by strong heating of the control chips and switches.

It is worth mentioning that this symptom is virtually harmless if the hard drive uses AirLock™, a technology developed by Quantum. Its key element is a lock that does not let the actuator unpark until the disk reaches a safe speed. The lock also helps in the event of a shock or drop. Nowadays airlock technology is used in the majority of hard drives.

Third malfunction: the disk works normally, recalibrates at power-on and makes no unusual sounds, but it is not recognized in the BIOS, or the model name does not match what is written on the drive or contains unreadable symbols. In such cases the main interface chip on the circuit board is often faulty. Writing to such a drive is not recommended, because you can damage data. By the way, a closer look at the IDE connector quite often reveals sunken or broken pins that carry one of the interface signals. Restoring the pins is inexpensive, and you get a working hard drive back.

And finally, the fourth malfunction is a defect in chips that degrade through thermal expansion (temperature gradient). It shows up as the drive heats up: the drive works for some time and then starts grinding, clicking or stopping the motor. This happens mostly with Quantum CX, LA, LB and LC, and occasionally with the Quantum LCT20, LM+ and AS+ series. Sometimes the chip defect appears on Fujitsu drives. Such malfunctions are cured by replacing the chip using a special technique, so whether repair is possible depends on whether the service centre has working chips in stock.

It should be noted that in some models overheating symptoms do not necessarily point to a malfunction of the controller or the chips on it. For example, the Quantum AS+ often clicks because its switch gradually degrades from overheating. ‘Switch’ is the technical name of a special chip that is installed inside the enclosure close to the heads and is designed not only for switching between heads but also for signal pre-amplification. That is why this chip is frequently called PreAmp in documentation.

The second example is the failing Fujitsu MPG series. Here the chips have nothing to do with the malfunction; it is caused by a very subtle factory defect.

Such malfunctions also occur in IBM drives, and again the electronics are not entirely to blame.

Overall summary for electronics failures: the cause is overheating, a major power failure, or a poor-quality power supply unit. Sometimes the cause is physical damage to the HDD’s circuit board, but statistically this is rare.

3. Destruction of service information.

This is one of the most common malfunctions of modern HDDs. Service information is recovered in software, but that does not make the work less complex or time-consuming. Moreover, the recovery technique depends on the goal. For a simple repair there is no need to piece the service information together from the surviving parts of the destroyed modules: a special technology makes it possible to write a full set of equivalent service information, for example from a working drive of the same model, and then launch full factory self-testing for complete calibration. Without that calibration, mismatched settings would leave the drive either not working at all or returning scrambled data.
For data recovery the method described above is not acceptable. Only what has been destroyed is recovered, and this is done manually, stage by stage, checking the result each time, because the procedure simply cannot be automated. Recovering some modules is really laborious, for example the adaptive settings (if they are lost or mismatched, the drive may be unable to read or position its heads, i.e. it may not see the servo information and will click with its head stack). Such settings can be properly tuned only by the built-in factory calibrator, and that process erases user data, since it performs a series of writes to the drive to measure current and choose the matching amplification of head signals. That is why manual adjustment is used: it works without writing, and that is why it takes so long.

Very often the media defect list in the service information gets damaged. It is recorded in the service area of every HDD and lists the coordinates of defective (bad) or unstable sectors, or entire tracks, of that particular drive, so that such sectors never appear in the user area. As a reminder, service information is mainly damaged by drive failures (or power failures) during writes, and writes to the service information are often going on constantly, for example when S.M.A.R.T. parameters are updated. That is precisely why we advise TURNING IT OFF, although this does not always produce the desired result: analysis of firmware fragments of some HDD models clearly shows that the drive sometimes still updates the SMART service area even when SMART is turned off. Apparently this is a plain programming error.

The defect list can also be written at the moment a bad sector is remapped. Accordingly, if the system is unstable you can get unexpected results. For example, when the defect list is damaged we lose the map of bad sector locations, and if the list is zeroed or a new one is written over the old one, the data becomes inaccessible, because many sectors or groups of sectors on the drive end up shuffled. The initial list is prepared at the factory (yes, every drive ALREADY has defects, even a brand-new one), so the user’s first writes are laid out with that initial list taken into account. And of course it goes without saying that the contents of such lists are unique to every HDD.
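Why a zeroed defect list scrambles the data can be shown with a toy model. The Python sketch below assumes a simple ‘sector slipping’ scheme, in which each defective physical sector is skipped and all later logical sectors shift by one. Real drives use vendor-specific list formats and physical coordinates, so the function name logical_to_physical and the flat sector numbering are illustrative assumptions only.

```python
def logical_to_physical(lba, defects):
    """Translate a logical sector number by skipping listed defects."""
    physical = lba
    for bad in sorted(defects):
        if bad <= physical:
            physical += 1      # slip past the defective sector
        else:
            break
    return physical


def demo():
    factory_list = [2, 5]      # hypothetical factory defect list
    # Data is laid out on the platter using the factory list.
    platter = {logical_to_physical(n, factory_list): f"data{n}"
               for n in range(6)}
    # Reading back with the correct list and with a zeroed list.
    good = [platter.get(logical_to_physical(n, factory_list), "<defect>")
            for n in range(6)]
    zeroed = [platter.get(logical_to_physical(n, []), "<defect>")
              for n in range(6)]
    return good, zeroed


if __name__ == "__main__":
    good, zeroed = demo()
    print(good)
    print(zeroed)
```

With the factory list every logical sector returns its own data. With the zeroed list, logical sector 2 lands on a defect and logical sector 3 returns what was written as sector 2, which is exactly the kind of shift described above.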

Now for the symptoms of such malfunctions. They vary, and we will consider only the main ones.

1. Abnormal recalibration at power-on. Normally the drive reads vital service information to tune itself for further operation. If service modules are damaged, the drive aborts recalibration but does not stop the spindle motor. Such a drive can be accessed only in technological mode; in user mode the BIOS will only report an error when identifying its parameters.

2. Recalibration is fine, but the model name or drive parameters are wrong. The name can also contain unrecognisable symbols, as happens with an interface bus malfunction. A common case is the slim Maxtor that is detected as Maxtor ATHENA instead of 2B020H1. Such model names are the internal names used by the developers of the drive firmware. If the drive reports such a name, it has switched to a special safe mode in which it can only be worked with through technological mode. Drives switch to safe mode only when the service information is faulty.

3. Primary master hard disk fail. This message is displayed when the computer starts and means that sector zero, which holds the partition table, cannot be read from the drive. Usually recalibration is fine and the drive is correctly detected in the BIOS, but because the defect lists are destroyed the drive has blocked access to the data.

We still strongly advise you not to tamper with the drive if the data on it is important, and not to take it to dubious companies. That said, in this case the service modules are recovered purely in software, and user experiments with software are virtually safe, because no publicly available programs give access to technological mode. The only exceptions are a few programs that you can find on the HDD manufacturer’s website, chiefly the so-called firmware updates. The firmware is the drive’s microprogram, which is part of the service information, and an update eliminates shortcomings and errors in it. Although these programs do not use technological modes, they can still indirectly rewrite the service area, which can have unexpected consequences if that area was already damaged. It has to be admitted, though, that manufacturers build in a thorough preliminary check of the drive’s health, and in most cases the program will refuse to update if it finds even one fault in the service area or the electronics.

We would also like to warn you that the service area is rewritten when UDMA modes are switched with the special tools that many users post on their websites. Be careful. Moreover, switching the UDMA mode is not needed even on working drives. The reason is simple: at the factory the drive is set to its maximum transfer mode, and if the motherboard falls back to a lower mode, it indicates either a problem with the chipset (an old board or driver) or that the drive is connected with a 40-conductor interface cable when UDMA mode 4 or mode 5 (UDMA66 and UDMA100 respectively) is expected. Drives that support these modes have to be connected with an 80-conductor cable (if the motherboard’s controller supports them).
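The fallback logic described here can be summarised in a few lines. The following Python sketch is a simplified model, not how any real BIOS or driver is written: the function name negotiated_udma_mode is hypothetical, and the only rule it encodes is that the host and drive settle on the highest mode both support, capped at mode 2 (UDMA33) when a 40-conductor cable is used.

```python
# Nominal peak transfer rates per Ultra DMA mode, in MB/s.
UDMA_RATE_MB_S = {0: 16.7, 1: 25.0, 2: 33.3, 3: 44.4, 4: 66.7, 5: 100.0}


def negotiated_udma_mode(drive_max, host_max, cable_conductors):
    """Return the Ultra DMA mode a drive and host would settle on.

    Simplified rule: the lower of the two maximum modes, limited to
    mode 2 when the cable has only 40 conductors.
    """
    if cable_conductors not in (40, 80):
        raise ValueError("cable must have 40 or 80 conductors")
    mode = min(drive_max, host_max)
    if cable_conductors == 40:
        mode = min(mode, 2)
    return mode


if __name__ == "__main__":
    for cable in (40, 80):
        mode = negotiated_udma_mode(drive_max=5, host_max=5,
                                    cable_conductors=cable)
        print(cable, "conductors -> mode", mode, UDMA_RATE_MB_S[mode], "MB/s")
```

In this model a UDMA100 drive on a 40-conductor cable ends up in mode 2, which matches the point above: the lower mode is a symptom of the cable or chipset, not something to fix by reprogramming the drive.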

4. Physical damage to the hard drive or its mechanical parts.

Since the mechanical part of an HDD is quite delicate at modern recording densities, such malfunctions are common and are not always caused by normal wear and tear, though that happens too. Below are the most common malfunctions related to the disk surface.

Bad blocks or bad sectors. As a rule, the user finds out about them in one of the following ways: while running a preventive check of the drive with a utility; while formatting a brand-new, clean hard drive and getting a message about bad clusters; or when the system crashes for no apparent reason and displays an error saying that it cannot boot.

A sector is deemed bad if the hard drive controller responds to a read command with an error, and the error is registered either through BIOS functions (if the drive is tested under an OS) or by special tools that, as a rule, bypass the BIOS and work directly through the controller ports. The error itself is most often an ECC mismatch in the sector; more rarely it is a servo system failure. The abbreviation ECC is probably not new to you if you have been around computers for a long time. It stands for Error Correction Code, a complex algorithm that increases the chance of preserving the data in a sector when, for example, a small part of it has been damaged. To make this possible, every physical sector on the disk actually holds slightly more than 512 bytes. An ECC error, in turn, is mainly caused by physical damage to the disk area where the sector is located. The cause can be, for example, a microparticle trapped inside the enclosure after a shock that damages a radial segment of the surface, or a degrading head that distorts the signal when writing.
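The idea of a sector that carries ‘slightly more than 512 bytes’ can be illustrated with a toy example. The Python sketch below appends a 4-byte CRC32 to each 512-byte payload and reports a read error when the stored and recomputed values disagree. This is a deliberate simplification and an assumption on our part: CRC32 only detects damage, whereas the codes used in real drives can also correct a limited amount of it, and the function names are hypothetical.

```python
import zlib

PAYLOAD_SIZE = 512   # user data per sector
CHECK_SIZE = 4       # extra bytes stored alongside the data


def encode_sector(data: bytes) -> bytes:
    """Return the payload followed by its CRC32 check bytes."""
    if len(data) != PAYLOAD_SIZE:
        raise ValueError("payload must be exactly 512 bytes")
    return data + zlib.crc32(data).to_bytes(CHECK_SIZE, "big")


def read_sector(raw: bytes) -> bytes:
    """Return the payload, or raise OSError if the check bytes mismatch."""
    data = raw[:PAYLOAD_SIZE]
    stored = int.from_bytes(raw[PAYLOAD_SIZE:], "big")
    if zlib.crc32(data) != stored:
        raise OSError("bad sector: check bytes do not match the data")
    return data


if __name__ == "__main__":
    raw = encode_sector(b"A" * PAYLOAD_SIZE)
    print(len(raw))                     # payload plus check bytes
    damaged = bytearray(raw)
    damaged[100] ^= 0x01                # flip one bit, as surface damage might
    try:
        read_sector(bytes(damaged))
    except OSError as err:
        print(err)
```

A testing utility sees only the final error, which is why a sector with a mismatch is reported as ‘bad’ even though most of its bytes may still be intact.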

Statistics show that such defects usually do not progress. The reason is simple: most surface damage is not connected with the formation of micro-relief on the platter’s protective coating but consists of areas with altered magnetic properties. That is why the repair technique, i.e. hiding bad sectors, is based on entering the physical coordinates of defective sectors into the factory defect list, followed by factory-level internal formatting. This is the main method. In practice we use factory self-testing, which checks the drive thoroughly and records not only current but also potential defects in the defect lists. Where possible we also establish why the defects appeared, and before the final test the drive is serviced to prevent further failures.

If the ECC error is caused by a servo system failure, the situation is far more serious. By the way, besides ECC errors the controller also reports more serious conditions that are not handled by the BIOS and can be seen only in programs designed for low-level HDD testing. Usually these mean physical damage to a track or micro-scratches on it, accompanied by strange sounds during positioning such as buzzing, grinding and clicking. A defective track can be hidden, but the user is warned that failures and crashes are possible in future. Everything depends on the results of in-depth diagnostics, though. Sometimes such defects are caused not by physical surface damage but by detuned adaptive calibration parameters, which can be restored with special tools. Servo information cannot be erased by software without modifying the drive’s firmware. It cannot happen through a power failure either, because the circuit is protected; servo marks can be erased only if the electronics are modified in hardware.

Whether to sound the alarm over bad sectors depends on in-depth diagnostics. If the hard drive contains vital data, we recommend that you immediately BACK IT UP to another storage medium. Another symptom of physical damage is displacement of the disk stack caused by an impact, i.e. by exceeding the maximum permitted shock load. The servo system then fails: because of the resulting runout it often cannot hold position on the track, and the drive can start clicking. Here we should refute a common misconception that a displaced disk stack causes extra vibration that can be felt even by holding the drive in your hands. This is not correct. The disk is fixed to the spindle very tightly, so you cannot displace it enough to produce noticeable wobble even by hitting it hard. Track width in modern drives is around 1 micron, so even the smallest mechanical displacement is enough to misalign the tracks, and such misalignment cannot be felt by hand. Detuning of this kind is diagnosed only with electronic instruments, although some HDD models allow it to be checked by software (if the drive becomes ready and does not click).

Symptom No. 3: head crash. Heads are complex microstructures: in most cases the write coil is made by micro-etching, and the read element is magnetoresistive, also formed by deposition of a composite conductive material. Because of operation at elevated temperature, landing friction and other factors, the structure is vulnerable and breaks down, sometimes gradually. Therefore the more heads your drive has, the greater the chance that one of them will eventually fail.

The defect shows up as a simple inability to read or write, and unlike local physical defects, the failure appears over the whole surface. If the drive has more than one head, the head switching order during sequential reads varies, but in general it is snake-wise: first track 0 is read by head 0 and then by head 1, then track 1 by that same head 1, and only then track 1 by head 0, and so on in a cycle until there are no tracks left :) So if one head has crashed and the other works, the surface test result alternates between good and bad stretches.
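The snake-wise order and the resulting test picture are easy to reproduce. The Python sketch below generates the read order for a given number of tracks and heads and marks every read that falls on a failed head. It assumes the plain alternating scheme described above; as noted, the actual switching order differs between models, and both function names are hypothetical.

```python
def serpentine_order(tracks, heads):
    """Return (track, head) pairs in snake-wise sequential read order."""
    order = []
    for track in range(tracks):
        if track % 2 == 0:
            head_seq = range(heads)               # 0, 1, ..., heads-1
        else:
            head_seq = range(heads - 1, -1, -1)   # heads-1, ..., 1, 0
        order.extend((track, head) for head in head_seq)
    return order


def surface_test_pattern(tracks, heads, dead_heads):
    """Return one character per read: '.' for good, 'x' for a failed head."""
    return "".join("x" if head in dead_heads else "."
                   for _track, head in serpentine_order(tracks, heads))


if __name__ == "__main__":
    print(serpentine_order(3, 2))
    print(surface_test_pattern(4, 2, dead_heads={1}))
```

With two heads and head 1 dead, the pattern shows failures in regular pairs across the whole range rather than in one local cluster, which is what distinguishes a head failure from a surface defect in a test report.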

Sometimes it is not the head that is damaged but its slider, the bearing surface that touches the disk directly during landing. Its shape is designed to create an air cushion of the required thickness so that the head does not touch the spinning disk; otherwise the disk would be damaged instantly. The slider can hit the landing zone. To reduce friction, this zone is coated with a special polymer lubricant or given a laser-made micro-relief. That is why you should be extra careful when carrying a hard drive in a pocket or bag: as you walk, the drive works like a pendulum and gradually grinds down the slider, if only slightly. Either way, this shortens its lifespan. For this reason there are drives whose heads land off the disk, on a special holder-lock, so that when the drive is off the heads do not touch the surface at all. This mechanism comes from notebook drives, which constantly have to be ‘put to sleep’ to save energy. Transporting such drives is safe. The mechanism is present in virtually all IBM models.

In sleep mode the drive consumes practically no power and makes no sound, because once the operating system sends the relevant command the drive parks its heads and stops the spindle motor. As mentioned above, this mode is necessary in battery-powered devices. Drives in desktop systems are not designed for this mode, and it has to be said that they do not cope with it well. This is mainly due to the head landing mechanism and its electronic control. In the standard case, when power is removed, the HDD electronics use the kinetic energy of the platters, with the motor acting as a generator, to move the heads to the safe zone, i.e. to land them. The platters then stop quite quickly and friction does not do substantial damage to the heads. When the motor is stopped by a software command, as in sleep mode, the electronics simply switch off the motor and land the heads using current from the power bus, so the motor keeps spinning 2-3 times longer than in the standard case and head wear increases accordingly. That is why we do NOT recommend using the energy-saving mode, and advise turning it off after installing Windows, which enables it by default. Furthermore, a slider that has been ground down may change shape and roll in flight, producing a progressively growing scratch on the disk. That is precisely why we do not recommend powering on a failed drive if it contains important data. The next time you power on a failed drive, it may fail to calibrate and start clicking.

There is one more malfunction in which the drive roars and vibrates, and sometimes seizes and does not spin at all. It can be caused by a shock or a manufacturing defect. In such cases the bearings are often destroyed or their balls are dented. As a rule, such malfunctions cannot be repaired. For data recovery, however, specialists have a technique: they move the disks into a working enclosure and align their position. The procedure is difficult and expensive. The same applies to head replacement: the working donor drive is paid for by the client.

Also, drives of all kinds often show the following symptom: loud, monotonous clicking when the heads unpark or when certain disk areas are accessed. In the latter case we can say that there is physical damage or scratches on the surface, which can progress with every power-on. In the former case the problem cannot be categorized reliably without in-depth diagnostics. The clicking occurs because the head cannot settle on a track and keeps searching for a signal, travelling to the limit of its range, which produces the sound. There are several reasons why the servo system may fail to find the signal. The main ones, in order of priority, are:

1. Malfunction of the head’s read element. It can result from a crash of the head or its slider. In old drives it also happened that the slider became contaminated with ferromagnetic material from the disk coating and the head lost its aerodynamic properties. The air gap between slider and surface then changes and the signal is corrupted.

2. Malfunction of the head switch, the chip located on the actuator. It can happen in two cases. The first is a power supply fault (the computer’s power supply unit): even a brief 1.5-fold rise in supply voltage is sometimes enough for the switch to fail, and the controller fails along with it. The second is prolonged overheating, i.e. operation without cooling, which usually happens in hot seasons. Under heavy use the drive enclosure can overheat (this varies between HDD models) because of the controller board chips and the heated actuator voice coil.

3. Mismatch or loss of the adaptive settings or the physical configuration of the drive. This mainly happens when the settings are stored in non-volatile memory on the drive’s controller and the original controller has been lost or has failed. It also happens with the Fujitsu MPG or any IBM. Sometimes clicking appears if the service information was rewritten incorrectly, or if full calibration and the special factory tests were skipped.

4. Misalignment, i.e. displacement of the disk stack. See above.

5. Controller malfunction. Statistically this is rare, and it is easily diagnosed by temporarily swapping the controller, although for some HDD models a swap makes no sense and is even dangerous.

And finally, I would like to urge users NOT TO OPEN the hard drive enclosure: in most cases, opening a drive without special tools drastically degrades its performance and makes it unrepairable. Some drives fail immediately after the cover is removed.

So, we have considered the main problems of modern hard disk drives and some ways of preventing them. The vast majority of these recommendations apply to old drives as well.

Short summary. If you notice signs of an HDD malfunction, the first thing to do is back up the data to another storage medium or, if that is not possible, leave the drive powered off until it has been diagnosed by specialists who KNOW what they are doing. And although human psychology means that such simple advice is often ignored, we sincerely hope you never encounter any of the problems above, and if you do, we are always at your service.

Last modified on Thursday, 21 May 2015 20:50
Data Recovery Expert

Viktor S., Ph.D. (Electrical/Computer Engineering), was hired by DataRecoup, the international data recovery corporation, in 2012. Promoted to Engineering Senior Manager in 2010 and then to his current position, as C.I.O. of DataRecoup, in 2014. Responsible for the management of critical, high-priority RAID data recovery cases and the application of his expert, comprehensive knowledge in database data retrieval. He is also responsible for planning and implementing SEO/SEM and other internet-based marketing strategies. Currently, Viktor S., Ph.D., is focusing on the further development and expansion of DataRecoup’s major internet marketing campaign for their already successful proprietary software application “Data Recovery for Windows” (an application which he developed).
