Workshop on reanimating UNIX-like operating systems: methods of failure control in Linux and FreeBSD

Monday, 18 May 2015 00:00


UNIX-like operating systems are designed so that when they break, they do not try to repair themselves but honestly tell you what happened. What happens next depends on the skill of the computer's owner: a newbie immediately decides to reinstall, while a seasoned Linux user calmly boots a LiveCD, types a few commands in a terminal and restarts the computer with an ironic smile. The UNIX design is so simple and straightforward that the OS can usually be reanimated, whatever state it is found in.

Linux users usually face six classes of problems:

Boot. A zapped MBR, a forgotten root password.
Hardware. OS hangs, spontaneous reboots, kernel panics.
HDDs. A zapped partition table, hard disk failure.
Graphics subsystem. An incorrect xorg.conf, a missing video driver, sluggish graphics.
Drivers. Everything associated with unrecognized hardware.
Network. Incorrect network interface configuration, failed DNS resolution.

We will consider ways to deal with each of these problems.

When GNU/Linux refuses to boot

The MBR zapped by another operating system's boot loader is a problem that many online forums have already banned from discussion; it appears in numerous FAQs and has become a pain in the neck for experienced users. There is hardly a Linux newbie in our country who has never faced it. Meanwhile, the solution is very simple: just boot any Linux LiveCD, open a terminal window and type the sacramental command:

$ sudo grub-install /dev/sda

In most situations this command is enough to restore the boot loader. But if grub-install spews errors instead of finishing in thoughtful silence, things look bad, and we have to use the grub command line:

$ sudo grub

The "find /boot/grub/stage1" command, typed at the grub prompt, should display the name of the disk partition containing the /boot/grub directory. Then everything is simple:

grub> root disk_volume

grub> setup (hd0)

grub> quit

FreeBSD users suffer from a zapped loader far more rarely, but it sometimes happens to them, too. The MBR recovery procedure in this case is somewhat different:

Boot from the primary or the recovery disk of FreeBSD.
Select "Fixit" in the menu, then "CDROM/DVD".
Type "boot0cfg -o packet ad0 && exit" at the command line.
Press the Reset button on the system unit.

That's all for the MBR. Now let's talk about the forgotten root password. How UNIX users love to think up long, tangled passwords and then promptly forget them! And how they rejoice when they hear that two simple steps are enough to recover access: boot into single-user mode and remove the password from the user database with the vipw command.

In Linux, you enter single-user mode by passing the "single" option to the kernel. Select the required menu item in grub, press 'e', add the word "single" at the end of the line that appears and press <Enter>. The kernel boots and starts /bin/sh with root privileges. Run the vipw command, delete the asterisk in the password field of the root line, exit the editor and type "exit".

To enter single-user mode in FreeBSD, press '4' in the boot menu or type "boot -s" at the boot loader prompt.

Hardware problems

The kernel often fails to boot or operate correctly because of a poor ACPI implementation in the motherboard chipset or BIOS. OS developers are tired of arguing about it, and the Linux and FreeBSD kernels contain dozens or even hundreds of workarounds for motherboards with such annoying peculiarities. However, some time inevitably passes between a motherboard reaching the market and its errors being discovered, so do not assume that your recently bought buggy ASUS is already on the kernel's blacklist.

Problems with ACPI and the IO-APIC can show up in different ways: periodic OS hangs, a dead keyboard and mouse, the kernel message "MP-BIOS bug: 8254 timer not connected to IO-APIC". Most often, though, the "hardware bug" shows up during OS installation: the installer simply freezes while copying files.

Fortunately, this is easy to work around by disabling APIC and/or ACPI in the kernel. For Linux, select the required menu item in the grub boot loader, press 'e', add the word "noapic" at the end of the line that appears and press 'b'. To make the change permanent, open the file /boot/grub/grub.conf and add "noapic" to every line beginning with the word "kernel". If that does not help, disable ACPI completely with the option "acpi=off".
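The permanent grub.conf change can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name add_kernel_option is hypothetical, the file is assumed to use the GRUB Legacy layout in which boot entries contain lines whose first word is "kernel", and the function prints the modified file instead of editing it in place, so the result can be reviewed first.

```shell
# add_kernel_option OPTION FILE   (hypothetical helper, not part of grub)
# Prints FILE with OPTION appended to every line whose first word is
# "kernel", unless that line already contains OPTION as a separate word.
# The original file is left untouched.
add_kernel_option() {
    awk -v opt="$1" '
        $1 == "kernel" {
            present = 0
            for (i = 2; i <= NF; i++)
                if ($i == opt) present = 1
            if (!present) $0 = $0 " " opt
        }
        { print }
    ' "$2"
}
```

A possible way to use it: run "add_kernel_option noapic /boot/grub/grub.conf > /tmp/grub.conf.new", compare the two files, and only then copy the new one into place as root.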

For FreeBSD, it is enough to press '2' when the boot menu appears, and then make the change permanent by adding the line "hint.apic.0.disabled=1" to loader.conf:

# echo "hint.apic.0.disabled=1" >> /boot/loader.conf

Periodic operating system hangs or constant kernel panics may indicate that the RAM is on its way out. If hangs happen once every hour or half hour, probably only a few memory cells have failed. If an entire memory module has failed, the kernel will panic at the very next boot!

It is not difficult to check whether the memory is failing. The simplest way is to pack and unpack a large chunk of data, for example the kernel source tree:

$ tar -czf ~/src.tar.gz /usr/src && tar -xzf ~/src.tar.gz

Bad memory cells will cause checksum mismatches, and the archiver will report them immediately.
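The same idea can be repeated several times in a row to raise the chance of touching a bad cell. The sketch below is an assumption-laden illustration, not a real memory tester: the function name stress_archive is hypothetical, it packs a directory into a temporary archive and lets gzip verify the stored checksum with its -t option, and a clean run proves nothing about memory that was never used.

```shell
# stress_archive SRC_DIR [ROUNDS]   (hypothetical helper)
# Packs SRC_DIR into a temporary .tar.gz ROUNDS times (default 3) and
# checks each archive with "gzip -t", which verifies the stored checksum.
# Returns 1 on the first failed round, 0 if every round passes.
stress_archive() {
    src=$1
    rounds=${2:-3}
    work=$(mktemp -d) || return 2
    i=1
    while [ "$i" -le "$rounds" ]; do
        if tar -czf "$work/test.tar.gz" -C "$src" . &&
           gzip -t "$work/test.tar.gz"; then
            i=$((i + 1))
        else
            echo "round $i: archive check FAILED"
            rm -rf "$work"
            return 1
        fi
    done
    rm -rf "$work"
    echo "all $rounds rounds passed"
}
```

For example, "stress_archive /usr/src 10" would run ten rounds over the kernel sources, provided the temporary directory has enough free space for the archive.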

Another (more correct) way is to use the professional memtest86 tool. It is a self-contained utility that does not need an operating system to run. It is included in the grub menu of many distributions and Linux LiveCDs. Just restart your machine and select memtest86; the memory test will start automatically.

The memtest86 tool uses many different testing algorithms, so the check may take a long time. I recommend running memtest86 overnight: go to sleep, and in the morning see whether the output contains red lines, which signal bad cells.

Frequent spontaneous reboots, especially when running heavy applications or games, usually indicate overheating of the processor or graphics card. Check that the coolers work and replace them if necessary. If there is no time for that and the work must go on, try reducing the processor or graphics card frequency.

Many modern processors and motherboards allow changing the processor speed on the fly, without rebooting. A special interface is usually provided for this, located inside the /sys directory in Linux or in a sysctl branch in FreeBSD.
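As a rough illustration of the Linux side of that interface, the sketch below reads the current frequency from the cpufreq files under /sys. It assumes that a cpufreq driver is loaded and that the file scaling_cur_freq (a value in kHz) exists for the chosen CPU; the function name cpu_freq_mhz and its second argument, an alternative sysfs root used only for testing, are hypothetical.

```shell
# cpu_freq_mhz [CPU_NUMBER] [SYSFS_ROOT]   (hypothetical helper)
# Prints the current frequency of the given CPU in MHz, read from the
# Linux cpufreq interface. SYSFS_ROOT defaults to /sys.
cpu_freq_mhz() {
    cpu=${1:-0}
    root=${2:-/sys}
    f="$root/devices/system/cpu/cpu$cpu/cpufreq/scaling_cur_freq"
    if [ ! -r "$f" ]; then
        echo "no cpufreq interface for cpu$cpu" >&2
        return 1
    fi
    khz=$(cat "$f")          # the kernel reports the value in kHz
    echo $((khz / 1000))
}
```

On systems where this interface is present, the upper limit can typically be lowered by writing a value in kHz to the neighbouring scaling_max_freq file as root. In FreeBSD the corresponding sysctl is usually dev.cpu.0.freq, again depending on the hardware and the loaded drivers.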

The cross-platform nvclock utility is commonly used to change the frequency and other characteristics of NVIDIA graphics processors. Run it with the '-s' option to find out the current GPU frequency:

# nvclock -s

And then set it about 100 MHz lower (the '-n' option takes the new GPU clock in MHz):

# nvclock -n 300

HDD Failure

Experienced users know that the range of problems with hard drives is very wide and extends from mechanical damage to an accidentally zapped partition table. In some cases the hard drive can still be reanimated, but in most cases it is either already dead or close to death.

To avoid unpleasant surprises, experts advise checking hard drives periodically with utilities that display the statistics of S.M.A.R.T., the self-monitoring system built into the drive. *nix systems have such tools too, the best known of which is smartctl.

The smartmontools package, which contains the smartctl utility, is pre-installed in almost any Linux distribution and is available through the ports system (sysutils/smartmontools) in FreeBSD.

Let's run smartctl:

# smartctl -A /dev/sda

In the table that appears on the screen, we are interested in only two lines: Reallocated_Sector_Ct and Temperature_Celsius. The last column of the first shows the number of reallocated sectors. A value other than zero indicates a problem: the disk has begun to fail, and the number of reallocated sectors will continue to grow. The last column of the Temperature_Celsius line contains the current temperature of the hard drive, which should not exceed 50 degrees (36-45 degrees is ideal).
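The two checks just described can be automated with a small filter over the smartctl output. This is a sketch under assumptions: the function name smart_check is hypothetical, and it relies on the usual "smartctl -A" table layout in which the attribute name is the second column and the raw value is the tenth. Drives that report these attributes under other names or in another layout will simply produce no warnings.

```shell
# smart_check   (hypothetical helper)
# Reads "smartctl -A" output on standard input and warns if the raw
# value of Reallocated_Sector_Ct is above 0 or the raw value of
# Temperature_Celsius is above 50. Exit status is 1 if anything was
# reported, 0 otherwise.
smart_check() {
    awk '
        $2 == "Reallocated_Sector_Ct" && $10 + 0 > 0 {
            print "WARNING: " $10 " reallocated sectors"
            bad = 1
        }
        $2 == "Temperature_Celsius" && $10 + 0 > 50 {
            print "WARNING: drive temperature " $10 " C"
            bad = 1
        }
        END { exit bad + 0 }
    '
}
```

It would typically be used as "smartctl -A /dev/sda | smart_check", for instance from a periodic job that mails its output.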

S.M.A.R.T. values are just numbers that do not always correspond to the real state of the hard drive. Moreover, research conducted by Google showed that in 60% of cases the probability of a drive's death has nothing to do with S.M.A.R.T. values, and the only more or less reliable indicator is the number of reallocated sectors.

But what if the drive is almost dead, and the data cannot be retrieved because of repeated read errors or head positioning failures? When you try to copy files, the kernel fills dmesg with I/O error messages and the cp command simply returns an error. The first step is to unmount the partition and try to move the data to another hard drive using dd (hereafter /dev/sda is the failing drive and /dev/sdb the new one):

# dd if=/dev/sda of=/dev/sdb conv=noerror,sync

If the number of bad sectors on the disk is small, dd will copy the disk, filling the problem areas with zeros. After that, all that remains is to run fsck on all file systems and work with the new drive.

Unfortunately, dd does not always work. In some cases the disk is so damaged that the bad areas stretch for hundreds of thousands or even millions of sectors in a row! You would have to wait several days for dd to finish, during which the failing drive might easily die.

The best minds of the world advise using the special dd_rescue utility, which can copy a drive in two directions: the first pass runs forward from the beginning, the second backward from the end. As a result, the new disk will contain all the data except for the problem area.

Let's make the first pass:

# dd_rescue -v -y 1G -l sda.log -o sda.bb \
    /dev/sda /dev/sdb

When the disk starts rustling madly, press <Ctrl+C> to stop copying, and start the copy process from the end:

# dd_rescue -r -v -y 1G -l sda.log -o sda.bb \
    /dev/sda /dev/sdb

Stop the copy process after a long rustling and disconnect the dying drive.

Another problem is the loss of the partition table, which until quite recently was solved with a hex editor. Today it is easier to use the gpart utility:

# gpart -W /dev/sda /dev/sda

testdisk, an alternative to gpart, is a more powerful and flexible utility with a pseudo-graphical interface.

The moods of Mr. X

X.org has recently become an order of magnitude smarter, and problems with it are no longer a serious obstacle. The X server can now automatically detect input devices and select the correct resolution and refresh rate for the monitor. In many distributions you don't have to configure it at all: the installation scripts generate the proper configuration by themselves.

But the X server still glitches from time to time, and the user or the package update system is often to blame. If after booting you see a boring black console instead of the usual login window, the server startup ended with errors. There could be a hundred reasons for this, ranging from missing drivers to problems with the /tmp directory. The most sensible thing to do is to restart the X server with the startx command and see what errors it prints. In most cases this is enough to diagnose the problem, but if the cause of the failure remains a puzzle, refer to the file /var/log/Xorg.0.log for a detailed explanation:

# grep '(EE)' /var/log/Xorg.0.log

When writing logs, the X server marks all errors with the "(EE)" marker, so the above command returns only the records that indicate problems.

If you feel unable to correct the errors yourself, just run the "X -configure" command, which generates a new X.org configuration file.

Besides failing, the X server may simply be sluggish. In this case the video driver is to blame, not the user or the distribution. Modern graphics toolkits and some desktop environments (e.g. KDE4) hand graphics rendering over to the graphics accelerator, which results in poor performance on systems whose video drivers do not support 2D/3D acceleration; the standard "nv" driver for NVIDIA cards is one example. To solve the problem, go to nvidia.com and download the latest driver for your operating system, or install it through the package management system.

A missing driver

Modern versions of Linux and FreeBSD are packed with drivers, even for the most exotic hardware. The days when we had to choose PC components to suit these operating systems are over. Today Linux and FreeBSD install easily on any modern server, home computer or laptop and do not require special configuration. The only snag is that drivers for brand-new hardware appear with some delay, which is quite natural but annoying.

If your newly bought hardware shows no signs of life, the kernel did not pick it up when booting. This can happen in two situations: either the driver was not loaded during system initialization, or there is no driver for the device in the kernel or its modules. In either case, we should query the PCI bus for the devices found and the drivers loaded. You can use the lspci utility in Linux or the pciconf utility in FreeBSD:

linux# lspci -v

freebsd# pciconf -l -v

On the screen you will see all devices found during kernel initialization and the modules (drivers) assigned to them. In the lspci output the module name is displayed in the "Kernel modules:" line; in the pciconf output it is at the beginning of the first line of each device.

In the pciconf output, for example, the word "nfe0" at the very beginning of a line is the device name (a NIC) and shows that the nfe driver is assigned to it. If you see the word "none" instead of a name, the kernel has not loaded an appropriate driver, and it is time to set out in search of one. Type the full name of the device and the name of the operating system into Google, and you will find either the name of the required module or a message saying that the device is not yet supported by the kernel.
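On Linux, the search for driverless devices can be narrowed down with a filter over the lspci output. The sketch below makes several assumptions: the function name devices_without_driver is hypothetical, and it expects the layout printed by recent pciutils, where each device starts on an unindented line and a bound driver is reported on an indented "Kernel driver in use:" line (the "Kernel modules:" line lists candidate modules and may be present even when none is bound). Some devices, such as bridges, may legitimately have no driver.

```shell
# devices_without_driver   (hypothetical helper)
# Reads "lspci -v" or "lspci -k" output on standard input and prints the
# header line of every device that has no "Kernel driver in use:" line.
devices_without_driver() {
    awk '
        /^[^ \t]/ {
            # A new device block starts: report the previous one if needed.
            if (dev != "" && !drv) print dev
            dev = $0
            drv = 0
        }
        /Kernel driver in use:/ { drv = 1 }
        END { if (dev != "" && !drv) print dev }
    '
}
```

A typical invocation would be "lspci -k | devices_without_driver"; each line it prints is a candidate for the driver search described above.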

If the search for a driver comes up dry, the only thing left is to wait for a new version of the kernel/OS and hope that it supports your hardware. Owners of unsupported network adapters can try their luck with the NDISWrapper framework, a Linux kernel module that implements a compatibility layer for NDIS (Network Driver Interface Specification) drivers designed for Windows.

Install the ndiswrapper package, copy the folder with the official Windows driver from the disk, find the INF file in it and run this command:

# ndiswrapper -i driver.inf

Check that the driver has been installed:

# ndiswrapper -l

Is everything all right? Load the module and configure the network:

# modprobe ndiswrapper

Patching the network

Network connection problems are a scourge for Linux newbies. Most user-friendly distributions find network interfaces by themselves and try to configure them via DHCP, but this does not always work. First of all, run the "dmesg | less" command in Linux or "less /var/run/dmesg.boot" in FreeBSD and find in the output the network adapter that you use to access the Internet or LAN. For example:

nfe0: <NVIDIA nForce2 MCP2 Networking Adapter> port …

The first word is the network interface name (in Linux it will be eth0 or eth1). Run the ifconfig command and look for this name in its output. If the name is missing, the interface is inactive; if there is no "inet" line in its section of the output, no IP address has been assigned to it. You can activate the interface with the command:

# ifconfig interface inet IP-address netmask network-mask up

A default gateway is usually not required for access to the local network, so the local network should become reachable as soon as this command has run. If a default gateway is needed after all, run the following command (Linux syntax; FreeBSD's route command takes the gateway address without the word "gw"):

# route add default gw IP-gateway
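The interface check described earlier (is the name present in the ifconfig output, and does its section contain an "inet" line) can be sketched as a filter. The function name iface_state and its three result strings are hypothetical; the sketch assumes that each interface section starts on an unindented line beginning with the interface name, optionally followed by a colon, which matches the ifconfig output of FreeBSD and of the Linux net-tools package.

```shell
# iface_state NAME   (hypothetical helper)
# Reads ifconfig output on standard input and prints one of:
#   inactive    - NAME does not appear in the output
#   no address  - NAME appears but its section has no "inet" line
#   configured  - NAME appears and has an "inet" line
iface_state() {
    awk -v want="$1" '
        /^[^ \t]/ {
            # Unindented line: start of a new interface section.
            name = $1
            sub(/:$/, "", name)
            if (name == want) found = 1
        }
        name == want && $1 == "inet" { has_inet = 1 }
        END {
            if (!found)         print "inactive"
            else if (!has_inet) print "no address"
            else                print "configured"
        }
    '
}
```

For example, "ifconfig | iface_state eth0" on Linux or "ifconfig | iface_state nfe0" on FreeBSD. Only the result "no address" or "inactive" calls for the manual ifconfig command shown above.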

Your ISP may use a PPPoE or PPTP server to provide access to the Internet. Setting up such connections was described in detail in the article "Break through the PPP", published in the May 2008 issue of H. Now I say goodbye. Good luck!

INFO

A failing but still operable HDD is good enough for storing temporary data. To use it this way, zap the partition table and create a new partition on the disk area that escaped destruction.

To remove the boot screen and see the Linux initialization process in all its beauty, just delete the "quiet" and "splash" options from the line that appears when you press 'e' in the grub boot loader.

WWW

Rescue boot disk - www.sysresccd.org.

Partition table restorer gpart - www.brzitwa.de/mb/gpart.

TestDisk, a universal utility to recover anything and everything - www.cgsecurity.org/wiki/TestDisk.

Data rescuer dd_rescue - www.garloff.de/kurt/linux/ddrescue.

nvclock utility - www.linuxhardware.org/nvclock.

NDISWrapper layer for running Windows network drivers - sourceforge.net/projects/ndiswrapper.


Last modified on Monday, 18 May 2015 19:10

Data Recovery Expert

Viktor S., Ph.D. (Electrical/Computer Engineering), was hired by DataRecoup, the international data recovery corporation, in 2012. Promoted to Engineering Senior Manager in 2010 and then to his current position, as C.I.O. of DataRecoup, in 2014. Responsible for the management of critical, high-priority RAID data recovery cases and the application of his expert, comprehensive knowledge in database data retrieval. He is also responsible for planning and implementing SEO/SEM and other internet-based marketing strategies. Currently, Viktor S., Ph.D., is focusing on the further development and expansion of DataRecoup’s major internet marketing campaign for their already successful proprietary software application “Data Recovery for Windows” (an application which he developed).
