Do not Panic! Remote Server (Hetzner) not rebooting any more – A Solution

Note: This post is 9 years old. Some information may no longer be correct or even relevant. Please, keep this in mind while reading.

I went through this experience recently. First of all, don’t panic! I panicked, and because of this, I made a mistake: I didn’t wait long enough for the server to come back online. Had I waited up to 60 minutes, it probably would have (see the reason below). The story:

I had broken packages on my Ubuntu 10.04 server and decided to fix them by running:

apt-get update
apt-get upgrade

While upgrading, I noticed that the package grub-pc was also upgraded; apparently a new bootloader (or a new bootloader configuration) was installed. This made me feel uncomfortable, since I didn’t know whether the server would reboot cleanly after this upgrade. So, because of the saying “The devil you know is better than the devil you don’t know” (and a desire to sleep peacefully at night!), I decided to reboot the server and see what would happen. To my big dismay, it did not come back online. SSH connections failed with “Port 22: Connection refused”.

I panicked and asked Hetzner Support (which is very responsive and helpful, by the way!) to hook up a LARA remote console so that I could see the text output of the boot process. After some regular startup text, the screen went blank. I panicked even more and took immediate steps to move all data to a new server. It took me 6 hours to complete the most important parts, and another full week of restless work to finish it. As we will see, this was not necessary.

Most regular servers at the Hetzner data center run software RAID. It seems that after a reboot (especially if you send a hardware reset) the OS needs some time to re-sync the array or check the file system. I am not sure what caused the delay in my case. Re-syncing an entire RAID array can take 1-2 hours, depending on your hardware and disk size.
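
If you can get a console (for example via LARA or the rescue system), a quick way to see whether a software RAID re-sync is still running is to look at /proc/mdstat. This assumes a standard Linux md setup, as on typical Hetzner machines:

cat /proc/mdstat
watch -n 5 cat /proc/mdstat

The first command shows the current state of the arrays; the second refreshes it every five seconds so you can follow the re-sync progress.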

So, wait at least 1 hour for it to come online, especially after a hardware reset! When you activate Hetzner’s Rescue System (which is very good, by the way!) it stays active for a minimum of 1 hour, so your server will be down for at least 1 hour in any event. You are not losing much by waiting a bit longer.

Now, in my case, I assumed that Grub2 was broken. So I activated the Hetzner Rescue System, booted into it, and reinstalled Grub2. I found the following method here and it worked for me. First, you have to mount the regular RAID filesystem under /mnt:

mount /dev/md2 /mnt
mount /dev/md1 /mnt/boot
mount -o bind /dev /mnt/dev
mount -o bind /proc /mnt/proc
mount -o bind /sys /mnt/sys
chroot /mnt

At this point, you are in your regular root directory. To reinstall Grub2 the Debian Way, I did:

apt-get install --reinstall grub-pc

To make really sure, I reconfigured the package:

dpkg-reconfigure grub-pc

It will ask you where to install the bootloader. I selected:

[*] /dev/sda
[*] /dev/sdb
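
As an alternative (or in addition) to the dpkg-reconfigure dialog, you can also write GRUB to both disks of the RAID-1 pair by hand from inside the chroot. The device names below are the ones from my setup; adjust them if yours differ:

grub-install /dev/sda
grub-install /dev/sdb
update-grub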

No errors were reported. I rebooted again and it did not come online immediately, for the reasons previously mentioned. I waited long enough (in my case, 15 minutes) and it did come online. So, rule number one is: Don’t panic!

A solution for MySQL Assertion failure FIL_NULL

A defective RAM module recently caused data corruption in MySQL tables. MySQL would log the following to /var/log/syslog at regular intervals, about every few minutes:

140125 5:04:41 InnoDB: Assertion failure in thread 140046518261504 in file fut0lst.ic line 83
InnoDB: Failing assertion: addr.page == FIL_NULL || addr.boffset >= FIL_PAGE_DATA
InnoDB: We intentionally generate a memory trap.

Reading the MySQL documentation and various blogs didn’t help much. I ran CHECK TABLE on all the tables and they all reported OK. Then I ran

mysqlcheck --all-databases

and still all tables reported OK. Nevertheless, the assertion failures continued. Then I stumbled across this excellent blog post, which suggested dumping everything into a .sql file, wiping /var/lib/mysql (not without making a backup, mind you!), and re-importing everything from scratch. This is what I did, and it worked.

I recorded the passwords for each database and collected them into an SQL script like this:

GRANT ALL PRIVILEGES ON database_name1.* TO 'username1'@'localhost' IDENTIFIED BY 'password1';
GRANT ALL PRIVILEGES ON database_name2.* TO 'username2'@'localhost' IDENTIFIED BY 'password2';
-- etc.

When you dump all databases into a .sql file, it will not dump the permissions, so you will need to restore them later with this script. Next comes the dumping part, then the removal and reinstallation of MySQL (danger here: when you remove mysql-server, all packages which depend on it will also be removed!):

cd ~
mysqldump -u root -ppass --all-databases > alldbs.sql
cd /var/lib
cp -vpr mysql mysql-backup
apt-get remove --purge mysql-server-5.5
apt-get install mysql-server
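
If you want to see in advance which packages the remove step would drag along, apt-get can do a dry run first; the -s (--simulate) flag only prints what would happen:

apt-get -s remove --purge mysql-server-5.5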

Here, I had to reset the MySQL admin password because it didn’t work any more, so I ran:

dpkg-reconfigure mysql-server-5.5

Then, I re-imported all databases from the dump file:

cd ~
mysql -u root -ppass < alldbs.sql

Then I ran the SQL permission script mentioned above.
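
Assuming the GRANT statements were collected in a file called grants.sql (the file name here is just an example), applying them works the same way as the re-import:

mysql -u root -ppass < grants.sql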

For me, this resulted in no more Assertion Failures. Yay!

Adventures with various Segfaults due to defective RAM

If you get various segfaults on your Linux server, like these:

spamd child[2656]: segfault at 200251c208 ip 00007fa039223684 sp 00007fff77953680 error 4 in libperl.so.5.14.2[7fa03916a000+177000]

or:

clamd[3311]: segfault at 1000000008 ip 00007f00200b3751 sp 00007fff3e2cef60 error 4 in libclamav.so.6.1.17[7f001fff1000+988000]

or

php5[14914]: segfault at 7fff7d2939c8 ip 00000000006bf04d sp 00007fff6d293860 error 6 in php5[400000+6f3000]

or

PassengerHelper[11644]: segfault at ffffffffca4ef420 ip 0000000000492fea sp 00007f5b81e991d0 error 7 in PassengerHelperAgent[400000+203000]

etc. etc., then, no, your system is not suddenly crazy. Nor are you. It is highly likely that your RAM is defective. You should reboot your server and run the RAM test from your boot manager (on Debian/Ubuntu the GRUB menu usually includes a Memtest86+ entry) to see if it can detect faulty RAM.

If you are operating a server that you can’t reboot because you can’t tolerate downtime, there is an excellent tool called memtester, which tests memory on a running system. It is part of the Debian distribution; install it with apt-get install memtester. Check top to see how much free RAM is available. Say you have 10 GB of RAM free; then ask memtester to test 8 GB of it (so that 2 GB remain free for the running system to operate). In my case, memtester indeed detected faults.
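
To pick a safe size, you can check how much memory is free before starting; free shows the same information as top in a more compact form (on older systems without the -h flag, free -m prints the numbers in megabytes):

free -h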

I ran

memtester 8000 3

The output looked something like this (a bare number like 8000 is interpreted as megabytes, and 3 is the number of test passes):

Loop 1/3:
Stuck Address : ok 
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : FAILURE: 0xffffffffffffffff != 0xfffffffbffffffff at offset 0x36e77910
Block Sequential : ok 
Checkerboard : ok 
Bit Spread : ok 
Bit Flip : ok 
Walking Ones : ok 
Walking Zeroes : ok 
8-bit Writes : ok
16-bit Writes : ok

So, when I replaced the RAM, the segfaults stopped. You can run memtester regularly to make sure the RAM is okay. Healthy RAM is a very crucial part of your successful hosting operation!
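
One way to run it regularly is a small cron job. This is only a sketch; the location, size, and schedule are assumptions you would adapt to your system:

#!/bin/sh
# /etc/cron.weekly/memtester-check (example location)
# Test 1 GB of free RAM for one pass and keep the result in a log file
memtester 1024 1 > /var/log/memtester.log 2>&1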

In my case, however, the segfaults corrupted MySQL tables, which I had to clean up. All’s well that ends well!