Friday, December 19, 2008

DB Administration in MySQL is a Cruel Joke

The implementation of ALTER TABLE in MySQL/InnoDB is a cruel joke.  Except for renaming a table, it's implemented as: (1) do a CREATE TABLE to make a new, empty table with the altered definition, (2) INSERT all the rows from the existing table into it row-by-row, and then (3) drop the original.
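In effect, every non-rename ALTER behaves like this hand-rolled copy (a sketch; the table names are illustrative, not anything MySQL actually names):

```sql
-- What ALTER TABLE effectively does under the hood:
CREATE TABLE t_copy LIKE t;            -- (1) new empty table, with the altered definition
INSERT INTO t_copy SELECT * FROM t;    -- (2) copy every row, one at a time,
                                       --     rebuilding every index as it goes
RENAME TABLE t TO t_old, t_copy TO t;  -- swap the new table into place
DROP TABLE t_old;                      -- (3) throw away the original
```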

So adding a new index to a table means rebuilding ALL the indexes on the table.  And it's done row-by-row: each row gets inserted into every index's B-tree one at a time.  As a way to produce a sorted index, that's a super-bogus sorting algorithm.

Want to add a COMMENT to a column?  Sorry, you have to rebuild the table and all its indexes.

Want to add a new column?  Sorry, you have to rebuild the table and all its indexes.

Any way to parallelize all this rebuilding?  Absolutely not!!!!  No economy of scale whatsoever.  Everything's row-by-row.

There's a reason why folks fork out the big bucks for Oracle Enterprise.  No big bucks available?  Or want to invest them in your business rather than shipping them off to Oracle?  Have a stiff drink and try to relax.

"Scale Out" is not Scalability

MySQL, and InnoDB in particular, don't scale.  Don't plan on running a boatload of concurrent transactions against your MySQL db.  I'm hoping things are better in 5.1, but they are dismal in 4.1.

There's just way too much "semaphore" contention.  Too many global mutexes blocking threads that really aren't stepping on each other.  Too many global mutexes held way too long.  Disk I/O done by a thread holding a mutex!

MySQL AB claims their product scales.  Their prescribed technique is "scale out".  But their "scale out" scheme is really a workaround for the fact that their product doesn't scale.

There's a reason folks pay big bucks for Oracle Enterprise.  Meanwhile, we entrepreneurial types trying to compute on the cheap have to grin and bear it.

Brain-Damaged Schema Locking

I have no clue what the motivation is behind MySQL's schema-level locking algorithm.  It seems to lock things when it doesn't need to, but doesn't when it does.  These comments apply to versions 5.1 and earlier.

Case in point:  a table can be dropped while a transaction that updated it is still open.

This is a big problem for replication; it stops replication in its tracks.  The open transaction's SQL gets written to the binary log only when it COMMITs, but the DROP TABLE gets written to the binary log immediately.  So the DROP TABLE lands in the binary log before the INSERT/UPDATE/etc. statements that manipulated the table.  Needless to say, when the slave encounters those statements, replication dies with a nonexistent-table error.
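Concretely, the fatal interleaving looks like this (a sketch with two sessions; the table name is made up):

```sql
-- Time runs downward.  Binary-log order is shown in the comments.
--
--   Session A                        Session B
--   BEGIN;
--   INSERT INTO t VALUES (1);
--                                    DROP TABLE t;   -- logged to the binlog NOW
--   COMMIT;                          -- A's INSERT is logged only now, AFTER the DROP
--
-- The slave replays "DROP TABLE t" first, then hits the INSERT against a
-- table that no longer exists, and replication halts.
```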

Why doesn't the schema locking algorithm block the DROP TABLE until the open transaction terminates?  That's what would make sense to me.  (Psst, that's the way Oracle works.)

(This problem with DROP TABLE is a MySQL bug reported back in 2003!)

Tuesday, October 28, 2008

LVM vs. MD raid-0

Raid-0 is good when:  (1) you have a low budget and want the maximum performance and capacity bang for the buck; (2) the most up-to-date version of your data is expendable, such as might be the case for a read-only MySQL replication slave, where there are other up-to-date copies elsewhere (the master); and (3) you can survive a relatively lengthy outage if a disk fails (replacing a disk and restoring a backup is not fast).

I work for a start-up trying to squeeze blood out of every IT dollar, and Raid-0 has worked out well.

On Linux, there are two software Raid-0 solutions available: LVM and md.  I ran into a simple case where md performs significantly better than LVM.

I put eight 1TB SATA 7200 rpm 3.5" drives into a SAS enclosure directly attached to a SAS HBA card in the server.

This results in eight "/dev/sdXX" drives magically appearing in Linux.

First I did "cat /dev/sdXX >> /dev/null &" for each of the eight drives and used "iostat -x -k 3" to watch what was happening.  I saw about 67MB/s being read from each drive.

Then I striped the 8 drives together using md via:

mdadm --create /dev/md0 --level 0 --chunk 2048 --raid-devices 8 /dev/sdXX /dev/sdXY...

Then I did "cat /dev/md0 >> /dev/null" and used iostat again.  I saw about 66MB/s being read per drive.  Not much different from before.

I killed off the md raid-0, and then used LVM to create a raid-0:

vgcreate --physicalextentsize 1024M tbraid0 /dev/sdXX /dev/sdXY...
lvcreate -i 8 -I 2048 -l 7448 tbraid0 -n vol
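For the record, the numbers in that lvcreate line work out as follows (a sanity check on my part, assuming the 1024M physical extent size set in the vgcreate): -i 8 stripes across the eight drives, -I 2048 is the stripe size in KB (matching mdadm's --chunk 2048), and -l 7448 allocates 7448 one-GiB extents:

```shell
# 7448 extents x 1 GiB/extent, striped over 8 drives.  Each "1 TB" drive is
# roughly 931 GiB, so this fills essentially the whole enclosure.
echo "$((7448))GiB total, $((7448 / 8))GiB per drive"
```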

I then did "cat /dev/mapper/tbraid0-vol >> /dev/null" and used "iostat" again.  I only saw about 41MB/s per drive being read.

So the raw LVM raid-0 block device imposed a substantial overhead relative to what the hardware is capable of.

(P.S. I tried different chunk sizes for the raid-0 under LVM, and it did make a difference: a 512K chunk size resulted in about 53MB/s per drive; 64K, about 38MB/s.)

The version of RedHat I used:

# uname -a
Linux 2.6.9-78.0.1.ELsmp #1 SMP Tue Aug 5 10:56:55 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

Monday, September 29, 2008

ERROR 1033: Incorrect Information in File

I ran into a problem today I hadn't seen before.

I had a db crash.  (Actually, I crashed it on purpose, but that's another story.)  When it came back up, everything seemed fine until I tried doing "show create table foobar" and got an error, "ERROR 1033: Incorrect information in file db/foobar".

Turned out the "foobar.frm" file was completely empty: 0 bytes long.

So the front-end, which is non-transactional, got interrupted after creating the file but before writing anything into it.  All our tables are stored in InnoDB, which is transactional, but that doesn't save you from the front-end leaving half-done junk around after a crash.

The solution?  I just deleted the bogus "foobar.frm" file.
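If you hit a crash like this, it's worth sweeping the data directory for other casualties.  Here's the sort of one-liner I'd use, wrapped up for illustration (the helper name and the scratch-directory demo are mine, not MySQL tooling):

```shell
# find_empty_frm DIR: list zero-length .frm files under DIR.  A 0-byte .frm
# means a table-definition file was created but never written -- exactly the
# half-done junk a crash leaves behind.
find_empty_frm() {
    find "$1" -name '*.frm' -size 0 -print
}

# Demo against a scratch directory standing in for the MySQL datadir:
d=$(mktemp -d)
touch "$d/foobar.frm"                # truncated definition, as after the crash
printf 'stuff' > "$d/ok.frm"         # healthy (non-empty) definition file
find_empty_frm "$d"                  # prints only the foobar.frm path
rm -r "$d"
```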