Monday, April 24, 2006

Optimizing MySQL Applications Using the Pluggable Storage Engine Architecture (Arjen)

I am sitting in Arjen's session titled "Optimizing MySQL Applications Using the Pluggable Storage Engine Architecture." Arjen has been covering the basics of how databases work and touching up on issues related to different storage engines.

[Notes that I need to transfer (blogger wasn't working)]

Key Buffers: Are only for MyISAM tables and not for InnoDB tables. Look at the current values
Key reads / Key read requests ratio should be 0.03. If the ratio is higher than this (.10) then increase the buffer. If its better than this (0.1) then we are wasting memory. If at 0.1, decrease it a bit.

The default settings are quite low on Linux. We can have multiple MyISAM key caches.

We can load MySQL key buffers into memory.

Different values of myisam_recover which can either be QUICK or BACKUP (If some corrupted tables are found, they are first backed up).

Arjen recommends that not to use external locking and then repair the table (answering a question). He says to lock tables using LOCK command and then perform operations.

table_cache (defaults to 64). This deals with file handles which tells the maximum number of open table files. If joining three tables, three MYD files would be opened. Even if we join the same table, two files would be opened so a pointer can be obtained.

myisam_sort_buffer_size: increase this for index builds. A trick is to increase and decrease it again when not needed.

myisam_max_[extra]_sort_file_size: By increasing this we can speed things up.

MyISAM local variables: allocated for each individual connection to the server.

read_buffer_size: used when reading chunk of data. Get this to be as big as we can afford to have. Think about altering it before SELECTs and then decrease it to original as this will be local.

sort_buffer_size: For GROUP BY and ORDER BY operations. If MySQL needs more value than this then it will start using disk (bad). Increase sort buffer before query and then decrease it.

tmp_table_size: It's a limit in bytes. While processing queries, tmp tables are created which are of type HEAP. When we run out of the specified size set by this variable, the table type is changed from HEAP to MyISAM.

Make the local variables bigger than default but don't make them huge.

Never make your database server swap as it will cost you greatly.

How to optimize MyISAM tables?
  • Normalize
  • separate static and dynamic storage format tables. (don't worry about joins as databases are made for that)

If accessing variable and fixed length data equally then it makes no sense to different separate them.

Avoid NULLS and declare the columns to be NOT NULLs whenever possible.

Inside MyISAM structure we waste resources when using NULLs as many IF statements that are used to check whether the value is NULL can add up.

When performing batch deletes, updates, inserts, consider first disabling indexes and then enabling them.

Q & A: myisamchk updates the table stats when it runs.

Its not really a separate storage engine but instead a layer.

Why use MERGE tables?
Because of the filesystem limits. To store big tables, they must be splitted up to keep them manageable. That's when MERGE tables come in.

MERGE tables are very portable and their files can just be copied over. We can only perform SELECT, DELETE, UPDATE and INSERT statements when using MERGE tables.

We need more file descriptors which is a disadvantage. Also the index reads are slower and no global indexes are available.

MEMORY storage engine.

Only the structure of MEMORY tables (.FRM) is stored on disk. Data is volatile since it resides in RAM. MEMORY tables feature table locks. The size of MEMORY tables is specified using max_heap_table_size. Using a trigger, we can automatically change the table type from MEMORY to MyISAM using COUNT(). MEMORY tables are used primarily for caching, temporary tables and buffer tables. The primary options for indexing MEMORY tables are HASH (Hash Indexes) and BTREE (Red-black binary trees). Important to note that BTREE here isn't the same as BTREE used for MyISAM tables.

BTREE takes more space, supports range searches and is faster in cases where we have lots of duplicates.

HASH index is used by MEMORY tables by default.

InnoDB Storage Engine: It is transactional, ACID, supports foreign key constraints and performs data checksums to verify its integrity.

Has row level locking with versioning. Larger memory and on disk footprint. MySQL 5 has data compression. Likes more memory and if enough memory is given, it is pretty much equal in performance than MyISAM. In certain cases we can make InnoDB faster than MyISAM but that would require a lot of RAM.

It is good for apps with heavy updates. In case of server crash, the repairing is done automatically by InnoDB in the background. For primary keys, it offers faster access by primary keys and is better in memory performance. ALso offers row-level locking (SELECTs don't LOCK).

For InnoDB , always have PRIMARY key as short as possible (don't use VARCHAR). Even CHAR(30) is really really bad.

Transactions are run in memory. InnoDB knows the difference between committed and uncommitted page.

innodb_buffer_pool_size: set to as large as possible (think 60-80% of unused memory).

In write intensive system, have a large buffer pool.

innodb_log_files_group: (used to be 3)

innodb_log_file_size specifies the size of each log file. Setting it too high will increase the time that goes in crash recovery. Recommended to set it to 50% of innodb_buffer_pool_size.

innodb_log_buffer_size: Recommended to set between 1M and 16M (default: 8MB)

innodb_flush_method: by default it uses fsync()

innodb _thread_concurrency: number of threads running in the background.

InnoDB Performance:
If possible, insert in the order of primary keys.
Sort the data externally before loading/inserting it in InnoDB
Can optionally set FOREIGN_KEY_CHECKS=0 (risky)
Use Prefix keys i.e. INDEX (lastname(3)) to optimize (relatively new in InnoDB). This feature works with MyISAM too.

NDB Cluster Storage Engine: (Not a general purpose storage engine, redundant and very fast for some purposes)
Designed for high availability and scalability (if used with load balancing). NDB Cluster Storage Engine also features synchronous replication and two-phase commit (can we commit?). Only commits when everything is ready. Performs internal replication and distributes table to several nodes. Offers multiple ways to be fast and redundant.

FEDERATED Storage Engine:
Offers the ability to access remote MySQL servers as if they were local. Very useful when using with triggers. MySQL is looking at offering hetrogeneous connectivity in future using JDBC for Federated tables.

ARCHIVE Storage Engine: If data is more than 1.5 GB then the table scans are much faster than MyISAM. ARCHIVE Storage engine is recommended for logging, auditing and backups. All data stored is compressed using zlib and offers sequential storage. We can only insert and select (performed using full table scans).

Any engine can be replicated.

CSV Storage Engine is not a general purpose storage engine. Ideal for storing CSV files.

Overall a great session. I will be adding more notes and clarifying the above points as I get time ;)

No comments: