
I have a MySQL table with the potential for millions of rows of data - in some extreme cases up to 100 million. There's an application I've developed that queries this data often, and I've done what I can to optimize it - for the most part it works very quickly, because we're only searching a very small subset of the data (tied to locations).

Table structure:

CREATE TABLE `prism_actions` (
  `id` int(11) unsigned NOT NULL auto_increment,
  `action_time` timestamp NOT NULL default CURRENT_TIMESTAMP,
  `action_type` varchar(25) NOT NULL,
  `player` varchar(16) NOT NULL,
  `world` varchar(255) NOT NULL,
  `x` int(11) NOT NULL,
  `y` int(11) NOT NULL,
  `z` int(11) NOT NULL,
  `block_id` mediumint(5) unsigned NOT NULL,
  `block_subid` mediumint(5) unsigned NOT NULL,
  `data` varchar(255) NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `x` (`x`),
  KEY `action_type` (`action_type`),
  KEY `player` (`player`),
  KEY `block_id` (`block_id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1;

I have several basic indexes on the fields we use most in our WHERE clauses, and when a query uses only a single condition, it's quite fast.

The example table I'm running these tests on has 22 million records.

Examples:

SELECT prism_actions.id FROM prism_actions WHERE prism_actions.action_type = 'block-break' LIMIT 1000;
1000 rows in set (0.00 sec)

SELECT prism_actions.id FROM prism_actions WHERE prism_actions.block_id = 2 LIMIT 1000;
1000 rows in set (0.01 sec)

My problem is that with each additional condition in a query (most queries have several conditions), the query takes much longer.

SELECT prism_actions.id FROM prism_actions WHERE prism_actions.action_type = 'block-break' AND prism_actions.block_id = 2 LIMIT 1000;
1000 rows in set (0.79 sec)

0.79 seconds would be acceptable for the full query, but that's using only part of the conditions.

A real query is more like:

SELECT prism_actions.id FROM prism_actions WHERE prism_actions.action_type = 'block-break' AND prism_actions.player = 'viveleroi' AND prism_actions.block_id = 2 LIMIT 1000;
1000 rows in set (2.22 sec)

With a single condition we ran at 0.01 seconds, with two at 0.79, and with three at 2.22 seconds, which is too long.
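
Running EXPLAIN on the multi-condition queries should show which index the optimizer actually picks - my expectation is that it settles on just one of the single-column indexes and filters the remaining rows, which would explain the slowdown:

-- the 'key' column of the output shows which single index was chosen
EXPLAIN SELECT prism_actions.id FROM prism_actions WHERE prism_actions.action_type = 'block-break' AND prism_actions.block_id = 2 LIMIT 1000;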

I'm going to research how I can better design my indexes but I'm mostly comfortable with the current db schema and indexes.

However, what can I try to make the conditions faster when used together like that?

Update

I spent time converting the tables into a foreign-key format. The player, action_type, and world column data were moved to separate tables, and the IDs for those were stored in the original tables. Took several hours to migrate the data.
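
Roughly, the converted layout looks like this (sketched from memory, so the exact table and index names may differ slightly):

-- lookup table sketch; prism_actiontypes and prism_worlds follow the same pattern
CREATE TABLE prism_players (
  player_id int(10) unsigned NOT NULL AUTO_INCREMENT,
  player varchar(16) NOT NULL,
  PRIMARY KEY (player_id),
  UNIQUE KEY player (player)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

-- prism_actions now stores actiontype_id, player_id and world_id
-- instead of the raw strings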

However, I'm re-running the same queries I had used earlier and while I see a speed increase in some, I see little change in others.

The converted version of the 0.79-second query above runs at about the same speed:

SELECT prism_actions.id FROM prism_actions WHERE prism_actions.actiontype_id = 1 AND prism_actions.block_id = 2 LIMIT 1000;
1000 rows in set (0.73 sec)

The block_id column still has an index from the original table schema.

Queries with the player_id as a condition were running very slowly, so I added an index to that column, and now those queries are blazing fast.
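
That index is nothing special - just a plain single-column index along these lines:

-- index name is arbitrary
ALTER TABLE prism_actions ADD INDEX player_id (player_id);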

However, after taking several example queries from a real user and updating them for this table structure, I see no change in speed.

SELECT prism_actions.id FROM prism_actions WHERE (prism_actions.actiontype_id = 2 OR prism_actions.actiontype_id = 1) AND (prism_actions.player_id = 1127) AND prism_actions.action_time >= '2013-02-22 07:47:54' LIMIT 1000;

Previously took 5.83 sec, currently takes 5.29 sec

Edit - it seems to be the timestamp. Taking the timestamp condition out of the above query returns the results in 0.01 seconds, but adding an index on the timestamp column does nothing. Any ideas?
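
Would a single composite index that puts the equality conditions first and the timestamp range last help here? Something along these lines (index name is arbitrary):

-- player_id and actiontype_id are matched by equality/IN, action_time by range
ALTER TABLE prism_actions ADD INDEX player_type_time (player_id, actiontype_id, action_time);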

So far, all I'm really seeing is a slight speed increase in certain areas and a small file-space saving, since we no longer store duplicate strings - but nothing that would warrant asking hundreds of users with databases this large to spend a day's worth of time converting data.

Any suggestions for some other way I might index stuff, etc?

helion3
  • what is the case if the engine is innodb? are they slower? – Pradyut Bhattacharya Mar 01 '13 at 17:12
  • Searching on the text columns will inevitably be much slower than searching on the int column, that's why the time goes up massively when you include these in the where clause. – cusimar9 Mar 01 '13 at 17:15
  • @PradyutBhattacharya MyISAM is advised for read only tables. InnoDB will be slower. – SparKot Mar 01 '13 at 17:16
  • can you make use of partitions? – SparKot Mar 01 '13 at 17:24
  • What's your MySQL version? – SparKot Mar 01 '13 at 17:32
  • This is for a distributed application so I'm limited to things that can be done on user's mysql servers, which are a little unpredictable. I think that trying to store text for player/worlds etc is just a bad idea entirely and that I really need to convert them to foreign keys. I'm doing that now and will see how much of a difference that makes. – helion3 Mar 01 '13 at 17:41
  • The following answer may help you optimise although you'll need to take the plunge and convert to the more performant innodb engine. Instead of using alter table to convert I would export the data (csv) ordering the output in your chosen clustered primary key order then re-import into the newly created innodb tables using one or more transactions to speed up load times. See http://stackoverflow.com/questions/4419499/mysql-and-nosql-help-me-to-choose-the-right-one/4421601#4421601 and http://stackoverflow.com/questions/2463602/mysql-load-data-infile-acceleration/2504211#2504211 – Jon Black Mar 04 '13 at 19:42

3 Answers


Move all the text columns (action_type, player, world) to separate lookup tables, and keep only the reference IDs in this table.

This will reduce the database size and should dramatically improve performance.
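
For example, with a player lookup table (the table and alias names here are illustrative), the string is resolved once in the small table and the big table is filtered by the integer ID:

-- p resolves the player string to its id; a is filtered by that id
SELECT a.id
FROM prism_actions a
JOIN prism_players p ON p.player_id = a.player_id
WHERE p.player = 'viveleroi'
  AND a.block_id = 2
LIMIT 1000;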

Pradyut Bhattacharya
  • I kinda figured. I'll see what I can do to transfer the data easily. I've considered making this adjustment already for a few other factors. I used to have a snippet of code pre-written for converting tables to foreign key... need to go find that. I'll see what impact that has and report back – helion3 Mar 01 '13 at 17:23
  • What about the timestamp field? Most of the time we're limiting the queries to a after or before a certain date - would it be more efficient to somehow store that in a different format (like a unix epoch timestamp)? I have it set as a timestamp on update so that we can offload the timestamp generation to mysql, which is a huge performance gain for our app. – helion3 Mar 01 '13 at 17:56
  • The timestamp will not have any impact. You could use an integer field instead of a timestamp, but you would need to convert to and from the timestamp with MySQL's UNIX_TIMESTAMP() function, and I think converting from int to timestamp would also take time when you are retrieving. – Pradyut Bhattacharya Mar 01 '13 at 20:23
  • Updated the post with some initial findings. – helion3 Mar 02 '13 at 01:08

MySQL 5.5: you can partition the table with PARTITION BY RANGE COLUMNS, for example:

CREATE TABLE `prism_actions` (
  `id` int(11) UNSIGNED NOT NULL AUTO_INCREMENT,
  `action_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `action_type` varchar(25) NOT NULL,
  `player` varchar(16) NOT NULL,
  `world` varchar(255) NOT NULL,
  `x` int(11) NOT NULL,
  `y` int(11) NOT NULL,
  `z` int(11) NOT NULL,
  `block_id` mediumint(5) UNSIGNED NOT NULL,
  `block_subid` mediumint(5) UNSIGNED NOT NULL,
  `data` varchar(255) NOT NULL,
  PRIMARY KEY (`id`,`action_type`,`player`,`block_id`),  -- the partitioning columns must be part of every unique key
  KEY `x` (`x`),
  KEY `action_type` (`action_type`),
  KEY `player` (`player`),
  KEY `block_id` (`block_id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1
PARTITION BY RANGE COLUMNS(action_type, player, block_id)(
PARTITION p0 VALUES LESS THAN ('dddddddd','dddddddd',1000000),
PARTITION p1 VALUES LESS THAN ('gggggggg','gggggggg',2000000),
PARTITION p2 VALUES LESS THAN ('jjjjjjjj','jjjjjjjj',3000000),
PARTITION p3 VALUES LESS THAN ('mmmmmmmm','mmmmmmmm',4000000),
PARTITION p4 VALUES LESS THAN ('pppppppp','pppppppp',5000000),
PARTITION p5 VALUES LESS THAN ('ssssssss','ssssssss',6000000),
PARTITION p6 VALUES LESS THAN ('uuuuuuuu','uuuuuuuu',7000000),
PARTITION p7 VALUES LESS THAN (MAXVALUE,MAXVALUE,MAXVALUE)
);

In the worst case, any given combination of (action_type, player, block_id) can belong to only one partition, so the scan is confined to that partition. Hence, it's better in comparison to the original query.

You get a bonus speed-up if you can analyse the frequency distribution of the column values and define the partitions accordingly; the partitions above are rough intervals.
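
You can check that a query is pruned to a single partition with EXPLAIN PARTITIONS (available since MySQL 5.1), for example:

-- the 'partitions' column in the output lists which partitions are actually scanned
EXPLAIN PARTITIONS SELECT id FROM prism_actions WHERE action_type = 'block-break' AND player = 'viveleroi' AND block_id = 2 LIMIT 1000;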

SparKot
  • That's an interesting solution. I think the foreign key conversion will be more beneficial, but in addition, this is a distributed app and I don't want to exclude people using slightly older versions of mysql. – helion3 Mar 01 '13 at 17:58
  • If those text fields are static (a predefined set), then yes, normalising (moving them to another table) will definitely increase performance on the single table. You'll be joining for the combined result. Hence: select, then join - not select after join. – SparKot Mar 01 '13 at 18:12

I would leave your table alone so you don't need to join again to get your result set. You just need a single index containing all the key columns of the WHERE clause, not one index per column. I would optimize based on which column narrows the result set the most: out of 22 million records, I bet quite a few match block_id = 2 and far fewer match a given player.

So, I would have an index on

create index multipart on prism_actions ( Player, Block_ID, Action_Type );

AS A SINGLE INDEX, not the individual single-column indexes you currently have. This allows the engine to jump directly to a given player (from 22 million rows down to, say, 2,000 entries), then to block_id = 2 (down to maybe 200), then to action_type = 'block-break' (roughly 20 records). Those are obviously just arbitrary sample counts, but the compound index should be all you need.
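
If some searches lead with a different column (for example, all records for a given block regardless of player), you might add one or two more compound indexes with a different leading column. Which ones are worth having depends on your most common queries, so treat these purely as examples:

-- examples only; pick leading columns to match your common searches
create index type_block on prism_actions ( Action_Type, Block_ID );
create index block_player on prism_actions ( Block_ID, Player );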

DRapp
  • My primary concern with a combined index is that the queries are relatively unpredictable. The app that sits on top of this database allows people to search their dataset with a list of parameters for matching nearly any of the fields. The query could be looking for players, with action_types, at coordinates, for block_ids, OR it could be looking for all records for players, OR all actions involving a block. How does that impact your recommendation? – helion3 Mar 01 '13 at 20:48
  • A multi-part index can be utilized by the engine. You might opt to build a couple of indexes based on what MIGHT be more common searches. The engine will use what is best per query. – DRapp Mar 01 '13 at 21:14
  • If I were to build a combined index like this in addition to what we have already, how would that impact write speeds? Write speed is of much less priority for me, because the writes happen from a queue when no one is really waiting for them - but slow reads are noticed by users when they take too long. Just trying to weigh the benefits. – helion3 Mar 01 '13 at 21:18
  • @BotskoNet, I don't know exactly how the engine utilizes compound indexes for queries that only use the second or third field of such an index... However, if you have a couple of indexes based on different "probable" or "more common" criteria you should benefit... You could also give different indexes a different field in the first position, so one might hit better for a given query, or simply because its first field is part of the query... I would try the multiple indexes and see what results you DO get. – DRapp Mar 01 '13 at 21:44