
I wrote a statistics/tracking platform to monitor my websites' activity. For each visitor an entry is created in a table with some relevant info: IP, referrer, geolocated country code, etc.

The database contains well over 7 million rows and is about 4 GB in size.

Now, I have a script which has to run SUM queries to get the number of impressions, clicks, etc. for a defined time interval ($from_date and $to_date).

$query = "SELECT COUNT(ip) as tot, sum(clicked=0) as a0, sum(clicked=1) as a1, sum(converted=0) as a2, sum(converted=1) as a3 FROM tf_data WHERE cid='".$cid."' and stamp between ".$from_date." and ".$to_date." "; 

This query alone takes forever to run, even though I'm on a decent VPS. I have several other queries like this on my summary page, which causes the script to time out frequently (REQUEST TIMEOUT: This request takes too long to process, it is timed out by the server. If it should not be timed out, please contact administrator of this web site to increase 'Connection Timeout'.)

What should I do? How can I handle that huge amount of data? Should I create a separate table and run a cron job to insert/update the impression and click counts?

How is such a thing normally done?

Thanks as always!

EDIT: TABLE STRUCTURE:

CREATE TABLE `tf_data` (
  `click_id` int(11) NOT NULL AUTO_INCREMENT,
  `ip` varchar(225) NOT NULL,
  `agent` text NOT NULL,
  `referer` text NOT NULL,
  `stamp` text NOT NULL,
  `country` varchar(30) NOT NULL,
  `src` text NOT NULL,
  `adspot` varchar(250) NOT NULL,
  `cid` text NOT NULL,
  `adid` text NOT NULL,
  `lp` varchar(250) NOT NULL,
  `offer` int(11) NOT NULL,
  `clicked` int(11) NOT NULL,
  `converted` int(11) NOT NULL,
  `converted2` int(11) NOT NULL,
  `price` varchar(255) NOT NULL,
  PRIMARY KEY (`click_id`),
  UNIQUE KEY `ip` (`ip`)
) ENGINE=MyISAM AUTO_INCREMENT=9599999 DEFAULT CHARSET=latin1
DonCroce
  • Do you have an index on `stamp`? – nico Nov 06 '11 at 10:06
  • Can you post your table structure `SHOW CREATE TABLE tf_data`? – Konerak Nov 06 '11 at 10:08
  • Perhaps you should move to an OLAP database in order to summarize data. Something like palo http://en.wikipedia.org/wiki/Palo_(OLAP_database), Microsoft analysis services, tableau software, .... – dani herrera Nov 06 '11 at 10:19
  • I have added the table structure – DonCroce Nov 06 '11 at 11:07
  • take a look at the innodb engine - pay particular attention to clustered primary key indexes - the following answer has the required reading http://stackoverflow.com/questions/4419499/mysql-nosql-help-me-to-choose-the-right-one-on-a/4421601#4421601 – Jon Black Nov 06 '11 at 16:20

3 Answers


Firstly, if stamp is your timestamp (I assume it is), convert it to a proper date/time type and put an index on it; this will let you access small subsets of the data (by date) much more efficiently.
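A minimal sketch of that conversion, assuming stamp currently holds Unix timestamps stored as text (the new column and index names are just placeholders):

-- Sketch only: assumes `stamp` holds Unix timestamps stored as text,
-- and that `stamp_dt` / `idx_stamp_dt` are placeholder names.
ALTER TABLE tf_data ADD COLUMN stamp_dt DATETIME NOT NULL;
UPDATE tf_data SET stamp_dt = FROM_UNIXTIME(stamp);
ALTER TABLE tf_data ADD INDEX idx_stamp_dt (stamp_dt);

With that index in place, the BETWEEN filter on the date range can use a range scan instead of a full table scan.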

Then create a summary table, run this calculation per day over all the existing data, and set up a daily job to keep it up to date.

This way, for historical usage you only need to look at the (much smaller) summary table, and possibly at the new rows added since the last summarisation.
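For example, a per-day summary table and the nightly aggregation query might look roughly like this (the table name, column names and VARCHAR size are just assumptions, and stamp_dt is the converted date column from the sketch above):

-- Sketch only: names and sizes are placeholders; stamp_dt is the converted
-- DATETIME column from the previous sketch.
CREATE TABLE tf_data_daily (
  cid         VARCHAR(64) NOT NULL,
  stat_date   DATE        NOT NULL,
  impressions INT         NOT NULL,
  clicks      INT         NOT NULL,
  conversions INT         NOT NULL,
  PRIMARY KEY (cid, stat_date)
);

-- Daily cron job: aggregate yesterday's rows into the summary table.
INSERT INTO tf_data_daily (cid, stat_date, impressions, clicks, conversions)
SELECT cid, DATE(stamp_dt), COUNT(*), SUM(clicked = 1), SUM(converted = 1)
FROM tf_data
WHERE stamp_dt >= CURDATE() - INTERVAL 1 DAY
  AND stamp_dt <  CURDATE()
GROUP BY cid, DATE(stamp_dt);

The summary page then sums over tf_data_daily for the requested date range, touching at most a few rows per campaign per day instead of millions of raw rows.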

If you don't always need to look at all the data, you could also consider table partitioning (be it built-in or a poor man's version); that way you can still reference all the data as a whole when you need to, but only access a subset of it when speed matters.
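As a rough illustration of built-in range partitioning by month (note that MySQL requires every unique key, including the primary key, to contain the partitioning column, so the existing keys on tf_data would have to be adjusted before this would be accepted):

-- Illustration only: assumes the primary/unique keys have been changed to
-- include stamp_dt, otherwise MySQL will reject the statement.
ALTER TABLE tf_data
  PARTITION BY RANGE (TO_DAYS(stamp_dt)) (
    PARTITION p2011_10 VALUES LESS THAN (TO_DAYS('2011-11-01')),
    PARTITION p2011_11 VALUES LESS THAN (TO_DAYS('2011-12-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
  );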

Matthew Watson

7 million rows isn't all that much. How many rows does a normal WHERE clause return?

Whenever you have a query performance problem, look at the query execution plan.
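For the query in the question, that simply means prefixing it with EXPLAIN (the cid and timestamp values below are placeholders); the type, key and rows columns of the output show whether an index is used and how many rows MySQL expects to scan:

-- Placeholder cid and Unix-timestamp values, just to illustrate EXPLAIN.
EXPLAIN SELECT COUNT(ip) AS tot,
       SUM(clicked = 0) AS a0, SUM(clicked = 1) AS a1,
       SUM(converted = 0) AS a2, SUM(converted = 1) AS a3
FROM tf_data
WHERE cid = '123' AND stamp BETWEEN 1320530400 AND 1320616800;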

Ask yourself, "Do I have the right data types?" As Matthew Watson pointed out, you don't have a timestamp (DATETIME) data type in that table. In this case, using DATETIME instead of TEXT will probably reduce the size of your database by about 10 bytes per row. (I guess "stamp" could be a DATE instead of DATETIME. Less savings, but still savings, and no conversions for date comparisons.)

Ask yourself, "Can I improve the indexing?" You need indexes on "cid" and "stamp" at the very least.

Ask yourself, "Can I reduce the number of columns in either the base table or in the query?" (The combination of "converted" and "converted2" is suspicious.)

Ask yourself, "Can I reduce the number of rows?"

After you've done all that, consider partitioning.

After partitioning (or maybe before partitioning) consider OLAP tables for summarizing.

Mike Sherrill 'Cat Recall'

As a quick solution I would suggest offline computation + table partitioning + a HASH index on cid and a BTREE index on stamp. A long-term solution would be a NoSQL datastore such as Cassandra, and maybe Hadoop + Pig to compute the aggregates and push the data into Cassandra.

dmn