GROUP BY without the GROUP BY in MySQL


Ever have a need to perform a GROUP BY in MySQL against a really large table? If you’ve run this type of operation on a table with more than a few million rows, then you know the workload it entails, and you’ve probably spent time scratching your head while staring at the screen, waiting for a result to come back. Consider this example:

CREATE TABLE product_sale_history (
  transaction_id BIGINT(20) NOT NULL,
  product_id     BIGINT(20) NOT NULL,
  sale_timestamp DATETIME NOT NULL,
  qty            INT(11) NOT NULL,
  PRIMARY KEY (transaction_id),
  KEY product_id (product_id)
) ENGINE=InnoDB;
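
To make the walkthrough concrete, pretend the table holds rows like these (the ids, timestamps, and quantities are all made-up values):

INSERT INTO product_sale_history VALUES
  (1, 42, '2015-06-01 10:00:00', 5),  -- product 42 sold twice...
  (2, 42, '2015-06-01 11:30:00', 3),
  (3, 99, '2015-06-02 09:15:00', 7);  -- ...product 99 once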

Let’s say that in this table, you have 20 million rows, and you want to sum up the total sales by product id.  Logically, this is done with a simple GROUP BY clause, like this:

SELECT product_id, SUM(qty) AS total_qty
  FROM product_sale_history
  GROUP BY product_id;

Now while this makes perfect sense, there is a problem. In order to resolve the query, MySQL will try to pull the entire table into memory, sort the data by product id, then sum up the qty column. Unless you have tons of memory, this will probably end up writing a bunch to temp space and run about as slow as an alcoholic snail after a weekend bender.
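
You can see the problem in the query plan. The exact output varies by MySQL version and available indexes, but on a big table like this, EXPLAIN will typically flag the temporary table being built:

EXPLAIN SELECT product_id, SUM(qty) AS total_qty
  FROM product_sale_history
  GROUP BY product_id;

Look for "Using temporary" (and, on older versions, "Using filesort") in the Extra column; that temp table is the work we want to avoid.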

So how do you make this thing faster? Well, the trick is to avoid all that sorting and the I/O caused by writing to temp space. MySQL (and MariaDB, etc.) provides the ON DUPLICATE KEY UPDATE clause on INSERT. What this guy does is let you stream the data from your base table directly into another table, summing up your data in the process. No sorting involved: it just reads from one table and writes to the other. It depends on your summary table having a primary key (or unique index). You will intentionally cause duplicate key errors, but instruct MySQL to do something useful with them instead of just dying with an error.

Confused yet? Don’t worry… It’s simpler than it sounds!

For our example above, instead of running the GROUP BY query, we first create a summary table to hold our results:

CREATE TABLE product_id_sale_counts (
  product_id BIGINT NOT NULL PRIMARY KEY,
  total_qty  INT    NOT NULL
);
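
Before wiring it up to the history table, here’s the clause in isolation, a minimal sketch against an empty copy of the summary table (the product id 42 and the quantity are made-up values):

INSERT INTO product_id_sale_counts (product_id, total_qty)
  VALUES (42, 5)
  ON DUPLICATE KEY UPDATE total_qty = total_qty + 5;

Run it once and it inserts the row (42, 5); run the exact same statement again and, instead of erroring, it updates the row to (42, 10).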

Now that we have a table to hold the results, we can write a simple INSERT statement that reads from the product_sale_history table and writes to the product_id_sale_counts table:

INSERT INTO product_id_sale_counts (product_id, total_qty)
  SELECT product_id, qty
  FROM product_sale_history;

Ok, so this will grab the entire sale history from product_sale_history. But something should jump out at you: in the product_sale_history table, there will be multiple entries for each product_id (each entry with a different value for qty). As the data loads into the table, the first insert of a product id will be successful, but the next time that same product id is encountered, it’s going to bounce back with a duplicate key error (on product_id). This is where the magic happens…

Instead of letting it error, we will tell MySQL to do something intelligent about it: add qty to the existing total_qty! Here’s the command:

INSERT INTO product_id_sale_counts (product_id, total_qty)
  SELECT product_id, qty
  FROM product_sale_history
  ON DUPLICATE KEY UPDATE total_qty = total_qty + qty;

Awesome, ain’t it?? So now, for every product id that’s found, it will either insert a new row the first time it’s encountered, or add qty to total_qty for each subsequent row with that product id. The benefit is that this can be done without the additional overhead of sorting all the data in memory or on disk.
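
To sanity-check the results, peek at the summary table and spot-check one product against the raw history (the id 42 here is just a placeholder):

-- Top sellers according to the summary table:
SELECT product_id, total_qty
  FROM product_id_sale_counts
  ORDER BY total_qty DESC
  LIMIT 10;

-- The same number computed from the raw history:
SELECT SUM(qty)
  FROM product_sale_history
  WHERE product_id = 42;

For any given product id, the two numbers should match exactly.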

This can be expanded to summarize by hour, by day, by month, etc. on sale_timestamp if you like; a per-day sketch follows below. Experiment with different variations… do some benchmarks… I guarantee that you will end up using this quite a bit more often than GROUP BY!
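
For example, here is a minimal sketch of a per-day rollup; the table name and the composite key are my own choices for illustration, not the only way to do it:

CREATE TABLE product_daily_sale_counts (
  product_id BIGINT NOT NULL,
  sale_date  DATE   NOT NULL,
  total_qty  INT    NOT NULL,
  PRIMARY KEY (product_id, sale_date)
);

INSERT INTO product_daily_sale_counts (product_id, sale_date, total_qty)
  SELECT product_id, DATE(sale_timestamp), qty
  FROM product_sale_history
  ON DUPLICATE KEY UPDATE total_qty = total_qty + qty;

Same trick, just with a composite primary key so each product gets one row per day.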
