4

I have a problem similar to LIMITing a SQL JOIN, but with a slightly more complex requirement.

I want to search for Users and associated Transactions, which lie within a time range:

SELECT u.*, t.*
FROM User u
JOIN Transaction t ON t.user_id = u.id
WHERE t.timestamp >= ? and t.timestamp <= ?;

So far, so good. Now I want to repeat the query, but with a LIMIT on the number of users returned. There should be no limit on the number of transactions returned for a given user, though.

If I follow the approach suggested in the other question, this would translate into:

SELECT u.*, t.*
FROM (SELECT * FROM User LIMIT 10) u
JOIN Transaction t ON t.user_id = u.id
WHERE t.timestamp >= ? and t.timestamp <= ?;

This will not produce what I want: it will return the first 10 users, who might not have any transactions associated.

I want to return 10 users who have at least one associated transaction in the given time range.

How can I achieve this using MySQL?

Community
  • 1
  • 1
BenMorel
  • 30,280
  • 40
  • 163
  • 285

2 Answers2

1

You can use variables for this:

SELECT *
FROM (
  SELECT *,
         @rn := IF(@uid = user_id, @rn,
                   IF(@uid := user_id, @rn +1, @rn + 1)) AS rn 
  FROM (
     SELECT u.*, t.*
     FROM User u
     JOIN Transaction t ON t.user_id = u.id
     WHERE t.timestamp >= x and t.timestamp <= y) AS t
  CROSS JOIN (SELECT @rn := 0, @uid := 0) AS vars
  ORDER BY user_id) AS x
WHERE x.rn <= 10

Variable @rn is incremented by 1 every time a new user is returned by the query. So we can control the number of users returned using @rn <= 10.

Giorgos Betsos
  • 68,064
  • 7
  • 48
  • 83
  • Thank you, this works indeed, except a missing closing parenthese at the end of the second `IF()`. Is there a performance impact of the two subselects? – BenMorel May 31 '16 at 11:50
  • Also, thinking a bit more about it, it looks to me that MySQL will have to read the complete result (all rows) of the original JOIN (probably in a temporary table?), before returning only the first few ones while executing the outer SELECT. Can't this be a huge performance bottleneck when you have a lot of users, with only a few of them having matching transactions? – BenMorel May 31 '16 at 11:57
  • @Benjamin You can test all proposed queries with your actual data and tell us how they compare to each other. – Giorgos Betsos May 31 '16 at 12:20
1

You can do this without variables, but it requires repeating the join logic:

SELECT u.*, t.*
FROM (SELECT *
      FROM User
      WHERE EXISTS (SELECT 1
                    FROM Transaction t
                    WHERE t.user_id = u.id AND
                          t.timestamp >= ? and t.timestamp <= ?
                   )
      LIMIT 10
     ) u JOIN
     Transaction t
     ON t.user_id = u.id
WHERE t.timestamp >= ? and t.timestamp <= ?;

EDIT:

Probably the fastest answer is something like this:

select u.*, t.*
from (select user_id
      from (select user_id
            from transaction t
            where t.timestamp >= ? and t.timestamp <= ?
            limit 1000
           ) t
      limit 30
     ) tt join
     user u
     on tt.userid = u.id join
     transaction t
     on tt.userid = t.userid and t.timestamp >= ? and t.timestamp <= ?;

The first subquery chooses 1,000 matching records in the transaction table. My guess is that this is more than enough to get 30 users. This list is then joined to the user and transaction table to get the final results. By limiting the list without having to do a full table scan, the first query should be pretty fast . . . especially with an additional index on (timestamp, user).

Gordon Linoff
  • 1,122,135
  • 50
  • 484
  • 624
  • Thanks, I'm not fond of variable-based queries myself, but repeating the JOIN is even more cumbersome I would say, especially if the query gets more complex (and it will). Anyway, what's the performance impact of repeating the JOIN? Does MySQL have to do the job twice, or is there a performance optimization performed somewhere? – BenMorel May 31 '16 at 11:51