0

I'm building a dashboard that reports on changes in data over time. Rows are grouped by category, and each category has a change interval. For each category, I'd like to efficiently select the row whose time is closest to (most recent time for this category) - (interval for this category).

| category | value | time | interval |
|----------|-------|------|----------|
| 1        | 2     | 100  | 30       |
| 1        | 1     | 50   | 30       |
| 1        | 0     | 0    | 30       |    
| 2        | 2     | 120  | 90       |
| 2        | 1     | 70   | 90       |
| 2        | 0     | 5    | 90       |

In this example, the most recent category_1 time is 100. Since the category_1 interval is 30, I want to get the category_1 row closest to time (100 - 30) = 70, which in this case is the row with time 50.

On the other hand, the category_2 interval is 90, so we want the category_2 row closest to time 120 - 90 = 30, which is the row with time 5.

I'm working with Amazon RDS. I'd prefer to use widely-supported SQL features since the database may change.

Here are some options I'm aware of. First, I could use a correlated subquery to compute these values directly; however, this is not performant.

Second, I could create a custom aggregate function. However, results like Function to Calculate Median in SQL Server tend to use techniques that don't apply to my problem, like percentiles and cases.

For reference, here's how I'm getting the most recent row in each group:

SELECT s1.category, time
FROM dashboard s1 
JOIN (
    SELECT category, MAX(time) AS s2time 
    FROM dashboard
    GROUP BY category) AS s2
    ON s1.category = s2.category AND s1.time = s2time
ORDER BY category;

What's my best option here?

cuniculus
  • 636
  • 8
  • 15

2 Answers2

2

You can express this logic using a correlated subquery:

select t.*,
       (select t2.time
        from t t2
        where t2.category = t.category and
              t2.time <= t.time - t.interval
        order by t2.time desc
        limit 1
       ) as prev_matching_time
from t;
Gordon Linoff
  • 1,122,135
  • 50
  • 484
  • 624
  • Thank you! However, as I mentioned above, I'm concerned about the potential performance impact of correlated subqueries. Am I wrong to be concerned? – cuniculus Aug 23 '19 at 22:02
  • 1
    @cuniculus . . . Your question doesn't have a database tag, so it is hard to be more general. With an index on `(category, time)` -- in most databases -- the performance might be ok. – Gordon Linoff Aug 24 '19 at 12:30
1

Getting the most recent row in each group is the 1st step.
Then you have to calculate the difference between the time and the interval and join to the table with an ON clause containing the condition that the table's matching row will have the nearest time to that difference.
Here's is how I did it:

SELECT t.*
FROM dashboard t INNER JOIN (
  SELECT s1.category, s1.time - s1.interval time
  FROM dashboard s1 INNER JOIN (
    SELECT category, MAX(time) AS s2time 
    FROM dashboard
    GROUP BY category
  ) AS s2 ON s1.category = s2.category AND s1.time = s2time
) s 
ON s.category = t.category 
AND ABS(t.time - s.time) = (
  SELECT ABS(time - s.time) dif 
  FROM dashboard
  WHERE category = t.category
  ORDER BY dif
  LIMIT 1 
)

I can't tell about its efficiency, so you have to test it.
See the demo.
Results:

category | value | time | interval
-------: | ----: | ---: | -------:
       1 |     1 |   50 |       30
       2 |     0 |    5 |       90
forpas
  • 117,400
  • 9
  • 23
  • 54
  • 1
    Thank you for your help, this works perfectly. RE: efficiency - it's fine for my usage. I may need to optimize it eventually :). – cuniculus Aug 27 '19 at 14:57