0

Given the following two tables, I'd like to know how to calculate the MEDIAN rating for reviews on a weekly basis.

reviews

(id, user_id, completed_at)

reviews.completed_at -- lets us know the user submitted the review, it's not a draft.

reviews_areas

reviews_areas (created_at, review_id, rating)

reviews_areas.rating = INT between 0…10)

Example Data:

reviews:

+----+---------+---------------------+
| id | user_id |    completed_at     |
+----+---------+---------------------+
|  1 |     100 | 2019-07-20 11:34:40 |
|  2 |     100 | 2019-07-22 11:34:40 |
|  3 |     500 | 2019-07-30 16:34:40 |
+----+---------+---------------------+

reviews_areas:

+------------+-----------+--------+
| created_at | review_id | rating |
+------------+-----------+--------+
| 1:34:40    |         1 |      0 |
| 12:34:40   |         1 |      5 |
| 11:34:40   |         1 |     10 |
| 5:34:40    |         1 |      9 |
| 6:34:40    |         2 |      1 |
| 1:34:40    |         2 |      5 |
| 2:32:40    |         3 |      5 |
+------------+-----------+--------+
halfer
  • 18,701
  • 13
  • 79
  • 158
AnApprentice
  • 97,752
  • 174
  • 583
  • 971
  • Median of what? – Gordon Linoff Jul 31 '19 at 00:23
  • I think rating) – Y.K. Jul 31 '19 at 00:23
  • My apologies, the median of the ratings – AnApprentice Jul 31 '19 at 00:24
  • 1
    what does the `reviews_areas` table mean? `created_at` stores without date? We need to extract week from `completed_at` column in `reviews`? – Y.K. Jul 31 '19 at 00:33
  • reviews.completed_at -- lets us know the user submitted the review, it's not a draft. – AnApprentice Jul 31 '19 at 00:36
  • answer other questions please so that I can help you – Y.K. Jul 31 '19 at 00:38
  • `reviews_areas` is areas in the reviews... Think of Yelp, you rate the restaurant and then you can optionally rate multiple aspects like price, food quality, friendlieness etc... You have to rate at least one reviews_areas but you are not limitied to just 1. does that help? – AnApprentice Jul 31 '19 at 01:06
  • Is the week from `reviews.completed_at` or `reviews_areas.created_at`? If it's `reviews_areas.created_at`, how do we get the week from a time? – Schwern Jul 31 '19 at 01:09
  • We could do something like `floor(datediff(reviews.created_at, curdate()) / 7) AS weeks_ago and then `GROUP BY weeks_ago ORDER BY weeks_ago DESC` - what do you think of that? – AnApprentice Jul 31 '19 at 01:10

1 Answers1

1

The problem is not clear. I'll use reviews.completed_at for the date because reviews_areas.created_at contains just a time.

We need to join reviews for the date with reviews_areas for the rating.

To avoid the same week on different years from overlapping, we use yearweek to turn dates into the year + week.

To get the median we need to find the middle row (or rows if there's an even number) for each week. There's many ways to do this. I'll crib from Justin Grant's clever answer. We count the row_number() ordered ascending and descending. Where they overlap +/- 1 are the median rows. Then we average them.

1 2 3 4 5 6
6 5 4 3 2 1
    ^^^
    median rows

First, we get the row numbers by week.

select
    yearweek(completed_at) as week,
    rating,
    row_number() over(
        partition by yearweek(completed_at)
        order by rating asc, id asc
    ) as row_asc,
    row_number() over(
        partition by yearweek(completed_at)
        order by rating desc, id desc
    ) as row_desc
from reviews_areas ra
join reviews r on r.id = ra.review_id

The row numbers are ordered by order by rating asc, id asc. The id is a secondary sort to disambiguate rows with the same rating.

+--------+--------+---------+----------+
| week   | rating | row_asc | row_desc |
+--------+--------+---------+----------+
| 201928 |     10 |       4 |        1 |
| 201928 |      9 |       3 |        2 |
| 201928 |      5 |       2 |        3 |
| 201928 |      0 |       1 |        4 |
| 201929 |      5 |       2 |        1 |
| 201929 |      1 |       1 |        2 |
| 201930 |      5 |       1 |        1 |
+--------+--------+---------+----------+

Then we use that as a common table expression to average the middle rows of each week. A subquery works just as well.

with rating_weeks as (
    select
        yearweek(completed_at) as week,
        rating,
        row_number() over(
            partition by yearweek(completed_at)
            order by rating asc, id asc
        ) as row_asc,
        row_number() over(
            partition by yearweek(completed_at)
            order by rating desc, id desc
        ) as row_desc
    from reviews_areas ra
    join reviews r on r.id = ra.review_id
)
select
    week,
    -- Take the average of the possibly 2 median rows
    avg(rating)
from rating_weeks
where
    -- Find the rows which overlap +/- 1. These are the median rows.
    row_asc in (row_desc, row_desc - 1, row_desc + 1)
group by week
order by week
+--------+-------------+
| week   | avg(rating) |
+--------+-------------+
| 201928 |      7.0000 |
| 201929 |      3.0000 |
| 201930 |      5.0000 |
+--------+-------------+
Schwern
  • 127,817
  • 21
  • 150
  • 290