
I am trying to identify the trending tags (based on maximum hits) on a time series using the MySQL JSON feature. Below is my table:

CREATE TABLE TAG_COUNTER (
    account       varchar(36) NOT NULL,
    time_id       INT NOT NULL,
    counters      JSON,
    PRIMARY KEY   (account, time_id)
);

In every web API request I get multiple different tags per account, and based on the number of tags I prepare the INSERT ... ON DUPLICATE KEY UPDATE query. The example below shows an insertion with two tags.

INSERT INTO `TAG_COUNTER`
  (`account`, `time_id`, `counters`)
VALUES
  ('google', '2018061023', '{"tag1": 1, "tag2": 1}')
ON DUPLICATE KEY UPDATE `counters` =
  JSON_SET(`counters`,
           '$."tag1"',
           IFNULL(JSON_EXTRACT(`counters`,
                        '$."tag1"'), 0) + 1,
           '$."tag2"',
           IFNULL(JSON_EXTRACT(`counters`,
                        '$."tag2"'), 0) + 1
  );
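
For illustration, running the statement above twice for the same (account, time_id) leaves both counters at 2:

SELECT counters
FROM TAG_COUNTER
WHERE account = 'google' AND time_id = 2018061023;
-- {"tag1": 2, "tag2": 2}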

time_id is in yyyyMMddhh format, so each row is an hourly aggregation.
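
For reference, a time_id in this format can be derived in MySQL itself (assuming the server clock is the time source):

SELECT CAST(DATE_FORMAT(NOW(), '%Y%m%d%H') AS UNSIGNED) AS time_id;
-- e.g. 2018061023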

Now my problem is the retrieval of trending tags. The query below gives me the aggregation for tag1, but we do not know the tag names before making this query.

SELECT
SUBSTRING(time_id, 1, 6) AS month,
SUM(counters->>'$.tag1')
FROM TAG_COUNTER
WHERE counters->>'$.tag1' > 0
GROUP BY month;
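
The tag names themselves are only discoverable at query time, for example (illustrative):

SELECT JSON_KEYS(counters)
FROM TAG_COUNTER
WHERE account = 'google' AND time_id = 2018061023;
-- ["tag1", "tag2"]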

So I need a generic GROUP BY query, along with an ORDER BY, to get the trending tags hourly/daily/monthly.

The sample of the expected output is:

Time(hour/day/month)  Tag_name  Tag_count_value(total hits)

When I searched the web, everywhere it is mentioned like {"tag_name": "tag1", "tag_count": 1} instead of the direct {"tag1": 1}, and tag_name was used in the GROUP BY.

Q1) So is it always mandatory to have a common, known JSON key to perform a GROUP BY?

Q2) If I have to go this way, what is the change to my INSERT ... ON DUPLICATE KEY UPDATE query for this new JSON label/value structure? The counter has to be created when it does not exist and incremented by one when it does.

Q3) Do I have to maintain an array of objects

[
 {"tag_name": "tag1", "tag_count": 2},
 {"tag_name": "tag2", "tag_count": 3}
]

OR an object of objects, keyed by tag name, like below?

{
 "tag1": {"tag_count": 2},
 "tag2": {"tag_count": 3}
}

So which of the above JSON structures is better in terms of INSERT and RETRIEVAL of the trending count?

Q4) Can I go with the existing {"key" : "value"} format instead of {"key_label" : key, "value_label" : value}, and is it still possible to extract the trending tags? I am thinking that {"key" : "value"} is very straightforward and good performance-wise.

Q5) While retrieving I am using SUBSTRING(time_id, 1, 6) AS month. Will it be able to use an index?

OR do I need to create multiple columns like time_hour (2018061023), time_day (20180610) and time_month (201806), and query the specific columns?

OR can I use MySQL date-time functions? Will those use an index for faster retrieval?
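
For comparison, a plain range condition on time_id itself (with account fixed) can use the primary key, since no function is applied to the column:

SELECT time_id, counters
FROM TAG_COUNTER
WHERE account = 'google'
  AND time_id BETWEEN 2018061000 AND 2018061023;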

Please help.

Kanagavelu Sugumar
  • Why are you using JSON? It should be a simple task if you used a normalized schema. Also it's not clear what you expect the SELECT query to return. – Paul Spiegel May 14 '18 at 07:24
  • it is JSON because we are not sure of the number of tags or which tags we will get in an hour, and new tags may arrive, hence the nosql schema. Will update the SELECT query. – Kanagavelu Sugumar May 14 '18 at 07:30
  • How do you define a "trending tag"? – Nick May 14 '18 at 07:38
  • @Nick based on highest count value during time range. – Kanagavelu Sugumar May 14 '18 at 07:39
  • In your table definition I presume `PRIMARY KEY (partner_id, time_id)` should be `PRIMARY KEY (account, time_id)` or is there more to that table? – Nick May 14 '18 at 07:42
  • @Nick Nothing more, I just corrected the question. – Kanagavelu Sugumar May 14 '18 at 08:20
  • I agree with @PaulSpiegel answer, it is much simpler than trying to use JSON. – Nick May 14 '18 at 08:22
  • @Nick But he didn't consider the optimisation and performance of his new suggestion for the use case. And many of my questions related to MySQL JSON are untouched. – Kanagavelu Sugumar May 14 '18 at 09:46
  • Agreed too about JSON not being the nicest solution here. If your issue is that you don't know the number of tags, you could use a many-to-many table or an EAV model. If one tag = 1 record, all the rest will be simpler. – Thomas G May 14 '18 at 09:49
  • @ThomasG At least with the JSON solution, account and time_id are not repeated within an hour's aggregation, and the number of rows is smaller, so the index content will also be smaller, hence inherently faster updates since there are fewer index changes. But with Paul's answer every tag within the time period is indexed. Not measured, but that is my thought process. – Kanagavelu Sugumar May 14 '18 at 10:01
  • @KanagaveluSugumar - Following your logic - the best/fastest solution would be to store all your data in a single huge BLOB (only one row). – Paul Spiegel May 14 '18 at 10:32
  • The big problem with JSON comes with trying to extract the data. As you saw with your query, you need to know the names of all the tags to be able to extract the values. So the only way to do it is either in the application framework or with a stored procedure. I don't think either of those is likely to be as efficient as @PaulSpiegel's solution. – Nick May 14 '18 at 12:33

2 Answers


I don't see a good reason why you use JSON here. It's also not clear why you believe that a "nosql schema" within MySQL would do anything better.

What you probably need is something like this:

CREATE TABLE TAG_COUNTER (
    account       varchar(36) NOT NULL,
    time_id       INT NOT NULL,
    tag_name      varchar(50) NOT NULL,
    counter       INT UNSIGNED NOT NULL,
    PRIMARY KEY   (account, time_id, tag_name)
);

This will simplify your queries. The INSERT statement would look like:

INSERT INTO TAG_COUNTER
  (account, time_id, tag_name, counter)
VALUES
  ('google', 2018061023, 'tag1', 1),
  ('google', 2018061023, 'tag2', 1)
ON DUPLICATE KEY UPDATE counter = counter + VALUES(counter);
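
On MySQL 8.0.19 and later the same statement can be written with a row alias, since the VALUES() function is deprecated from 8.0.20:

INSERT INTO TAG_COUNTER
  (account, time_id, tag_name, counter)
VALUES
  ('google', 2018061023, 'tag1', 1),
  ('google', 2018061023, 'tag2', 1) AS new
ON DUPLICATE KEY UPDATE counter = counter + new.counter;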

The SELECT statement might be something like this:

SELECT
    SUBSTRING(time_id, 1, 6) AS month,
    tag_name,
    SUM(counter) AS counter_agg
FROM TAG_COUNTER
GROUP BY month, tag_name
ORDER BY month, counter_agg DESC;

Note that I didn't try to optimize the table/schema for data size and performance. That would be a different question. But you can see that the queries are much simpler now.
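
Regarding the index question (Q5), a minimal sketch of the indexed generated column approach mentioned in the comments below, assuming MySQL 5.7+ (column and index names are illustrative):

ALTER TABLE TAG_COUNTER
  ADD COLUMN time_month INT AS (FLOOR(time_id / 10000)) VIRTUAL,  -- 2018061023 -> 201806
  ADD INDEX idx_time_month (time_month);

Filtering or grouping on time_month can then use idx_time_month instead of evaluating SUBSTRING(time_id, 1, 6) for every row.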

Paul Spiegel
  • Thanks! But will this create multiple rows in my db if the account has 4000 tags? In the same hour I would possibly see most of the tags as new rows in the system, and this will repeat for the multiple accounts I have. Do you think this is not an issue? – Kanagavelu Sugumar May 14 '18 at 08:00
  • 1
    @KanagaveluSugumar - This will create one row per account, hour and tag. Usually it shouldn't be an issue. But if you will have too many inserts/updates - I don't think that JSON would be any faster. – Paul Spiegel May 14 '18 at 08:11
  • Okay, and regarding Q5: will SUBSTRING(time_id, 1, 6) AS month be able to use an index? Will it be faster searching the results? Or do I need to create multiple columns like time_hour (2018061023), time_day (20180610), time_month (201806) and query the specific columns? – Kanagavelu Sugumar May 14 '18 at 08:18
  • @KanagaveluSugumar - `SUBSTRING(..)` will not use an index, unless you define an [indexed generated column](https://dev.mysql.com/doc/refman/5.7/en/create-table-secondary-indexes.html) with the same expression. – Paul Spiegel May 14 '18 at 08:24
  • Thanks, Paul, plus one. Would you mind suggesting the better approach for my time-series trending use case, considering optimisation and performance? I think this is a very common use case many developers face, and it will help me a lot to start implementing it. Your experience/answer will be very helpful to younger users like me. – Kanagavelu Sugumar May 14 '18 at 08:39
  • @KanagaveluSugumar - As I already wrote in my answer - Schema and query performance optimization is a different (and independent) question. Here I can just give you some hints: **a)** Use an index on virtual columns **b)** Use separate columns for year, month, day and hour instead of time_id **c)** Use a calendar table (and/or a sequence table) for a left join with a range condition. **d)** Combine (a), (b) and (c). – Paul Spiegel May 14 '18 at 10:05
  • I strongly agree with tossing JSON. I agreed with Paul until his last comment. "b)" is likely to be a waste; "c)" will be a performance disaster. Keep `time_id` since it _can_ be used for ranges. – Rick James May 15 '18 at 17:20

As I have said in comments, I think moving away from JSON is the way to go. However, if you want to keep using JSON, this function (a direct copy of the one in my answer to this question, see an explanation of what it does there) and procedure will do what you want.

DELIMITER //
DROP FUNCTION IF EXISTS json_sum_merge //
CREATE FUNCTION json_sum_merge(j1 JSON, total JSON) RETURNS JSON DETERMINISTIC
BEGIN
  DECLARE knum INT DEFAULT 0;
  DECLARE jkeys JSON DEFAULT JSON_KEYS(j1);
  DECLARE kpath VARCHAR(255); -- path buffer, sized for long quoted tag names
  DECLARE v INT;
  DECLARE l INT DEFAULT JSON_LENGTH(jkeys);
  kloop: LOOP
    IF knum >= l THEN
      LEAVE kloop;
    END IF;
    SET kpath = CONCAT('$.', JSON_EXTRACT(jkeys, CONCAT('$[', knum, ']')));
    SET v = JSON_EXTRACT(j1, kpath);
    IF JSON_CONTAINS_PATH(total, 'one', kpath) THEN
      SET total = JSON_REPLACE(total, kpath, JSON_EXTRACT(total, kpath) + v);
    ELSE
      SET total = JSON_SET(total, kpath, v);
    END IF;
    SET knum = knum + 1;
  END LOOP kloop;
  RETURN total;
END //
DELIMITER ;
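
For example (illustrative values):

SELECT json_sum_merge('{"tag1": 2}', '{"tag1": 3, "tag2": 1}');
-- {"tag1": 5, "tag2": 1}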

The procedure is similar to the one in my other answer, in that it finds all the distinct tags associated with a given substring of time_id (specified as a parameter) and sums the values associated with each tag. The individual tags and counts are then written to a temporary table, from which a selection is then made grouping by time period and tag name.

DELIMITER //
DROP PROCEDURE IF EXISTS count_tags //
CREATE PROCEDURE count_tags(IN period VARCHAR(50))
BEGIN
  DECLARE finished INT DEFAULT 0;
  DECLARE timeval VARCHAR(20);
  DECLARE knum, l INT;
  DECLARE jkeys JSON;
  DECLARE time_cursor CURSOR FOR SELECT DISTINCT time_id FROM TAG_COUNTER;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET finished=1;
  CREATE TEMPORARY TABLE tag_counts (Time VARCHAR(20), Tag_Name VARCHAR(30), Tag_count_value INT, INDEX(Time, Tag_Name));
  OPEN time_cursor;
  time_loop: LOOP
    FETCH time_cursor INTO timeval;
    IF finished=1 THEN
      LEAVE time_loop;
    END IF;
    SET @total = '{}';
    SET @query = CONCAT("SELECT MIN(@total:=json_sum_merge(counters, @total)) INTO @json FROM TAG_COUNTER WHERE time_id='", timeval, "'");
    PREPARE stmt FROM @query;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
    SET @query = CONCAT('INSERT INTO tag_counts VALUES(', period, ', ?, ?)');
    PREPARE stmt FROM @query;
    SET @timeval = timeval;
    SET l = JSON_LENGTH(@total);
    SET jkeys = JSON_KEYS(@total);
    SET knum = 0;
    key_loop: LOOP
      IF knum >= l THEN
        LEAVE key_loop;
      END IF;
      SET @k = JSON_EXTRACT(jkeys, CONCAT('$[', knum, ']'));
      SET @t = JSON_EXTRACT(@total, CONCAT('$.', @k));
      EXECUTE stmt USING @k, @t;
      SET knum = knum + 1;
    END LOOP key_loop;
    DEALLOCATE PREPARE stmt;
  END LOOP time_loop;
  SELECT Time, Tag_Name, SUM(Tag_count_value) AS Tag_count_value FROM tag_counts GROUP BY Time, Tag_Name;
  DROP TABLE tag_counts;
END //
DELIMITER ;

A couple of examples based on some limited sample data from your prior question. In these examples @timeval is equivalent to the time_id column. Input data:

account     time_id     counters
google      20180510    {"gmail_page_viewed": 2, "search_page_viewed": 51}
google      20180511    {"gmail_page_viewed": 3, "search_page_viewed": 102}
apple       20180511    {"apple_page_viewed": 5, "search_page_viewed": 16}

CALL count_tags('@timeval'):

Time        Tag_Name                Tag_count_value
20180510    "gmail_page_viewed"     2
20180510    "search_page_viewed"    51
20180511    "apple_page_viewed"     5
20180511    "gmail_page_viewed"     3
20180511    "search_page_viewed"    118

CALL count_tags('SUBSTRING(@timeval, 1, 6)'):

Time    Tag_Name                Tag_count_value
201805  "apple_page_viewed"     5
201805  "gmail_page_viewed"     5
201805  "search_page_viewed"    169

Note that you can also use json_sum_merge to simplify your INSERT query, e.g.

INSERT INTO `TAG_COUNTER`
  (`account`, `time_id`, `counters`)
VALUES
  ('apple', '20180511', '{"apple_page_viewed": 9, "itunes_page_viewed": 4}')
ON DUPLICATE KEY UPDATE `counters` = json_sum_merge(VALUES(counters), counters)

Result:

account     time_id     counters
apple       20180511    {"apple_page_viewed": 14, "itunes_page_viewed": 4, "search_page_viewed": 16}

In terms of the specific questions in your question:

  1. No. This answer shows it can be done with your existing data format.
  2. Not applicable.
  3. Not applicable.
  4. Yes, you can stick with the existing {"key": "value"} format.
  5. Since we have to go through every entry of tag_counter to get the list of tags, an index is not beneficial for that section. For the temporary table I have included indexes on the Time and Tag_Name columns which should benefit speed as they are used directly in the GROUP BY clause.

If you were to maintain a list of keys (e.g. in a separate table, maintained by a trigger on insert/update/delete to tag_counter) this code could be made a lot simpler and more efficient. But that is for another question.

Nick
  • Thank you, Nick, for trying to give a solution to the requirement and addressing all the concerns I asked about. – Kanagavelu Sugumar May 21 '18 at 07:02
  • No worries. It was an interesting challenge and I hope what I've done helps you in some way even if you decide not to use it. – Nick May 21 '18 at 07:06
  • Nick, so what is your suggestion on Q5 about the time_id datatype (`time_id INT NOT NULL`)? This is needed to give trending based on hour or day or week in the GROUP BY, and in a WHERE condition with start and end times to filter. – Kanagavelu Sugumar May 21 '18 at 07:20
  • I think for the way you are using it a VARCHAR is actually the most appropriate. I don't see any advantage in changing to a DATE or TIME type. – Nick May 21 '18 at 07:22
  • No, I am using it as INT so I can use a WHERE condition like `time_id > 20180511 AND time_id < 20180512` for a day report, not VARCHAR. Is that fine? – Kanagavelu Sugumar May 21 '18 at 07:25
  • INT should be fine. I wrote my procedure thinking of it as a VARCHAR but the automatic casting means it works anyway. – Nick May 21 '18 at 08:29