0

Let's say we're handling the advertising of various job openings across several channels (newspapers, job boards, etc.). For each channel, we can buy a "publication period" which will mean the channel will advertise our job openings during that period. How can we find the jobs for a given channel that have a publication period valid for today (i.e. starting on or before today, and ending on or after today)? The intent is to be able to generate a feed of "active" job openings that (e.g.) a job board can consume periodically to determine which jobs should be displayed to its users.

Another wrinkle is that each job opening is associated with a given tenant id: the feeds will have to be generated scoped to tenant and channel.

Let's say we have the following simplified documents (if you think the data should be modeled differently, please let me know also):

{
  "_id": "A",
  "tenant_id": "foo",
  "name": "Job A",
  "publication_periods": [
    {
      "channel": "linkedin",
      "start": "2021-03-10T00:00:0.0Z",
      "end": "2021-03-17T00:00:0.0Z"
    },
    {
      "channel": "linkedin",
      "start": "2021-04-10T00:00:0.0Z",
      "end": "2021-04-17T00:00:0.0Z"
    },
    {
      "channel": "monster.com",
      "start": "2021-03-10T00:00:0.0Z",
      "end": "2021-03-17T00:00:0.0Z"
    }
  ]
}


{
  "_id": "B",
  "tenant_id": "foo",
  "name": "Job B",
  "publication_periods": [
    {
      "channel": "linkedin",
      "start": "2021-04-10T00:00:0.0Z",
      "end": "2021-04-17T00:00:0.0Z"
    },
    {
      "channel": "monster.com",
      "start": "2021-03-15T00:00:0.0Z",
      "end": "2021-03-20T00:00:0.0Z"
    }
  ]
}


{
  "_id": "C",
  "tenant_id": "foo",
  "name": "Job C",
  "publication_periods": [
    {
      "channel": "monster.com",
      "start": "2021-05-15T00:00:0.0Z",
      "end": "2021-05-20T00:00:0.0Z"
    }
  ]
}


{
  "_id": "D",
  "tenant_id": "bar",
  "name": "Job D",
  "publication_periods": [
    ...
  ]
}

How can I query the jobs linked to tenant "foo" that have an active publication period for "monster.com" on for the date of 17.03.2021? (I.e. this query should return both jobs A and B.)

Note that the DB will contain documents of other (irrelevant) types.

Since I essentially need to "find all job openings containing an object in the publication_periods array having: CHAN as the channel value, "start" <= DATE, "end" >= DATE" it appears I'd require a Mango query to achieve this, as standard view queries don't provide comparison operators (if this is mistaken, please correct me).

Naturally, I want the Mango query to be executed only on relevant data (i.e. exclude documents that aren't job openings), but I can find references on how to do this (whether in the docs or elsewhere): all resources I found simply seem to define the Mango index on the entire set of documents, relying on the fact that documents where the indexed field is absent won't be indexed.

How can I achieve what I'm after?

Initially, I was thinking of creating a view that would emit the publication period information along with a {'_id': id} object in order to "JOIN" the job opening document to the matching periods at query time (per Best way to do one-to-many "JOIN" in CouchDB). However, I realized that I wouldn't be able to query this view as needed (i.e. "start" value before today, "end" value after today) since I wouldn't have a definite start/end key to use... And I have no idea how to properly leverage a Mango index/query for this. Presumably I'd have to create a partial index based on document type and the presence of publication periods, but how can I even index the multiple publication periods that can be located within a single document? Can a Mango index be defined against a specific view as opposed to all documents in the DB?

David Sulc
  • 25,448
  • 3
  • 49
  • 52

1 Answers1

0

I stumbled upon this answer Mango search in Arrays indicating that I should be able to index the data with

{
   "index": {
      "fields": [
          "tenant_id",
         "publication_periods.[].channel",
         "publication_periods.[].start",
         "publication_periods.[].end"
      ]
   },
   "ddoc": "job-openings-periods-index",
   "type": "json"
}

And then query them with

{
   "selector": {
      "tenant_id": "foo",
      "publication_periods": {
        "$elemMatch": {
            "$and": [
              {
                "channel": "monster.com"
              },
              {
                "start": {
                  "$lte": "2021-03-17T00:00:0.0Z"
                }
              },
              {
                "end": {
                  "$gte": "2021-03-17T00:00:0.0Z"
                }
              }
            ]
          }
      }
   },
   "use_index": "job-openings-periods-index"
   "execution_stats": true
}

Sadly, I'm informed that the index "was not used because it does not contain a valid index for this query" and terrible performance, which I will leave for another question.

David Sulc
  • 25,448
  • 3
  • 49
  • 52
  • 1
    Beware `$elemMatch` queries are perfomed in memory. Partial indexes can mitigate overhead, but as the index grows the bonus for the partial index is diminished. The noSQL mantra is "denormalize, denormalize, denormalize" makes sense, but there are cases where too much denormalization is not good. If it is possible to break each `publication_periods` element out into individual documents, I'd do so to avoid the in-memory processing. – RamblinRose Mar 22 '21 at 12:20
  • 1
    Thanks for letting me know `$elemMatch` queries are performed in memory (why then have a syntax to index them?). In addition, I'm not able to have a query use an index when an array attribute is both indexed (i.e. in the `fields` array) and queried for via `$elemMatch`. So in the end, I'll likely leverage a "normal" view with an array index containing the "end" timestamp (and emit one row per publication period) which will enable ignoring old data when querying, while the rejecting of future publication periods will take place in application code... – David Sulc Mar 24 '21 at 16:19