10

I have the following query optimization problem in Spanner, and hoping there's a trick I'm missing that will help me bend the query planner to my will.

Here's the simplified schema:

create table T0 (
  key0  int64 not null,
  value int64,
  other int64 not null,
) primary key (key0);

create table T1 {
  key1  int64 not null,
  other int64 not null
} primary key (key1);

And a query with a subquery in an IN clause:

select value from T0 t0
where t0.other in (
  select t1.other from T1 t1 where t1.key1 in (42, 43, 44)  -- note: this subquery is a good deal more complex than this
)

Which produces a 10 element set, via a hash join of T0 against the output of the subquery:

Operator                     Rows  Executions
-----------------------      ----- ----------
Serialize Result               10          1
Hash Join                      10          1
  Distributed union         10000          1
    Local distributed union 10000          1
    Table Scan: T0          10000          1
  Distributed cross apply:      5          1
   ...lots moar T1 subquery stuff...

Note that, while the subquery is complex, it actually produces a very small set. Unfortunately, it also scans the entirety of T1 to feed to the hash join, which is very slow.

However, if I take the output of the subquery on T1 and manually shove it into the IN clause:

select value from T0
where other in (5, 6, 7, 8, 9)  -- presume this `IN` clause to be the output of the above subquery

It is dramatically faster, presumably because it just hits T0's index once per entry, not using a hash join on the full contents:

Operator                Rows Executions
----------------------- ---- ----------
Distributed union         10          1
Local distributed union   10          1
Serialize Result          10          1
Filter                    10          1
Index Scan:               10          1

I could simply run two queries, and that's my best plan so far. But I'm hoping I can find some way to cajole Spanner into deciding that this is what it ought to do with the output of the subquery in the first example. I've tried everything I can think of, but this may simply not be expressible in SQL at all.

Also: I haven't quite proven this yet, but in some cases I fear that the 10 element subquery output could blow up to a few thousand elements (T1 will grow more or less without bound, easily to millions). I've manually tested with a few hundred elements in the splatted-out IN clause and it seems to perform acceptably, but I'm a little concerned it could get out of hand.

Note that I also tried a join on the subquery, like so:

select t0.other from T0 t0
join (
  -- Yes, this could be a simple join rather than a subquery, but in practice it's complex
  -- enough that it can't be expressed that way.
  select t1.other from T1 t1 where t1.key = 42
) sub on sub.other = t0.other

But it did something truly horrifying in the query planner, that I won't even try to explain here.

Maxim
  • 3,172
  • 9
  • 21
Joel Webber
  • 222
  • 1
  • 7
  • The subquery as written is a bit confusing: Did you mean to say `key1` instead of `key`? Also: As written, the subquery could only possibly return one result since key1 is the full primary key; perhaps you should have two primary keys for T1, or you could say `t1.key1 IN (42, 43, 44)` ? – Mike Curtiss Jun 12 '17 at 21:48
  • Whoops, sorry -- just noticed this comment. Yeah, that's a mistake I made as I was trying to abstract the problem. It should read basically as you suggest. I'll edit it to reflect that to avoid future confusion. – Joel Webber Jun 14 '17 at 14:08

1 Answers1

2

Does your actual subquery in the IN clause use any variables from T0? If not, what happens if you try your join query with the tables reordered (and a distinct added for correctness, unless you know that the values will be distinct)?

SELECT t0.other FROM  (
      -- Yes, this could be a simple join rather than a subquery, but in practice it's complex
      -- enough that it can't be expressed that way.
      SELECT DISTINCT t1.other FROM T1 t1 WHERE t1.key = 42
    ) sub 
JOIN T0 t0
ON sub.other = t0.other
Mike Curtiss
  • 1,428
  • 11
  • 30
  • Thanks, Mike, that's perfect! We just came up with roughly the same answer over the weekend (minus the `distinct` part, which I think might be necessary as well). That scans rows roughly `O(sub-query)`, which is exactly what I'd hoped for. – Joel Webber Jun 13 '17 at 13:21
  • You're welcome. I'm actually just passing along the answer from someone else in our team who doesn't have a SO account, but we're glad it worked out! – Mike Curtiss Jun 14 '17 at 02:15