Why is a UDF so much slower than a subquery?

Question

I have a case where I need to translate (lookup) several values from the same table. The first way I wrote it, was using subqueries:

SELECT
    (SELECT id FROM user WHERE user_pk = created_by) AS creator,
    (SELECT id FROM user WHERE user_pk = updated_by) AS updater,
    (SELECT id FROM user WHERE user_pk = owned_by) AS owner,
    [name]
FROM asset

As I'm using this subquery a lot (that is, I have about 50 tables with these fields), and I might need to add some more code to the subquery (for example, "AND active = 1" ) I thought I'd put these into a user-defined function UDF and use that. But the performance using that UDF was abysmal.

CREATE FUNCTION dbo.get_user ( @user_pk INT )
RETURNS INT
AS BEGIN 
    RETURN ( SELECT id
             FROM   ice.dbo.[user]
             WHERE  user_pk = @user_pk )
END

SELECT dbo.get_user(created_by) as creator, [name]
FROM asset

The performance of #1 is less than 1 second. Performance of #2 is about 30 seconds...

Why, or more importantly, is there any way I can code in SQL server 2008, so that I don't have to use so many subqueries?

Edit:

Just a litte more explanation of when this is useful. This simple query (that is, get userid) gets a lot more complex when I want to have a text for a user, since I have to join with profile to get the language, with a company to see if the language should be fetch'ed from there instead, and with the translation table to get the translated text. And for most of these queries, performance is a secondary issue to readability and maintainability.

score 34 · Accepted Answer · edited Jun 28 '19 at 10:50

34

The UDF is a black box to the query optimiser so it's executed for every row. You are doing a row-by-row cursor. For each row in an asset, look up an id three times in another table. This happens when you use scalar or multi-statement UDFs (In-line UDFs are simply macros that expand into the outer query)

One of many articles on the problem is "Scalar functions, inlining, and performance: An entertaining title for a boring post".

The sub-queries can be optimised to correlate and avoid the row-by-row operations.

What you really want is this:

SELECT
   uc.id AS creator,
   uu.id AS updater,
   uo.id AS owner,
   a.[name]
FROM
    asset a
    JOIN
    user uc ON uc.user_pk = a.created_by
    JOIN
    user uu ON uu.user_pk = a.updated_by
    JOIN
    user uo ON uo.user_pk = a.owned_by

Update Feb 2019

SQL Server 2019 starts to fix this problem.

edited Jun 28 '19 at 10:50

deHaar

11,298
10
32
38

answered Feb 04 '09 at 10:09

gbn

394,550
75
549
647

3

To be equivalent to his original query he actually needs a left join, not an inner join, because the nested queries may return null – Scott Wisniewski Feb 04 '09 at 10:13
True, to be 100% like for like – gbn Feb 04 '09 at 10:43
Good explanation on the why the UDF kills performance, what throws me off is that the execution plan shows 3 steps: index seek 36%, Compute scalar 1% and hash match 63%. Anyway: a join is NOT an option, allready unreadable code will only be a LOOT worse. I'm looking for other solutions. – devzero Feb 04 '09 at 10:54
3

devzero, the join is the absolute fastest solution and is not unreadable to anyone familiar with t-sql.Joins are almost always far better than subqueries and tremendously better than UDFs which should be avoided at all costs. – HLGEM Feb 04 '09 at 14:48
HLGEM: the performance gain using joins is not so much compared to subselects, and have when one of the joins above is expanded into 4 join it's NOT readable. (ie the total number of joins above will be 13, and for my real query it would be about 25+) – devzero Feb 06 '09 at 14:31
The syntax for CTE and inline function is actually just about the same when used (declaration is very diffrent). So pro CTE: it does not clutter the DB, pro UDF: it can be reused in several queries, and thus better to maintain. I guess a view could work just as well as a UDF. – devzero Feb 06 '09 at 18:43
1

Reusability has a place: but udfs in this context are a performance killer. I'd use this udf to resolve a single name in places, not for a set based operation. I also wouldn't use a view, personally, because the temptation is to reuse it and kill performance again. – gbn Feb 07 '09 at 07:33
"The udf is a black box to the query optimiser so it's executed every row" sold me to upvote this question. I was never sure why execution plan did not calculate UDF well... but now i have a better idea why. – dance2die Sep 19 '09 at 15:56
1

With SQL Server 2019, this behavior changes due to scalar UDF inlining (based on the Froid VLDB paper). Scalar UDFs are no longer black boxes to the QO. Froid can decorrelate queries inside of UDFs and also enable parallelism. – Karthik Feb 08 '19 at 11:10

LukeH · Answer 2 · 2009-02-06T10:23:24.653

13

As other posters have suggested, using joins will definitely give you the best overall performance.

However, since you've stated that that you don't want the headache of maintaining 50-ish similar joins or subqueries, try using an inline table-valued function as follows:

CREATE FUNCTION dbo.get_user_inline (@user_pk INT)
RETURNS TABLE AS
RETURN
(
    SELECT TOP 1 id
    FROM ice.dbo.[user]
    WHERE user_pk = @user_pk
        -- AND active = 1
)

Your original query would then become something like:

SELECT
    (SELECT TOP 1 id FROM dbo.get_user_inline(created_by)) AS creator,
    (SELECT TOP 1 id FROM dbo.get_user_inline(updated_by)) AS updater,
    (SELECT TOP 1 id FROM dbo.get_user_inline(owned_by)) AS owner,
    [name]
FROM asset

An inline table-valued function should have better performance than either a scalar function or a multistatement table-valued function.

The performance should be roughly equivalent to your original query, but any future changes can be made in the UDF, making it much more maintainable.

edited Feb 06 '09 at 10:23

answered Feb 04 '09 at 11:54

LukeH

242,140
52
350
400

Just answering the question! I agree with you that JOINs are the best way to do this, but the OP indicated in their question that they wanted to do this with a UDF but were disappointed with the performance. They also state in their comment to your post that "join is NOT an option". – LukeH Feb 04 '09 at 22:48
I'm aware that join's will give the best performance, but it gives me 20+ joins all in all. This is not easy to maintain, and the performance gain (0.00x seconds to 0.0x seconds or something) isn't an issue. – devzero Feb 06 '09 at 14:25
UDF's used as above might seem worse to read when it's simplified like here, but when you start adding to it, it's much better. In my case I have another similar UDF that contains a tripple join, with 4 where clauses. Try repeating this 3 times inside a select and you get the point. – devzero Feb 06 '09 at 14:28
@devzero, Did you just accept this answer and then change your mind? Your comments seem to suggest that this is your preferred solution. – LukeH Feb 06 '09 at 14:34
yes, this is the preferred solution, however, this is solution is the same as the solution in the link from the answer above. And he was first. If I could I would accept both answers, as they are both correct. – devzero Feb 06 '09 at 18:45
This should be the accepted solution, I tried this before, I had an scalar function and I have a query WHERE X = and I got better performance by simply changing the function invocation by a subquery: WHERE X = (select ) – Israel Garcia Jun 07 '18 at 20:14

score 2 · Answer 3 · answered Feb 04 '09 at 10:40

To get the same result (NULL if user is deleted or not active).

 select 
    u1.id as creator,
    u2.id as updater,
    u3.id as owner,
    [a.name]
 FROM asset a
        LEFT JOIN user u1 ON (u1.user_pk = a.created_by AND u1.active=1) 
        LEFT JOIN user u2 ON (u2.user_pk = a.created_by AND u2.active=1) 
        LEFT JOIN user u3 ON (u3.user_pk = a.created_by AND u3.active=1)

score 0 · Answer 4 · edited Oct 22 '11 at 16:21

0

Am I missing something? Why can't this work? You are only selecting the id which you already have in the table:

select created_by as creator, updated_by as updater, 
owned_by as owner, [name]
from asset

By the way, in designing you really should avoid keywords, like name, as field names.

edited Oct 22 '11 at 16:21

Peter Mortensen

28,342
21
95
123

answered Feb 04 '09 at 14:53

HLGEM

88,902
11
105
171

"id" is the required column and it's only in the "user" table. "user_pk" is the reference column that's present in both "asset" and "user". – LukeH Feb 04 '09 at 15:00
To clarify, "asset.created_by", "asset.updated_by" and "asset.owned_by" refer to "user.user_pk", and the column that's required is the corresponding "user.id". – LukeH Feb 04 '09 at 15:02
Also in our discussions we have come to the conclusion that "name" is not a bad field name. This is because most of the time the field name will be prefixed (ie user.name) and this is better than f.eks table_user.user_name or some such. – devzero Feb 06 '09 at 18:48
It is a bad field name because it is a reserved word. You really should try to avoid those as they can create unnecessary bugs when people forget, as they inevitably do, to qualify them properly. – HLGEM Feb 06 '09 at 19:06

Why is a UDF so much slower than a subquery?

Edit:

4 Answers4

Linked