
I have a table with about 50 columns. Whenever a row changes, I don't know in advance which columns will change, and I don't want to deal with every permutation and combination when updating the table.

So, when I have to update a row, I update all 50 columns, which, I know, takes much more time than I'd expect when dealing with a huge number of updates.

  1. To address this, I have one solution: create different sets of fields that are frequently updated together, and design my application that way. I know this will require a change whenever a new field is added to my table. For example:
UPDATE TBLX SET NAME = ? WHERE ID = ?;

Result of Explain Update...

UPDATE
 INDEX SCAN of "TBLX" using "TBLX_ID"
  scan matches for (U_ID = ?5), filter by (column#0 = ?6)
  2. Another approach is to write a query with CASE WHEN ... THEN (as shown below). This way my code will still need updates, but not as many as the first approach would require.
UPDATE TBLX SET NAME = CASE WHEN (? != '####') THEN ? ELSE NAME END WHERE ID = ?;

Result of Explain Update...

UPDATE
 INDEX SCAN of "TBLX" using "TBLX_ID"
  scan matches for (U_ID = ?3), filter by (column#0 = ?4)

So my question is about the internals of query execution: how will each type of query be treated, and which one will run faster?

What I want to understand is whether the executor will ignore the parts of the query where I am not changing a column's value, i.e., where I assign a column its own value.

Kartik73
  • `UPDATE TBLX SET NAME = ? WHERE ? != '####'`? – jarlh Aug 06 '19 at 11:05
  • @jarlh You are right, but my main challenge is that I have 50 fields, so the update query will also have 50 fields, while the actual updated values are for only 3 of them. Technically I don't want the DBMS to update all the other fields; it costs us performance when that happens. – Kartik73 Aug 06 '19 at 11:19
  • You can use "EXPLAIN UPDATE ..." to get the query plan for each query and see what looks more efficient. Another option is either the UPSERT statement or the default TBLX.upsert procedure, which is auto-generated if the table has a primary key. – BenjaminBallard Aug 06 '19 at 16:07
  • Have you tried comparing them with explainproc? https://docs.voltdb.com/UsingVoltDB/sysprocexplainproc.php – Andrew Aug 06 '19 at 16:59
  • @BenjaminBallard I have updated the question with the outputs of the command 'EXPLAIN UPDATE ...'; it does not reveal the internal workings of VoltDB. All I want to understand is whether it will ignore a column when its value is not changed, i.e., when the column is assigned its own value. – Kartik73 Aug 07 '19 at 10:22

1 Answer


The plans show that both queries use a match on the TBLX_ID index, which is the fastest way to find the particular row or rows to be updated. If it is a single row, this should be quite fast.

The difference between these two queries is essentially the update work done once the row has been found. While the plan doesn't show the steps taken when updating a row, it should be fast either way: at that point, it's native C++ code updating a row in memory that it has exclusive access to. If I had to guess, the one using the CASE clause may take slightly longer, but the difference could be negligible. You'd have to run a benchmark and measure the execution times to be certain, but I would expect both to be fast.

What would be more significant than the difference between these two updates is how you handle updating multiple columns, because the cost of finding the affected row may be higher than the logic of the actual column updates. If you design it so that updating n columns requires queuing n SQL statements, the engine has to execute n statements and use the same index to find the same row n times; all of that overhead would be much more significant. If instead you use a single complex UPDATE statement with many parameters, where you can pass in new values for some columns and set the others to their current values, the engine only has to execute one statement and find the row once, so even though that seems complex, it would probably be faster. Faster still may be to simply update all of the columns to the new values, whether each value differs from the current one or not.
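For illustration, here is a minimal sketch of such a single multi-column UPDATE, extending the sentinel-parameter CASE pattern from your second query; the column names CITY and PHONE are placeholders, not your actual schema:

UPDATE TBLX
   SET NAME  = CASE WHEN (? != '####') THEN ? ELSE NAME  END,
       CITY  = CASE WHEN (? != '####') THEN ? ELSE CITY  END,
       PHONE = CASE WHEN (? != '####') THEN ? ELSE PHONE END
 WHERE ID = ?;

One statement and one index lookup can then change any subset of the columns: for each pair of parameters, you pass the new value twice to change a column, or the '####' sentinel plus any placeholder to keep its current value.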

If you can test this and run a few hundred examples, then the "exec @Statistics PROCEDUREDETAIL 0;" output will show the average execution time for each of the SQL statements as well as for the entire procedure. That should provide the metrics necessary to find the optimal way to do this.
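For example, from sqlcmd, after exercising each variant (this assumes the updates run inside stored procedures; PROCEDUREDETAIL returns one row per statement, including an AVG_EXECUTION_TIME column):

-- Run each UPDATE variant a few hundred times first, then:
exec @Statistics PROCEDUREDETAIL 0;
-- Compare the AVG_EXECUTION_TIME values on the rows for each STATEMENT
-- to see which form of the UPDATE is cheaper in practice.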

BenjaminBallard
  • Thank you for the elaborate explanation. It seems that only a benchmark run will tell me whether writing a complex query saves time or whether I will have to go some other way. – Kartik73 Aug 07 '19 at 19:13