0

I have a periodic check of a certain query (which by the way includes multiple tables) to add informational messages to the user if something has changed since the last check (once a day).

I tried to make it work with checksum_agg(binary_checksum(*)), but it does not help, so this question doesn't help much, because I have a following case (oversimplified):

select  checksum_agg(binary_checksum(*))
from    
(
    select  1 as id,
            1 as status

    union all

    select  2 as id,
            0 as status
) data

and

select  checksum_agg(binary_checksum(*))
from    
(
    select  1 as id,
            0 as status

    union all

    select  2 as id,
            1 as status
) data

Both of the above cases result in the same check-sum, 49, and it is clear that the data has been changed.

This doesn't have to be a simple function or a simple solution, but I need some way to uniquely identify the difference like these in SQL server 2000.

Community
  • 1
  • 1
Miljenko Barbir
  • 1,133
  • 1
  • 9
  • 31
  • Related question: http://stackoverflow.com/questions/7362312/is-there-a-function-feature-in-sql-server-to-determine-if-a-table-has-any-recen – David Brabant May 03 '12 at 09:47
  • 1
    Try adding a `timestamp`/`rowversion` column to your base tables. – wqw May 03 '12 at 11:24

1 Answers1

1

checksum_agg appears to simply add the results of binary_checksum together for all rows. Although each row has changed, the sum of the two checksums has not (i.e. 17+32 = 16+33). This is not really the norm for checking for updates, but the recommendations I can come up with are as follows:

  1. Instead of using checksum_agg, concatenate the checksums into a delimited string, and compare strings, along the lines of SELECT binary_checksum(*) + ',' FROM MyTable FOR XML PATH(''). Much longer string to check and to store, but there will be much less chance of a false positive comparison.
  2. Instead of using the built-in checksum routine, use HASHBYTES to calculate MD5 checksums in 8000 byte blocks, and xor the results together. This will give you a much more resilient checksum, although still not bullet-proof (i.e. it is still possible to get false matches, but very much less likely). I'll paste the HASHBYTES demo code that I wrote below.
  3. The last option, and absolute last resort, is to actually store the table table in XML format, and compare that. This is really the only way you can be absolutely certain of no false matches, but is not scalable and involves storing and comparing large amounts of data.

Every approach, including the one you started with, has pros and cons, with varying degrees of data size and processing requirements against accuracy. Depending on what level of accuracy you require, use the appropriate option. The only way to get 100% accuracy is to store all of the table data.

Alternatively, you can add a date_modified field to each table, which is set to GetDate() using after insert and update triggers. You can do SELECT COUNT(*) FROM #test WHERE date_modified > @date_last_checked. This is a more common way of checking for updates. The downside of this one is that deletions cannot be tracked.

Another approach is to create a modified table, with table_name (VARCHAR) and is_modified (BIT) fields, containing one row for each table you wish to track. Using insert, update and delete triggers, the flag against the relevant table is set to True. When you run your schedule, you check and reset the is_modified flag (in the same transaction) - along the lines of SELECT @is_modified = is_modified, is_modified = 0 FROM tblModified

The following script generates three result sets, each corresponding with the numbered list earlier in this response. I have commented which output correspond with which option, just before the SELECT statement. To see how the output was derived, you can work backwards through the code.

-- Create the test table and populate it
CREATE TABLE #Test (
    f1 INT,
    f2 INT
)
INSERT INTO #Test VALUES(1, 1)
INSERT INTO #Test VALUES(2, 0)
INSERT INTO #Test VALUES(2, 1)

/*******************
OPTION 1
*******************/
SELECT CAST(binary_checksum(*) AS VARCHAR) + ',' FROM #test FOR XML PATH('')

-- Declaration: Input and output MD5 checksums (@in and @out), input string (@input), and counter (@i)
DECLARE @in VARBINARY(16), @out VARBINARY(16), @input VARCHAR(MAX), @i INT

-- Initialize @input string as the XML dump of the table
-- Use this as your comparison string if you choose to not use the MD5 checksum
SET @input = (SELECT * FROM #Test FOR XML RAW)

/*******************
OPTION 3
*******************/
SELECT @input

-- Initialise counter and output MD5.
SET @i = 1
SET @out = 0x00000000000000000000000000000000
WHILE @i <= LEN(@input)
BEGIN
    -- calculate MD5 for this batch
    SET @in = HASHBYTES('MD5', SUBSTRING(@input, @i, CASE WHEN LEN(@input) - @i > 8000 THEN 8000 ELSE LEN(@input) - @i END))
    -- xor the results with the output
    SET @out = CAST(CAST(SUBSTRING(@in, 1, 4) AS INT) ^ CAST(SUBSTRING(@out, 1, 4) AS INT) AS VARBINARY(4)) +
        CAST(CAST(SUBSTRING(@in, 5, 4) AS INT) ^ CAST(SUBSTRING(@out, 5, 4) AS INT) AS VARBINARY(4)) +
        CAST(CAST(SUBSTRING(@in, 9, 4) AS INT) ^ CAST(SUBSTRING(@out, 9, 4) AS INT) AS VARBINARY(4)) +
        CAST(CAST(SUBSTRING(@in, 13, 4) AS INT) ^ CAST(SUBSTRING(@out, 13, 4) AS INT) AS VARBINARY(4))
    SET @i = @i + 8000
END

/*******************
OPTION 2
*******************/
SELECT @out
Peter
  • 995
  • 5
  • 7
  • Unfortunately 'hashbytes' is not a recognized function name in SQL Server 2000. – Miljenko Barbir May 04 '12 at 05:28
  • 1
    Aaah SQL2000. I missed that one! I found an implementation of MD for T-SQL over here: http://www.hackchina.com/en/r/184114/MD5.sql__html. Apologies for the oversight. – Peter May 04 '12 at 06:07
  • I don't believe someone actually wrote this, thanks for the script and the solution. – Miljenko Barbir May 04 '12 at 07:48
  • Just a note for all those reading this question: I replaced `checksum_agg(binary_checksum(*))` from the query in the question with `checksum_agg(binary_checksum(dbo.MD5(cast(checksum(*) as nvarchar(4000)))))` which actually fits as a solution to my problem... – Miljenko Barbir May 04 '12 at 07:52