
I have large datasets with millions of records in XML format. These datasets are full data dumps of a database up to a certain point in time.

Between two dumps new entries might have been added and existing ones might have been modified or deleted. Assume the schema remains unchanged and that every entry has a unique ID.

What would be the best way to determine the delta between two of these datasets (including deletions and updates)?


My plan is to load everything to an RDBMS and go from there.

First, load the older dump. Then, load the newer dump into a different schema, checking as I go whether each entry is new or an update to an existing one. If so, I'll log its ID to a new table called "changes."

After that's done, I'll walk through every entry in the old dump and check whether it has a matching record (i.e., same ID) in the new dump. If not, log it to "changes."

Assuming looking up a record by ID is an O(log n) operation, this should allow me to do everything in O(n log n) time.

Because I can determine the difference just from the presence or absence of records, using only the ID and the last modification date, I could also load everything into main memory. The time complexity would be the same, but with far less disk I/O, which should make this faster by orders of magnitude.
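The in-memory variant described above can be sketched with dicts keyed by ID. (The record layout of ID mapped to last-modified timestamp is an assumption; adapt the XML parsing to the actual schema.) Note that with hash-based lookups the per-record check is O(1) on average, so this pass is effectively O(n) rather than O(n log n):

```python
def diff_dumps(old, new):
    """old/new: dicts mapping record ID -> last-modified timestamp.
    Returns (added_ids, deleted_ids, updated_ids)."""
    added = [i for i in new if i not in old]          # in new dump only
    deleted = [i for i in old if i not in new]        # in old dump only
    updated = [i for i in new                         # present in both,
               if i in old and new[i] != old[i]]      # timestamp changed
    return added, deleted, updated

old = {1: "2011-01-01", 2: "2011-02-01", 3: "2011-03-01"}
new = {2: "2011-02-01", 3: "2011-04-15", 4: "2011-05-01"}
print(diff_dumps(old, new))  # ([4], [1], [3])
```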

Suggestions? (Note: This is more of a performance question than anything)

NullUserException
  • "Because I can determine...which should make this faster by orders of magnitude". "This is more of a performance question than anything". ...sooo doing this in memory will be much quicker, and you're primarily concerned with performance. Sounds like you answered your own question. – Gerrat Sep 06 '11 at 20:38

5 Answers


RedGate's SQL Data Compare

adamcodes

Look at DeltaXML.

(padded because StackOverflow doesn't allow short answers)

Michael Kay

As an unusual suggestion, consider using git for this. Bring the first dataset under version control, then clean your working directory and copy in the second dataset. git is damn fast at bringing up the difference.

Andomar
  • Can git handle that if the records are in no particular order (ie: the order is not guaranteed to stay the same)? – NullUserException Sep 06 '11 at 17:57
  • @NullUserException: git works on file structures. If you're talking about the Stack Overflow export, you could store each question XML in a file questionid.xml (not sure, never looked at the export in detail.) – Andomar Sep 06 '11 at 18:04
  • All the questions are in the same XML file... I really want to avoid creating millions of XML files... – NullUserException Sep 06 '11 at 18:06

Take a look at this post on MSDN, which provides a solution for getting the differences between two DataTables. It should point you in the right direction:

How to compare two DataTables:
http://social.msdn.microsoft.com/Forums/en/csharpgeneral/thread/23703a85-20c7-4759-806a-fabf4e9f5be6

You might also want to take a look at this SO question too:
Compare two DataTables to determine rows in one but not the other

I've also seen this approach used a few times:

table1.Merge(table2);                         // merge the new data into the old table
DataTable changesTable = table1.GetChanges(); // rows added or modified by the merge
James Johnson
select
    coalesce(a.id, b.id) as id,
    case 
        when a.id is null then 'included' 
        when b.id is null then 'deleted'
        when a.col != b.col then 'updated'
    end as status
from a
full outer join b on a.id = b.id
where a.id is null or b.id is null or a.col != b.col
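Here's a runnable sketch of that full-outer-join diff using Python's sqlite3. Since older SQLite versions lack FULL OUTER JOIN, it's emulated below with a LEFT JOIN plus a UNION ALL; the table and column names (a, b, col) mirror the query above and are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, col TEXT);  -- old dump
    CREATE TABLE b (id INTEGER PRIMARY KEY, col TEXT);  -- new dump
    INSERT INTO a VALUES (1, 'x'), (2, 'y');
    INSERT INTO b VALUES (2, 'z'), (3, 'w');
""")
rows = conn.execute("""
    -- rows deleted or updated: everything in a with no match or a changed col in b
    SELECT a.id, CASE WHEN b.id IS NULL THEN 'deleted'
                      WHEN a.col != b.col THEN 'updated' END AS status
    FROM a LEFT JOIN b ON a.id = b.id
    WHERE b.id IS NULL OR a.col != b.col
    UNION ALL
    -- rows included: everything in b with no match in a
    SELECT b.id, 'included'
    FROM b LEFT JOIN a ON a.id = b.id
    WHERE a.id IS NULL
    ORDER BY 1
""").fetchall()
print(rows)  # [(1, 'deleted'), (2, 'updated'), (3, 'included')]
```

One caveat either way: comparing columns with `!=` misses changes to or from NULL, so nullable columns need `IS DISTINCT FROM` (or a COALESCE workaround) on engines that support it.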
Clodoaldo Neto