
I have large datasets with millions of records in XML format. These datasets are full data dumps of a database up to a certain point in time.

Between two dumps new entries might have been added and existing ones might have been modified or deleted. Assume the schema remains unchanged and that every entry has a unique ID.

What would be the best way to determine the delta between two of these datasets (including deletions and updates)?


My plan is to load everything to an RDBMS and go from there.

First, load the older dump. Then, load the newer dump into a different schema, checking as I go whether each entry is new or an update to an existing one. If so, I'll log its ID to a new table called "changes."

After that's done, I'll walk through every entry in the old dump and check whether it has a matching record (i.e., same ID) in the new dump. If not, log it to "changes."

Assuming looking up a record by ID is an O(log n) operation, this should allow me to do everything in O(n log n) time.

Because I can determine the difference just from the presence or absence of records, using only the ID and the last modification date, I could also load everything into main memory. The time complexity would be the same, but with far less disk I/O, which should make this faster by orders of magnitude.
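The in-memory variant described above can be sketched with dicts keyed by ID. (The record layout of ID mapped to last-modified timestamp is an assumption; adapt the XML parsing to the actual schema.) Note that with hash-based lookups the per-record check is O(1) on average, so this pass is effectively O(n) rather than O(n log n):

```python
def diff_dumps(old, new):
    """old/new: dicts mapping record ID -> last-modified timestamp.
    Returns (added_ids, deleted_ids, updated_ids)."""
    added = [i for i in new if i not in old]          # in new dump only
    deleted = [i for i in old if i not in new]        # in old dump only
    updated = [i for i in new                         # present in both,
               if i in old and new[i] != old[i]]      # timestamp changed
    return added, deleted, updated

old = {1: "2011-01-01", 2: "2011-02-01", 3: "2011-03-01"}
new = {2: "2011-02-01", 3: "2011-04-15", 4: "2011-05-01"}
print(diff_dumps(old, new))  # ([4], [1], [3])
```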

Suggestions? (Note: This is more of a performance question than anything)

NullUserException
  • "Because I can determine...which should make this faster by orders of magnitude". "This is more of a performance question than anything". ...sooo doing this in memory will be much quicker, and you're primarily concerned with performance. Sounds like you answered your own question. – Gerrat Sep 06 '11 at 20:38

5 Answers


RedGate's SQL Data Compare

adamcodes

Look at DeltaXML.

(padded because StackOverflow doesn't allow short answers)

Michael Kay

As an unusual suggestion, consider using git for this. Bring the first dataset under version control, then clean your working directory and copy in the second dataset. git is damn fast at bringing up the difference.

Andomar
  • Can git handle that if the records are in no particular order (ie: the order is not guaranteed to stay the same)? – NullUserException Sep 06 '11 at 17:57
  • @NullUserException: git works on file structures. If you're talking about the Stack Overflow export, you could store each question XML in a file questionid.xml (not sure, never looked at the export in detail.) – Andomar Sep 06 '11 at 18:04
  • All the questions are in the same XML file... I really want to avoid creating millions of XML files... – NullUserException Sep 06 '11 at 18:06

Take a look at this post on MSDN, which provides a solution for getting the differences between two DataTables. It should point you in the right direction:

How to compare two DataTables:
http://social.msdn.microsoft.com/Forums/en/csharpgeneral/thread/23703a85-20c7-4759-806a-fabf4e9f5be6

You might also want to take a look at this SO question too:
Compare two DataTables to determine rows in one but not the other

I've also seen this approach used a few times:

table1.Merge(table2);                         // merge the new data into the old table
DataTable changesTable = table1.GetChanges(); // rows added or modified by the merge
James Johnson
select
    coalesce(a.id, b.id) as id,
    case 
        when a.id is null then 'included' 
        when b.id is null then 'deleted'
        when a.col != b.col then 'updated'
    end as status
from a
full outer join b on a.id = b.id
where a.id is null or b.id is null or a.col != b.col
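Here's a runnable sketch of that full-outer-join diff using Python's sqlite3. Since older SQLite versions lack FULL OUTER JOIN, it's emulated below with a LEFT JOIN plus a UNION ALL; the table and column names (a, b, col) mirror the query above and are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, col TEXT);  -- old dump
    CREATE TABLE b (id INTEGER PRIMARY KEY, col TEXT);  -- new dump
    INSERT INTO a VALUES (1, 'x'), (2, 'y');
    INSERT INTO b VALUES (2, 'z'), (3, 'w');
""")
rows = conn.execute("""
    -- rows deleted or updated: everything in a with no match or a changed col in b
    SELECT a.id, CASE WHEN b.id IS NULL THEN 'deleted'
                      WHEN a.col != b.col THEN 'updated' END AS status
    FROM a LEFT JOIN b ON a.id = b.id
    WHERE b.id IS NULL OR a.col != b.col
    UNION ALL
    -- rows included: everything in b with no match in a
    SELECT b.id, 'included'
    FROM b LEFT JOIN a ON a.id = b.id
    WHERE a.id IS NULL
    ORDER BY 1
""").fetchall()
print(rows)  # [(1, 'deleted'), (2, 'updated'), (3, 'included')]
```

One caveat either way: comparing columns with `!=` misses changes to or from NULL, so nullable columns need `IS DISTINCT FROM` (or a COALESCE workaround) on engines that support it.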
Clodoaldo Neto