
Hi, I am currently trying to optimize an SSIS package that does an Upsert/Delete handling 93+ million rows from a table in a remote MariaDB source. The table also contains approximately 63 columns.

Currently I'm using Sort and Merge Join components in my package, but as I've read in some guides, it's recommended to do the sorting on the server side rather than with the Sort component in the SSIS data flow, since the latter puts a load on the SSIS server's memory.

And since I'm currently running this solution in Azure Data Factory, the package fails (it most often times out, even though I've increased the timeout properties both on the package side and in Azure Data Factory).

What is the recommended way to tackle this?

If I've understood it right, and as I mentioned before, I can skip the load on the SSIS server by sorting on the DB server side. But as I'm new to the whole SQL and SSIS stuff, I'm not quite sure what such a sort would look like in the SQL command.

I've also thought about batching, but even here I'm uncertain how that would work in SSIS.

What is recommended here?

My SSIS-Package looks like this right now:

I followed this type of example: Synchronize Table Data Using a Merge Join in SSIS

[Screenshot of the SSIS package data flow]

(FYI: the red error icons are there because I lost the connection while taking the screenshot; otherwise this is a fully working solution.)

John

1 Answer

I have two recommendations:

Server-side sorting

In the OLE DB Source, change the access mode to SQL Command and use an ORDER BY clause:

SELECT * FROM table ORDER BY col1, col2

After that, open the OLE DB Source advanced editor (right-click the OLE DB Source and choose Show Advanced Editor), go to the Input and Output Properties tab, select the OLE DB Source Output and set its IsSorted property to True, then set the SortKeyPosition for the output columns used in the ORDER BY clause.

Read data in chunks

I don't have good knowledge of MariaDB SQL syntax, but I will provide some examples in SQLite and Oracle:
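
For example, reading in chunks of 5000 rows at a time (the table/column names and the chunk size are illustrative; keep the same ORDER BY in every chunk query so the output stays sorted):

SQLite (MariaDB/MySQL accept the same LIMIT ... OFFSET form):

-- read one chunk of 5000 rows, ordered by the key columns
SELECT * FROM mytable
ORDER BY col1, col2
LIMIT 5000 OFFSET 0;   -- next chunks: OFFSET 5000, then 10000, ...

Oracle (12c and later):

SELECT * FROM mytable
ORDER BY col1, col2
OFFSET 0 ROWS FETCH NEXT 5000 ROWS ONLY;

One common way to drive this in SSIS is to store the offset in a package variable, build the SQL command as an expression, and increment the variable inside a For Loop container until the source returns no more rows.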


Update 1 - Package problems

There are some problems in the package:

  • You are reading from and writing to the same table
  • You are performing update and delete operations on a large amount of data
  • You are using Merge Join

Some recommendations:

  • Try using a staging table instead of reading and writing from the same table, since you are currently reading, writing, deleting, and updating the same destination table (see the sketch after this list)
  • Use partitioning in the destination table, which allows you to delete and update records in a specific partition instead of the entire table
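
A rough sketch of the staging approach, assuming a SQL Server destination (dbo.Staging, dbo.Target, and the key/column names are illustrative): the data flow only bulk-loads the source rows into an empty staging table, and an Execute SQL Task then applies the changes set-based on the server:

-- apply the upsert/delete in one set-based statement on the server
MERGE dbo.Target AS t
USING dbo.Staging AS s
    ON t.KeyCol = s.KeyCol
WHEN MATCHED THEN                   -- row exists: update it
    UPDATE SET t.Col1 = s.Col1, t.Col2 = s.Col2
WHEN NOT MATCHED BY TARGET THEN     -- new row: insert it
    INSERT (KeyCol, Col1, Col2) VALUES (s.KeyCol, s.Col1, s.Col2)
WHEN NOT MATCHED BY SOURCE THEN     -- row missing from source: delete it
    DELETE;

TRUNCATE TABLE dbo.Staging;         -- empty staging for the next run

Note that the WHEN NOT MATCHED BY SOURCE ... DELETE branch only makes sense when the staging table holds the full source extract; if you load in chunks, handle deletes separately.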
Hadi
  • Sorry for the late response. I've been trying this out for a couple of days now and sure enough it made a difference, but it's still too slow and I still get failures. Are there any more ways to optimize the flows for handling millions of rows? This doesn't make the whole cut unfortunately... – John May 02 '19 at 15:57
  • @John I updated my answer to give some recommendations. Hope it will help – Hadi May 02 '19 at 21:19
  • So I am trying the "staging table" route, and as I've read up a bit on this type of solution I have noticed some things wrong with my solution. From what I have found, just as you said @Hadi, I do most of my processing in the package; I should be doing it on the server in combination with the staging table. One major flaw I found was that most big-data-set ETL jobs use OLE DB providers, as they are much faster than the ADO.NET providers that I use. The thing is, MySQL/MariaDB doesn't provide an OLE DB data provider, only an ODBC one and a .NET connector one, which I am using. What do you think? – John May 03 '19 at 21:06
  • Try using [ODBC destination](https://docs.microsoft.com/en-us/sql/integration-services/data-flow/odbc-destination?view=sql-server-2017) instead of ADO – Hadi May 03 '19 at 21:24
  • I used the ODBC destination from both MySQL and MariaDB, and for some reason their data providers mess up the "decimal" datatype. We had a problem in which the data provider read the decimal(8,1) datatype as decimal(81), which messed up the datatype limit and crashed our package in the first data flow component (when selecting data from the source). This only occurred in VS SSIS; when I tried it in Excel there was no problem fetching data from the source. – John May 03 '19 at 21:40
  • By the way, does the source matter regarding which data provider I use, or is it just the destination that matters regarding speed (not counting network latency)? – John May 04 '19 at 10:23
  • Both of them matter – Hadi May 04 '19 at 10:31
  • @John check the following link https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008/dd425070(v%3dsql.100)#performance-counters – Hadi May 04 '19 at 10:32