13

I'm using LINQ to compare two DataSets with each other to create new rows and update existing. I've noticed that the complete comparison lasts ~1,5 hours and only one of the two cores is busy(Task-Manager is 50-52% CPU Usage). I must admit that I'm completely new to parallel LINQ, but I assume that it could increase performance significantly.

So my question is, how and what should I parallelize?

These are the original queries(reduced to the essentials):

'check for new data
Dim srcUnique = From row In src.Email_Total
                Select Ticket_ID = row.ticket_id, Interaction = row.interaction, ModifiedAt = row.modified_time

Dim destUnique = From row In dest.ContactDetail
                 Where row.ContactRow.fiContactType = emailContactType.idContactType
                 Select row.ContactRow.Ticket_ID, row.Interaction, row.ModifiedAt

'get all emails(contactdetails) that are in source but not in destination
Dim diffRows = srcUnique.Except(destUnique).ToList

'get all new emails(according to ticket_id) for calculating contact columns
Dim newRowsTickets = (From row In src.Email_Total
                     Join d In diffRows
                     On row.ticket_id Equals d.Ticket_ID _
                     And row.interaction Equals d.Interaction _
                     And row.modified_time Equals d.ModifiedAt
                     Group row By Ticket_ID = row.ticket_id Into NewTicketRows = Group).ToList

For Each ticket In newRowsTickets
     Dim contact = dest.Contact.FindByTicket_IDfiContactType(ticket.Ticket_ID, emailContactType.idContactType)
     If contact Is Nothing Then
          ' Create new Contact with many sub-queries on this ticket(omitted) ****'
          Dim newContact = Me.dest.Contact.NewContactRow
          dest.Contact.AddContactRow(newContact)
          contact = newContact
     Else
          ' Update Contact with many sub-queries on this ticket(omitted) '
     End If
     daContact.Update(dest.Contact)

     ' Add new ContactDetail-Rows from this Ticket(this is the counterpart of the src.Email_Total-Rows, details omitted) '
     For Each newRow In ticket.NewTicketRows
         Dim newContactDetail = dest.ContactDetail.NewContactDetailRow
         newContactDetail.ContactRow = contact
         dest.ContactDetail.AddContactDetailRow(newContactDetail)
     Next
     daContactDetails.Update(dest.ContactDetail)
Next

Note: daContact and daContactDetails are SqlDataAdapters, source and dest are DataSets and Contact and ContactDetail are DataTables, where every ContactDetail belongs to a Contact.

Even if not both cores would use 100% CPU, I assume that it would increase performance significantly if I would parallelize the queries, because the second core is nearly idle. The for each might also be a good place to optimize since the tickets are not related to each other. So I assume that I could loop with multiple threads and create/update records parallel. But how to do it with PLINQ?

Side Note: As I've mentioned in the comments, performance is not a key factor for me so far, since the server's only purpose is to synchronize the MySQL Database(on another server) with a MS SQL-Server(on the same server as this Windows-Service). It acts as a source for reports that are generated by another service. But these reports are only generated once a day. But apart from that I was interested in learning PLINQ because I thought that this could be an excellent exercise. It takes the mentioned 1,5h only if destination DB is empty and all records must be created. If both databases are nearly in sync, this method takes only ~1 minute yet. In future performance will become more important since email is only one of several contact-types(chat+calls will exceed 1mil.records). I think that I'll anyway need some kind of (LINQ) Data-Paging then.

If something is unclear I'll update my answer accordingly. Thanks in advance.


Edit: Here is the result of my investigations and attempts:

Question: How to "PLINQ" an existing LINQ query with joins?

Answer: Note that some LINQ operators are binary—they take two IEnumerables as input. Join is a perfect example of such an operator. In these cases, the type of the left-most data source determines whether LINQ or PLINQ is used. Thus you need only call AsParallel on the first data source for your query to run in parallel:

IEnumerable<T> leftData = ..., rightData = ...;
var q = from x in leftData.AsParallel()
        join y in rightData on x.a == y.b
        select f(x, y);

But if I change my query in the following way(note the AsParallel):

Dim newRowsTickets = (From row In src.Email_Total.AsParallel()
                                        Join d In diffRows
                                        On row.ticket_id Equals d.Ticket_ID _
                                        And row.interaction Equals d.Interaction _
                                        And row.modified_time Equals d.ModifiedAt
                                    Group row By Ticket_ID = row.ticket_id Into NewTicketRows = Group).ToList

The compiler will complain that I need to add AsParallel to the right datasource as well. So this seem to be a VB.NET issue or a lack of documentation(article is from 2007). I assume the latter because the(apart from that recommendable) article also says that you need to add System.Concurrency.dll manually but actually it is part of .NET 4.0 Framework and in Namespace Sytem.Threading.Tasks.

I realized that I won't profit from a parallelized Except since the query is fast enough in sequential mode(even with nearly the same number of rows in both collection which results in the maximum number of comparisons, I got the result in less than 30 seconds). But I will add it for the sake of completeness later.

So I decided to parallelize the for-each what is as easy as with LINQ-Queries, you simply need to add AsParallel() at the end. But I realized that I need to force the parallelism with WithExecutionMode(ParallelExecutionMode.ForceParallelism), otherwise .NET decides to use only one core for this loop. I also wanted to tell .NET that I wish to use as many Threads as possible but not more than 8: WithDegreeOfParallelism(8).

Now both cores are working at the same time, but the CPU usage stays on 54%.

So this is the PLINQ version so far:

Dim diffRows = srcUnique.AsParallel.Except(destUnique.AsParallel).ToList

Dim newRowsTickets = (From row In src.Email_Total.AsParallel()
                        Join d In diffRows.AsParallel()
                        On row.ticket_id Equals d.Ticket_ID _
                        And row.interaction Equals d.Interaction _
                        And row.modified_time Equals d.ModifiedAt
                    Group row By Ticket_ID = row.ticket_id Into NewTicketRows = Group).ToList

For Each ticket In newRowsTickets.
                    AsParallel().
                      WithDegreeOfParallelism(8).
                       WithExecutionMode(ParallelExecutionMode.ForceParallelism)
    '  blah,blah ...  '

    'add new ContactDetails for this Ticket(only new rows)
    For Each newRow In ticket.NewTicketRows.
                                AsParallel().
                                    WithExecutionMode(ParallelExecutionMode.Default)
        ' blah,blah ... '
    Next
    daContactDetails.Update(dest.ContactDetail)
Next

Unfortunately I don't see any performance benefits from using AsParallel in comparison with sequential mode:

The for each with AsParallel(hh:mm:ss.mm):

09/29/2011 18:54:36: Contacts/ContactDetails created or modified. Duration: 01:21:34.40

And without:

09/29/2011 16:02:55: Contacts/ContactDetails created or modified. Duration: 01:21:24.50

Can somebody explain me this result? Is the database' write access in the for each responsible for the similar time?


Following are recommendable readings:

greatwolf
  • 18,899
  • 13
  • 64
  • 102
Tim Schmelter
  • 411,418
  • 61
  • 614
  • 859
  • I assume your enumerations / collections are LINQ-to-Objects and not LINQ-to-SQL entities? Have you tried just adding .AsParallel() to the end of the collections? – Neil Fenwick Sep 28 '11 at 11:49
  • @Neil: Yes, it's LINQ-To-Dataset a subform of LINQ-To-Objects(added tag). I've just noticed the possibility to optimize it with PLINQ, so i haven't tested much by now. I already realized that i simply could as `.AsParallel` to the end of the collections, but because i'm new to this topic i didn't know which way is the best. I could surround the whole queries with `.AsParaLell` or only the datasources in it, what's the difference? – Tim Schmelter Sep 28 '11 at 11:58
  • @TimSchmelter have you timed each section of the Linq, or is it the object creation in the foreach section that is taking time? – msarchet Sep 28 '11 at 14:11
  • @msarchet: I must admit that i didn't have the time yet to measure the time so i can only guess. Currently there are ~250.000 rows in the source DataTable(loaded from MySQL, will be inserted into MS SQL-Server after comparison,normalization etc.). It depends on how often the Synchronization between both Databases with this Windows-Service happens, therefore how many rows must be created/updated. With an empty destination DataTable the for-each would take more than 90% of the time, when both are nearly in-sync the comparison would take more than 90%. – Tim Schmelter Sep 28 '11 at 14:37
  • @msarchet: In production the latter would be standard, but i would like to know if i could also simply optimize the `for-each`. Besides, the abovementioned 1,5h are the time i would need for a complete import, normally it takes only a minute if the service is triggered every hour and searches new rows. So this question is somewhat theoretical, because i thought this could be a good exercise to learn PLINQ ;) – Tim Schmelter Sep 28 '11 at 14:40
  • Does the row `Dim diffRows = srcUnique.Except(destUnique).ToList`really take 1,5h to execute? – Magnus Sep 28 '11 at 17:48
  • Have you identified which parts that takes to most time? – Magnus Sep 28 '11 at 18:00
  • @Magnus: No, the complete method inclusive the `for each` and only if the destination DataTable is empty, as i've mentioned in the last comments. In future that will become more important since email is only one of several contact-types(chat+calls will exceed 1mil.records). I havent yet measured the exact time(see also last comment). – Tim Schmelter Sep 28 '11 at 18:03
  • Consider left joning `newRowsTickets` with `dest.Contact` instead of using the lookup function before doing the loop. – Magnus Sep 28 '11 at 18:22
  • One thing to keep in mind with LINQ to Datasets is that using Joins you are doing in memory table scans for each row, but if you use the native Dataset associations, they do hash lookups when navigating from the parent to the child. You may get better performance increase using the native dataset functions rather than LINQ. That being said, I would highly recommend pre-filtering your result sets and not bringing the entire database into memory before querying with LINQ. – Jim Wooley Sep 28 '11 at 21:00
  • 2
    @Jim: A Join also uses a Hash algorithm to link tables, see [my question on this issue](http://stackoverflow.com/questions/5551264/why-is-linq-join-so-much-faster-than-linking-with-where) and Guffa's & Thomas' answers. The problem with prefiltering the datasource is that it's an unnormalised MySQL database without foreignkeys(MyISAM) that is not under my control. The source also contains a lot of inconsistencies, duplicates and the data is spread over multiple tables(Open_Mails,Closed_Mails,etc). This was one reason why i decided to use my own(clean) database and develop this sync-service. – Tim Schmelter Sep 28 '11 at 21:35

1 Answers1

1

There are 3 points worth investigating further,

  1. Do not use .toList(). I might be wrong but I think using .ToList this way would not allow the compiler to optimize the query, if further optimization was possible.
  2. Use your own filtering operation to compare data from both destionations. It might give you better performance.
  3. See if you could use LinqDataview to provide better performance.

    I dont think you will gain an advantage of PLinq while doing insertion. Look at this answer for more details.

Hope that helps. Please do ask if you need clarification on any of the above points.

Community
  • 1
  • 1
Ali Khalid
  • 1,287
  • 8
  • 19