
I have a background task (started by an IHostedService) that runs every 7 days. The task copies data within the same table, changing only the PKs, to generate dummy data (it's used for demos).

The issue is that when it tries to save the new data for the second table (approx. 2k new records), Entity Framework never completes SaveChangesAsync: the process blocks and starts consuming all available RAM.

Note: The first table sometimes copies more than 20k new records.

This is my current code; I'm using Entity Framework Core 5.0.4 with .NET Core 3.1:

using (var context = _context.CreateNewInstance())
{
    // Students already copied to the destination (demo) university
    var existingStudents = context.Students.Where(s => s.UniversityId == _destUniversity.Id);

    // Students from the source university, projected into new entities
    var sourceStudents = context.Students.Where(s => s.UniversityId == _sourceUniversity.Id)
                                         .Select(s => new Student()
                                         {
                                             //...properties
                                         });

    // Only the students that don't exist in the destination yet
    var newStudents = sourceStudents.Where(s => !existingStudents.Any(es => es.DiffKey == s.DiffKey)).ToArray();

    if (newStudents.Length == 0)
    {
        return;
    }

    await context.Students.AddRangeAsync(newStudents);
    await context.SaveChangesAsync(_cancellationToken.Token);
}

I've tried all of the following:

  1. Disabling "Auto Detect Changes" (items 1 and 2 are sketched after this list)
  2. Batch insert: this only works for the first batch.
  3. Using IEnumerable instead of IQueryable
  4. A single context for all tables in the task
  5. A new context instance for each table
  6. Comparing the configuration of the databases/tables and replicating it locally
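
For reference, items 1 and 2 looked roughly like this (a minimal sketch; the batch size of 500 is illustrative):

// Sketch of items 1 and 2: turn off change detection and save in batches.
context.ChangeTracker.AutoDetectChangesEnabled = false;

const int batchSize = 500; // illustrative value

for (var i = 0; i < newStudents.Length; i += batchSize)
{
    var batch = newStudents.Skip(i).Take(batchSize);
    await context.Students.AddRangeAsync(batch);
    await context.SaveChangesAsync(_cancellationToken.Token);
}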

I have 2 servers (1 local, 1 remote) with 3 databases (1 local, 2 remote). The code works on two of the three databases (the local one and one remote).

I used this code to get the SPID of the task on the SQL Server side:

int spid;

using (var command = contextTmp.Database.GetDbConnection().CreateCommand())
{
    command.CommandText = "select @@spid as id";
    contextTmp.Database.OpenConnection();

    using (var result = command.ExecuteReader())
    {
        result.Read();
        // @@spid is a smallint; DbDataReader.GetInt16 takes a column ordinal
        spid = result.GetInt16(result.GetOrdinal("id"));
    }

    contextTmp.Database.CloseConnection();
}

With the SPID from the previous code, I monitored the requests on the SQL Server side to check their status:

select session_id, 
    status, 
    command,
    blocking_session_id,
    wait_type, 
    wait_time,
    last_wait_type,
    wait_resource
from sys.dm_exec_requests 
where session_id = @id
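
This is roughly how that query can be run from C# with the captured SPID as a parameter (a sketch; monitoringContext stands for a second context/connection used only for monitoring):

// Sketch: query sys.dm_exec_requests from a separate monitoring connection,
// passing the SPID captured above as a parameter.
using (var command = monitoringContext.Database.GetDbConnection().CreateCommand())
{
    command.CommandText = @"select session_id, status, command, blocking_session_id,
    wait_type, wait_time, last_wait_type, wait_resource
from sys.dm_exec_requests
where session_id = @id";

    var idParameter = command.CreateParameter();
    idParameter.ParameterName = "@id";
    idParameter.Value = spid;
    command.Parameters.Add(idParameter);

    monitoringContext.Database.OpenConnection();

    using (var result = command.ExecuteReader())
    {
        while (result.Read())
        {
            Console.WriteLine($"{result["session_id"]}: {result["status"]}, wait: {result["wait_type"]}");
        }
    }

    monitoringContext.Database.CloseConnection();
}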

Result:

session_id | status    | command | blocking_session_id | wait_type        | wait_time | last_wait_type   | wait_resource
84         | suspended | SELECT  | 0                   | ASYNC_NETWORK_IO | 35        | ASYNC_NETWORK_IO |

I've read blog posts about this issue but I don't get it yet; I think I'm missing something. I've run out of ideas.


UPDATE(1):

Before we updated from .NET Core 2.2 to 3.1, all functions worked well. We started to see strange behaviors after the upgrade.

Any suggestions?

– Tecnologer
  • FYI - that's not a deadlock. That's straight blocking. A deadlock would be resolved very quickly, with one process being rolled back. – Dale K Mar 29 '21 at 19:08
  • good point, thank you. so, what is it? – Tecnologer Mar 29 '21 at 19:10
  • As I said that is straight blocking. – Dale K Mar 29 '21 at 19:11
  • oh ok, that's new for me, let me read something about it – Tecnologer Mar 29 '21 at 19:13
  • I think the key is that the batch works the first time. Likely a single store command is executed on each attempt? Likely the method isn't properly "closing" after execution, and the issue is that Entity Framework still thinks the table is being worked on, thus blocking another async process from making changes. Try doing the store without async and see if that solves your problem; then you'll know for sure whether it is the cause or not. – Morten Bork Mar 29 '21 at 19:41
  • I've tried with sync methods, the same issue :( – Tecnologer Mar 29 '21 at 20:15
  • What is it being blocked by (although ASYNC_NETWORK_IO usually indicates a lot of data being downloaded over a slow connection)? Are you running that code in a loop or is there another query that also runs? – Charlieface Mar 29 '21 at 20:58
  • I'm running this code from `DoWork` of `Threading.Timer` which is created by `IHostedService`. On my side, I only run this code once per week, one instance at a time, but there may be other applications trying to read/write into this table. – Tecnologer Mar 29 '21 at 21:06

2 Answers


Entity Framework is not good at inserting things in bulk. If an entity has an auto-generated primary key, then after each insert, EF will query the database to get the id of the entry. That means that for each of your 20K records, you have to wait for the round-trip time of the database. That's why everything appears to stop working when you call SaveChanges() - it's still running, but it will take a long time. There are a few ways to solve this:

  • Change the id to be client generated (see the first sketch below)
  • Use an extension library like EntityFrameworkPlus to perform a bulk insert
  • Use SqlBulkCopy instead of Entity Framework (see the second sketch below)
  • Use a stored procedure
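
For the first option, client-generated keys can be configured in OnModelCreating. A minimal sketch, assuming the key property is called Id:

// Sketch: tell EF Core the key is supplied by the client, not the database,
// so SaveChanges doesn't have to read generated values back for every row.
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<Student>()
                .Property(s => s.Id) // assumes the key property is named Id
                .ValueGeneratedNever();
}

For the SqlBulkCopy option, a sketch along these lines (assuming Microsoft.Data.SqlClient; the column names and types are illustrative and must match the real table):

// Sketch: bypass EF's change tracker entirely and bulk-insert the rows.
var table = new DataTable();
table.Columns.Add("Id", typeof(int));
table.Columns.Add("UniversityId", typeof(int));
table.Columns.Add("DiffKey", typeof(string));
// ...one column per mapped property

foreach (var s in newStudents)
{
    table.Rows.Add(s.Id, s.UniversityId, s.DiffKey /*, ... */);
}

using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.Students";
    await bulk.WriteToServerAsync(table, _cancellationToken.Token);
}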
– Andrew Williamson
  • I've changed the order of the tables, so the second table is now first, with the purpose of avoiding the 20k-record insert beforehand, but the problem is still there – Tecnologer Mar 29 '21 at 20:21
  • I'm not sure what you mean by this. Are you still adding `newStudents` to the `Students` table? If so, how many records are you trying to add? That is the core problem in your original question, so if that has changed, you need to update the question with more details – Andrew Williamson Mar 29 '21 at 20:28
  • oh sorry, let me explain: I'm inserting data into many tables, and Students was the second one. The first table sometimes inserts more than 20k records, but not always (with it I have no issues). So I've changed the order of the tables; now I add `newStudents` to `Students` before trying to insert anything into any other table. – Tecnologer Mar 29 '21 at 20:41
  • CDC is part of the problem; if I disable it, everything works correctly. But it's required, so I changed the Id (and other properties) to be generated on the client side, and that solved the issue for now. Thanks, man. – Tecnologer Apr 01 '21 at 14:29

Ideally your query should not need to transfer data to the client at all; it can run as a single INSERT ... FROM. I propose using the third-party extension linq2db.EntityFrameworkCore (disclaimer: I'm one of its creators).

Then your query will be executed on the server side almost immediately:

using (var context = _context.CreateNewInstance())
{
    var existingStudents = context.Students.Where(s => s.UniversityId == _destUniversity.Id);
    var sourceStudents = context.Students.Where(s => s.UniversityId == _sourceUniversity.Id)
                                         .Select(s => new Student()
                                         {
                                             //...properties
                                         });

    var newStudents = sourceStudents.Where(s => !existingStudents.Any(es => es.DiffKey == s.DiffKey));

    await newStudents.InsertAsync(context.Students.ToLinqToDBTable(), x => x, _cancellationToken.Token);
}
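
This should translate to a single INSERT ... SELECT (with the NOT EXISTS filter) executed entirely on the server, so no student entities are materialized on the client.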
– Svyatoslav Danyliv