24

I want to find all items in one collection that do not match another collection. The collections are not of the same type, though; I want to write a lambda expression to specify equality.

A LINQPad example of what I'm trying to do:

void Main()
{
    var employees = new[]
    {
        new Employee { Id = 20, Name = "Bob" },
        new Employee { Id = 10, Name = "Bill" },
        new Employee { Id = 30, Name = "Frank" }
    };

    var managers = new[]
    {
        new Manager { EmployeeId = 20 },
        new Manager { EmployeeId = 30 }
    };

    var nonManagers =
    from employee in employees
    where !(managers.Any(x => x.EmployeeId == employee.Id))
    select employee;

    nonManagers.Dump();

    // Based on cdonner's answer:

    var nonManagers2 =
    from employee in employees
    join manager in managers
        on employee.Id equals manager.EmployeeId
    into tempManagers
    from manager in tempManagers.DefaultIfEmpty()
    where manager == null
    select employee;

    nonManagers2.Dump();

    // Based on Richard Hein's answer:

    var nonManagers3 =
    employees.Except(
        from employee in employees
        join manager in managers
            on employee.Id equals manager.EmployeeId
        select employee);

    nonManagers3.Dump();
}

public class Employee
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class Manager
{
    public int EmployeeId { get; set; }
}

The above works, and will return Employee Bill (#10). It does not seem elegant, though, and it may be inefficient with larger collections. In SQL I'd probably do a LEFT JOIN and find items where the second ID was NULL. What's the best practice for doing this in LINQ?

EDIT: Updated to prevent solutions that depend on the Id equaling the index.

EDIT: Added cdonner's solution - anybody have anything simpler?

EDIT: Added a variant on Richard Hein's answer, my current favorite. Thanks to everyone for some excellent answers!

TrueWill

8 Answers

31

This is almost the same as some other examples but less code:

employees.Except(employees.Join(managers, e => e.Id, m => m.EmployeeId, (e, m) => e));

It's not any simpler than employees.Where(e => !managers.Any(m => m.EmployeeId == e.Id)) or your original syntax, however.

DoctorFoo
  • Actually I like this better than the other solutions - I find its meaning clearer. I rewrote the join in query syntax (see the revised sample code in my question) out of personal preference. Thank you! – TrueWill Oct 31 '09 at 17:17
  • When a big collection is involved, Except is too slow. The join answer is the best. – L.T. Mar 16 '16 at 15:19
5
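    /// <summary>
    /// Returns the items in a sequence that have no match, by key, in
    /// another sequence of a different type.
    /// </summary>
    /// <typeparam name="T">Type of the items to filter.</typeparam>
    /// <typeparam name="TOther">Type of the items to exclude by.</typeparam>
    /// <typeparam name="TKey">Type of the key both sequences are compared on.</typeparam>
    /// <param name="items">The sequence to filter.</param>
    /// <param name="other">The sequence whose keys are excluded.</param>
    /// <param name="getItemKey">Extracts the key from an item.</param>
    /// <param name="getOtherKey">Extracts the key from an item of the other sequence.</param>
    /// <returns>The items whose key does not appear in <paramref name="other"/>.</returns>
    public static IEnumerable<T> Except<T, TOther, TKey>(
                                           this IEnumerable<T> items,
                                           IEnumerable<TOther> other,
                                           Func<T, TKey> getItemKey,
                                           Func<TOther, TKey> getOtherKey)
    {
        // Group join: 'matches' is empty when nothing in 'other' has the same key.
        // Testing for an empty group avoids comparing against default(TOther),
        // which misreports a genuine match when TOther is a value type.
        return from item in items
               join otherItem in other on getItemKey(item)
               equals getOtherKey(otherItem) into matches
               where !matches.Any()
               select item;
    }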

I don't remember where I found this method.
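
For reference, here is a minimal sketch of how such a method might be called on the question's sample data. It carries its own copy of the extension (using an empty-group test for "no match"), and the `EnumerableExtensions`, `Program`, and `FindNonManagers` names are assumptions made for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Employee
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class Manager
{
    public int EmployeeId { get; set; }
}

// Hypothetical container class for the extension method.
public static class EnumerableExtensions
{
    // Keeps the items whose key has no counterpart in 'other'.
    public static IEnumerable<T> Except<T, TOther, TKey>(
        this IEnumerable<T> items,
        IEnumerable<TOther> other,
        Func<T, TKey> getItemKey,
        Func<TOther, TKey> getOtherKey)
    {
        return from item in items
               join otherItem in other
                   on getItemKey(item) equals getOtherKey(otherItem) into matches
               where !matches.Any()
               select item;
    }
}

public static class Program
{
    // Hypothetical helper that runs the query on the question's sample data.
    public static List<Employee> FindNonManagers()
    {
        var employees = new[]
        {
            new Employee { Id = 20, Name = "Bob" },
            new Employee { Id = 10, Name = "Bill" },
            new Employee { Id = 30, Name = "Frank" }
        };

        var managers = new[]
        {
            new Manager { EmployeeId = 20 },
            new Manager { EmployeeId = 30 }
        };

        // Four arguments in total, so this binds to the extension above
        // rather than to Enumerable.Except.
        return employees.Except(managers, e => e.Id, m => m.EmployeeId).ToList();
    }

    public static void Main()
    {
        foreach (var employee in FindNonManagers())
        {
            Console.WriteLine(employee.Id + " " + employee.Name);
        }
    }
}
```

Because the key selectors are passed in, the same call shape should work for any pair of types that share a comparable key.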

cdonner
  • +1 - Nice. I modified this slightly and incorporated it in my question. I want to see what others come up with, though. Thanks! – TrueWill Oct 30 '09 at 14:53
5

         var nonManagers = ( from e1 in employees
                             select e1 ).Except(
                                   from m in managers
                                   from e2 in employees
                                   where m.EmployeeId == e2.Id
                                   select e2 );
4

It's a bit late (I know).

I was looking at the same problem and was considering a HashSet because of various performance hints in that direction, including @Skeet's Intersection of multiple lists with IEnumerable.Intersect(). I asked around my office, and the consensus was that a HashSet would be faster and more readable:

HashSet<int> managerIds = new HashSet<int>(managers.Select(x => x.EmployeeId));
var nonManagers4 = employees.Where(x => !managerIds.Contains(x.Id)).ToList();
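
If this lookup is needed in more than one place, it could be wrapped in a small reusable helper. The following is only a sketch under that assumption: `WhereKeyNotIn`, `KeyFilterExtensions`, and `FindNonManagers` are hypothetical names, not an existing API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Employee
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class Manager
{
    public int EmployeeId { get; set; }
}

// Hypothetical helper class; the name is an assumption for this sketch.
public static class KeyFilterExtensions
{
    // Builds the HashSet once when called, then filters lazily.
    // Each Contains call is a hash lookup rather than a scan of the other list.
    public static IEnumerable<T> WhereKeyNotIn<T, TKey>(
        this IEnumerable<T> items,
        IEnumerable<TKey> excludedKeys,
        Func<T, TKey> getKey)
    {
        var excluded = new HashSet<TKey>(excludedKeys);
        return items.Where(item => !excluded.Contains(getKey(item)));
    }
}

public static class Program
{
    // Hypothetical helper that runs the filter on the question's sample data.
    public static List<Employee> FindNonManagers()
    {
        var employees = new[]
        {
            new Employee { Id = 20, Name = "Bob" },
            new Employee { Id = 10, Name = "Bill" },
            new Employee { Id = 30, Name = "Frank" }
        };

        var managers = new[]
        {
            new Manager { EmployeeId = 20 },
            new Manager { EmployeeId = 30 }
        };

        return employees
            .WhereKeyNotIn(managers.Select(m => m.EmployeeId), e => e.Id)
            .ToList();
    }

    public static void Main()
    {
        foreach (var employee in FindNonManagers())
        {
            Console.WriteLine(employee.Id + " " + employee.Name);
        }
    }
}
```

Unlike a set-based Except, this keeps duplicates and the original order of the filtered sequence. On .NET 6 or later, the built-in Enumerable.ExceptBy accepts a similar pair of excluded keys and key selector, though it returns only one item per distinct key.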

Then I was offered an even faster solution that uses a native array as a bit-mask-style lookup (although the syntax of the native array queries would put me off using them except for extreme performance reasons).

To give this answer a little credence after an awfully long time, I've extended your LINQPad program and data with timings so you can compare what are now six options:

void Main()
{
    var employees = new[]
    {
        new Employee { Id = 20, Name = "Bob" },
        new Employee { Id = 10, Name = "Kirk NM" },
        new Employee { Id = 48, Name = "Rick NM" },
        new Employee { Id = 42, Name = "Dick" },
        new Employee { Id = 43, Name = "Harry" },
        new Employee { Id = 44, Name = "Joe" },
        new Employee { Id = 45, Name = "Steve NM" },
        new Employee { Id = 46, Name = "Jim NM" },
        new Employee { Id = 30, Name = "Frank"},
        new Employee { Id = 47, Name = "Dave NM" },
        new Employee { Id = 49, Name = "Alex NM" },
        new Employee { Id = 50, Name = "Phil NM" },
        new Employee { Id = 51, Name = "Ed NM" },
        new Employee { Id = 52, Name = "Ollie NM" },
        new Employee { Id = 41, Name = "Bill" },
        new Employee { Id = 53, Name = "John NM" },
        new Employee { Id = 54, Name = "Simon NM" }
    };

    var managers = new[]
    {
        new Manager { EmployeeId = 20 },
        new Manager { EmployeeId = 30 },
        new Manager { EmployeeId = 41 },
        new Manager { EmployeeId = 42 },
        new Manager { EmployeeId = 43 },
        new Manager { EmployeeId = 44 }
    };

    System.Diagnostics.Stopwatch watch1 = new System.Diagnostics.Stopwatch();

    int max = 1000000;

    watch1.Start();
    List<Employee> nonManagers1 = new List<Employee>();
    foreach (var item in Enumerable.Range(1,max))
    {
        nonManagers1 = (from employee in employees where !(managers.Any(x => x.EmployeeId == employee.Id)) select employee).ToList();

    }
    watch1.Stop(); // stop before Dump so output rendering is not timed
    nonManagers1.Dump();
    Console.WriteLine("Any: " + watch1.ElapsedMilliseconds);

    watch1.Restart();       
    List<Employee> nonManagers2 = new List<Employee>();
    foreach (var item in Enumerable.Range(1,max))
    {
        nonManagers2 =
        (from employee in employees
        join manager in managers
            on employee.Id equals manager.EmployeeId
        into tempManagers
        from manager in tempManagers.DefaultIfEmpty()
        where manager == null
        select employee).ToList();
    }
    watch1.Stop(); // stop before Dump so output rendering is not timed
    nonManagers2.Dump();
    Console.WriteLine("temp table: " + watch1.ElapsedMilliseconds);

    watch1.Restart();       
    List<Employee> nonManagers3 = new List<Employee>();
    foreach (var item in Enumerable.Range(1,max))
    {
        nonManagers3 = employees.Except(employees.Join(managers, e => e.Id, m => m.EmployeeId, (e, m) => e)).ToList();
    }
    watch1.Stop(); // stop before Dump so output rendering is not timed
    nonManagers3.Dump();
    Console.WriteLine("Except: " + watch1.ElapsedMilliseconds);

    watch1.Restart();       
    List<Employee> nonManagers4 = new List<Employee>();
    foreach (var item in Enumerable.Range(1,max))
    {
        HashSet<int> managerIds = new HashSet<int>(managers.Select(x => x.EmployeeId));
        nonManagers4 = employees.Where(x => !managerIds.Contains(x.Id)).ToList();

    }
    watch1.Stop(); // stop before Dump so output rendering is not timed
    nonManagers4.Dump();
    Console.WriteLine("HashSet: " + watch1.ElapsedMilliseconds);

    watch1.Restart();
    List<Employee> nonManagers5 = new List<Employee>();
    foreach (var item in Enumerable.Range(1, max))
    {
        // Lookup table indexed by manager ID; assumes IDs are small and non-negative.
        bool[] test = new bool[managers.Max(x => x.EmployeeId) + 1];
        foreach (var manager in managers)
        {
            test[manager.EmployeeId] = true;
        }

        nonManagers5 = employees.Where(x => x.Id >= test.Length || !test[x.Id]).ToList();
    }
    watch1.Stop(); // stop before Dump so output rendering is not timed
    nonManagers5.Dump();
    Console.WriteLine("Native array call: " + watch1.ElapsedMilliseconds);

    // Repeats the bool-array lookup from the previous block.
    watch1.Restart();
    List<Employee> nonManagers6 = new List<Employee>();
    foreach (var item in Enumerable.Range(1, max))
    {
        bool[] test = new bool[managers.Max(x => x.EmployeeId) + 1];
        foreach (var manager in managers)
        {
            test[manager.EmployeeId] = true;
        }

        nonManagers6 = employees.Where(x => x.Id >= test.Length || !test[x.Id]).ToList();
    }
    watch1.Stop(); // stop before Dump so output rendering is not timed
    nonManagers6.Dump();
    Console.WriteLine("Native array call 2: " + watch1.ElapsedMilliseconds);
}

public class Employee
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class Manager
{
    public int EmployeeId { get; set; }
}
amelvin
  • Nice data! Thank you! – TrueWill Feb 24 '14 at 14:34
  • If the IDs of your employees and managers are very high, say, in the 100,000s, your sparse array solutions are going to royally barf. There's nothing that says IDs can't be this high--they're ints, and I think it's better to write code that doesn't have strange edge cases like that. – ErikE Sep 05 '14 at 04:28
  • @ErikE I'm not sure what you are driving at. The OP provided the data as part of the question & I've timed 6 alternative ways of processing that data. If the data was different then a different option may be more optimized. Is there an answer that will work best with every conceivable data set? If there is I'd really appreciate it if you laid it out so that I could use it in future. – amelvin Sep 08 '14 at 09:45
  • Imagine if the sparse array solution got an `Int32.Max` value, suddenly causing it to take 8 Gb of memory? The HashSet solution is excellent, and won't blow up if you get spurious high ID values. When you say "the data was different", I think that should not include values within the domain of the declared (or implicit) data types. Unless otherwise specified as smallint (for example), IDs should be assumed to be 4-byte long integers (signed Int32). Any value within the domain of the data type is NOT "different data". – ErikE Sep 08 '14 at 18:39
3
var nonmanagers = employees.Select(e => e.Id)
    .Except(managers.Select(m => m.EmployeeId))
    .Select(id => employees.Single(e => e.Id == id));
G-Wiz
  • There is no guarantee that the EmployeeId will match the employee index in the array... – Thomas Levesque Oct 30 '09 at 13:04
  • Nice idea - I didn't think of selecting the IDs so that Except with the default equality comparer would compare integers. However Mr. Levesque is correct, and I've updated the example to reflect this. Can you provide an example that correctly returns the employees? – TrueWill Oct 30 '09 at 14:38
  • (I deleted my previous comment - gWiz is right; this will work.) – TrueWill Oct 31 '09 at 17:22
  • This will have much worse performance than other methods because the `employees.Single` is `O(n)`. See [amelvin's answer](http://stackoverflow.com/a/9622485/57611) (ignore the second solution, and use the HashSet one, it's superior). – ErikE Nov 30 '15 at 19:11
2

Have a look at the Except() LINQ function. It does exactly what you need.

nitzmahone
  • The Except function only works with two sets of the same object type, so it would not directly apply to his example with employees and managers. Hence the overloaded method in my answer. – cdonner Oct 31 '09 at 00:43
1

It's better to left join the items and filter on the null condition:

var finalcertificates = (from globCert in resultCertificate
                         join toExcludeCert in certificatesToExclude
                             on globCert.CertificateId equals toExcludeCert.CertificateId into certs
                         from toExcludeCert in certs.DefaultIfEmpty()
                         where toExcludeCert == null
                         select globCert)
                        .Union(currentCertificate)
                        .Distinct()
                        .OrderBy(cert => cert.CertificateName);
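
As a rough illustration, the same left-join-and-filter idea can be applied to the question's employees and managers in method syntax with GroupJoin. This is a sketch on the question's sample data, not the certificate code above; `Program` and `FindNonManagers` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Employee
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class Manager
{
    public int EmployeeId { get; set; }
}

public static class Program
{
    // Hypothetical helper that runs the query on the question's sample data.
    public static List<Employee> FindNonManagers()
    {
        var employees = new[]
        {
            new Employee { Id = 20, Name = "Bob" },
            new Employee { Id = 10, Name = "Bill" },
            new Employee { Id = 30, Name = "Frank" }
        };

        var managers = new[]
        {
            new Manager { EmployeeId = 20 },
            new Manager { EmployeeId = 30 }
        };

        return employees
            // Pair every employee with the (possibly empty) group of matching managers.
            .GroupJoin(managers,
                       e => e.Id,
                       m => m.EmployeeId,
                       (employee, matches) => new { employee, matches })
            // An empty group plays the role of the NULL row in a SQL LEFT JOIN.
            .Where(pair => !pair.matches.Any())
            .Select(pair => pair.employee)
            .ToList();
    }

    public static void Main()
    {
        foreach (var employee in FindNonManagers())
        {
            Console.WriteLine(employee.Id + " " + employee.Name);
        }
    }
}
```

Checking for an empty group avoids the DefaultIfEmpty step and the null comparison, which may read more directly when only the unmatched items are wanted.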
Mahendra
0

Managers are employees, too! So the Manager class should subclass from the Employee class (or, if you don't like that, then they should both subclass from a parent class, or make a NonManager class).

Then your problem is as simple as implementing the IEquatable interface on your Employee superclass (for GetHashCode simply return the EmployeeID) and then using this code:

var nonManagerEmployees = employeeList.Except(managerList);
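
A minimal sketch of what that could look like, assuming Manager derives from Employee and that two employees count as equal when their Id matches. The class shapes differ from the question's (Manager here has no EmployeeId property), and `Program` and `FindNonManagers` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Employee : IEquatable<Employee>
{
    public int Id { get; set; }
    public string Name { get; set; }

    // Assumption for this sketch: identity is defined by Id alone.
    public bool Equals(Employee other)
    {
        return other != null && other.Id == Id;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Employee);
    }

    public override int GetHashCode()
    {
        return Id;
    }
}

// A manager "is an" employee, so it inherits Id, Name, and the equality rules.
public class Manager : Employee
{
}

public static class Program
{
    // Hypothetical helper that runs the query on data shaped like the question's.
    public static List<Employee> FindNonManagers()
    {
        var employeeList = new List<Employee>
        {
            new Employee { Id = 20, Name = "Bob" },
            new Employee { Id = 10, Name = "Bill" },
            new Employee { Id = 30, Name = "Frank" }
        };

        var managerList = new List<Manager>
        {
            new Manager { Id = 20, Name = "Bob" },
            new Manager { Id = 30, Name = "Frank" }
        };

        // IEnumerable<T> is covariant, so the manager list can be passed
        // where a sequence of Employee is expected.
        return employeeList.Except(managerList).ToList();
    }

    public static void Main()
    {
        foreach (var employee in FindNonManagers())
        {
            Console.WriteLine(employee.Id + " " + employee.Name);
        }
    }
}
```

Note that Except has set semantics, so employees sharing an Id would be collapsed into one result.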
ErikE
  • Good points; this was just a sanitized example though. The general problem of finding non-matches is a good one to solve. – TrueWill Nov 30 '15 at 21:40
  • This might be a good solution to many general problems, though! If two different objects can be merged in some way, then it is possible they share a relationship that could be expressed via superclass/subclass. In this case, a Manager has an "is-a" relationship with an Employee, so it makes perfect sense to use inheritance. "Has-a" relationships are less likely to be susceptible to my suggested solution (but that's not necessarily so, as lifecycles and roles can be tricky to model correctly and developers may miss "is-a" relationships sometimes). – ErikE Dec 01 '15 at 18:02