Should you check for a duplicate before inserting into a set

Question

I am learning to use sets. My question is : Sets do not contain duplicates. When we try to insert duplicates, it does not throw any error and automatically removes duplicates. Is it a good practice to check each value before inserting into set whether it exists or not? Or is it OK to do something like the below code? I think Java would be internally doing the check using .contains(value) . What do you think?

What would be the Big O complexity in both the cases considering there are n elements going into the set?

import java.util.HashSet;
import java.util.Set;

public class DuplicateTest {

    public static void main(String[] args) {
        // TODO Auto-generated method stub
         Set<Integer> mySet = new HashSet<Integer>();

         mySet.add(10);
         mySet.add(20);
         mySet.add(30);
         mySet.add(40);
         mySet.add(50);
         mySet.add(50);
         mySet.add(50);
         mySet.add(50);
         mySet.add(50);
         mySet.add(50);

         System.out.println("Contents of the Hash Set :"+mySet);
    }

}

Because a `HashSet` is backed by a `HashMap`, your answer can be found here: http://stackoverflow.com/a/4553642/4490686 — Sam Sun, Jan 10 '16 at 23:55
It doesn't do a `contains` instead it just won't add an element which is already there, i.e. it doesn't add any overhead to do this. — Peter Lawrey, Jan 11 '16 at 00:00
Just FYI you can't change Big Oh complexity by adding another operation with the same complexity as the one that is already applied. What I mean, is that both `for(int x: set) { set.add(x); }` and `for(int x:set) { set.contains(x); set.add(x); }` have the same Big Oh complexity, as long as `add` and `contains` have the same complexity. Because O(C*n) == O(n), for any constant number C. — user3707125, Jan 11 '16 at 00:25
It depends on what you want to do with your set. Do you want to find out if sth. is duplicated, then such a check would be necessary, if you only want to remove doublettes, then the set already does everything you want. — Aron_dc, Jan 11 '16 at 10:27
@Aron_dc: since `add` returns a `boolean` telling you whether the element has been added or was a duplicate, there is *never* any reason to check before adding. — Holger, Jan 11 '16 at 11:13
Possible duplicate of [HashMap get/put complexity](http://stackoverflow.com/questions/4553624/hashmap-get-put-complexity) — Ashish Ahuja, Jan 11 '16 at 14:20
@Holger. I have run into cases where you do need to check. In particular, w.r.t. memory use and caching. A million nodes each holding a ("readonly") reference to a different but equivalent copy of something is substantially different in practice than every node holding a reference to the same copy. So you check if the copy is in the (Weak)HashSet, and if so, discard it in favor of the one already there in the set. A little convoluted, but a counterexample none the less. — TLW, Jan 17 '16 at 21:58
@TLW: if you are going to remove the element, if it exists, invoking `contains` before `remove` is the same kind of anti-pattern. Just invoke only `remove` and check the `boolean` result… — Holger, Jan 18 '16 at 09:51
I'm voting to close this question as off-topic because this is easily answered by simply reading the documentation as the accepted answer proves! — , Aug 05 '16 at 22:12

score 32 · Accepted Answer · edited Jun 20 '20 at 09:12

As per the docs:

public boolean add(E e)

Adds the specified element to this set if it is not already present. More formally, adds the specified element e to this set if this set contains no element e2 such that (e==null ? e2==null : e.equals(e2)). If this set already contains the element, the call leaves the set unchanged and returns false.

So the add() method already returns you a true or a false. So you don't need to do the additional check.

score 10 · Answer 2 · edited Jan 11 '16 at 11:16

10

Compare with the API documentation of Set.add(E)

The add method checks if the element is already in the Set. If the element is already present, then the new element is not added, and the Set remains unchanged. In most situations, you don't need to check anything.

The complexity of the method depends of the concrete implementation of Set that you are using.

edited Jan 11 '16 at 11:16

Holger

243,335
30
362
661

answered Jan 10 '16 at 23:56

Alejandro Goñi

492
1
3
14

DMozzy · Answer 3 · 2016-01-11T00:01:43.820

4

Its ok not to check. This is the main advantage over Sets of Lists, as they will automatically filter out duplicates.

HashSet has constant time performance (http://docs.oracle.com/javase/8/docs/api/java/util/HashSet.html)

This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets

edited Jan 11 '16 at 00:01

answered Jan 10 '16 at 23:52

DMozzy

869
5
7

1

@YassinHajaj - have linked to APIi and provided the relevant section. – DMozzy Jan 10 '16 at 23:58

score 2 · Answer 4 · answered Jan 10 '16 at 23:59

The add function returns a boolean which you can check to determine if the item was already in the Set. This is of course based on your needs and isn't a best practice. Its good to know that it will not remove an item that is already there so it can't be depended on to update the existing value with new information if you are defining equals based on surrogate keys from your database. This is opposite the way Maps work as a map will return any existing value and replace it with the new value.

score 1 · Answer 5 · answered Jan 11 '16 at 00:16

Here are answers to your questions:

When we try to insert duplicates, it does not throw any error and automatically removes duplicates.

Your understanding is not correct. The call to Set.add() will not add a new item if it is already in the set; this statement applies to all implementations of Set, including HashSet and TreeSet.

Is it a good practice to check each value before inserting into set whether it exists or not? or is it okay to do something like the below code? I think java would be internally doing the check using .contains(value) . What do you think?

Because your understanding was incorrect from the start, then you do not need to check each value before inserting into the set to see if it already exists. Yes, internally, it is doing something like contains().

What would be the Big Oh complexity in both the cases considering there are "n" elements going into the set?

For HashSet, the time complexity is O(1) for each add(). For TreeSet() -- which you didn't use -- the time complexity is O(lg N) for each add().

A HashSet can have a complexity of `O(n)` if the hashing algorithm is not optimal — Sam Sun, Jan 11 '16 at 00:18

Should you check for a duplicate before inserting into a set

5 Answers5