
I have data, and each entry needs to be an instance of a class. I'm expecting to encounter many duplicate entries in my data. I essentially want to end up with a set of all the unique entries (i.e. discard any duplicates). However, instantiating the whole lot and putting them into a set after the fact is not optimal because...

  1. I have many entries,
  2. the proportion of duplicated entries is expected to be rather high,
  3. my __init__() method is doing quite a lot of costly computation for each unique entry, so I want to avoid redoing these computations unnecessarily.

I recognize that this is basically the same question asked here but...

  1. the accepted answer doesn't actually solve the problem. If you make __new__() return an existing instance, no new instance is technically created, but Python still calls __init__() on the returned object, which redoes all the work you've already done and makes overriding __new__() completely pointless. (This is easily demonstrated by inserting print statements inside __new__() and __init__() so you can see when they run.)

  2. the other answer requires calling a class method instead of calling the class itself when you want a new instance (e.g. x = MyClass.make_new() instead of x = MyClass()). This works, but it isn't ideal IMHO since it is not the normal way one would think to make a new instance.

Can __new__() be overridden so that it will return an existing entity without running __init__() on it again? If this isn't possible, is there maybe another way to go about this?

ibonyun
  • What makes you say that "it is not the normal way one would think to make a new instance."? In Python, having a classmethod as a constructor is a pretty standard practice. – Mad Physicist Jun 19 '18 at 18:32
  • 2
    `MyClass.make_new()` is definitely the way to go. Having `MyClass()` return an existing object is not at all obvious to a developer reading the code. Avoid surprising behavior. It's a trap that C++ programmers sometimes fall into: hiding booby traps behind overloaded operators. – John Kugelman Jun 19 '18 at 18:35
  • 1
    It seems you want to create a singleton class: https://stackoverflow.com/questions/6760685/creating-a-singleton-in-python – juanpa.arrivillaga Jun 19 '18 at 18:43
  • 1
    Note, generally to do what you want you could work with the metaclass `__call__` method: https://stackoverflow.com/questions/6966772/using-the-call-method-of-a-metaclass-instead-of-new – juanpa.arrivillaga Jun 19 '18 at 18:44
  • @MadPhysicist There are no constructors in my organization's code, nor have I been taught about them in any of the several Python MOOCs I've taken. That's why I say that. But I could be mistaken, since my experience is limited. – ibonyun Jun 19 '18 at 19:05
  • @juanpa.arrivillaga I don't want to limit the class to 1 instance. I just want to avoid having duplicate instances (and not recompute the same stuff potentially hundreds of times when I encounter what would be a duplicate). But I'll look into singletons. Maybe there are ideas there that can help me. – ibonyun Jun 19 '18 at 19:13
  • Ok, it sounds like you want to use *caching* instead of a singleton. Python provides an implementation in `functools.lru_cache` . If your arguments are hashable, just use a factory function decorated by `functools.lru_cache(maxsize=None) # or some reasonable limit` – juanpa.arrivillaga Jun 19 '18 at 19:28
  • This question's title now seems misguided and presupposes what the answer to an implied (more general) question might have been. I will attempt to edit so that it better reflects what I was really trying to ask, which was successfully addressed by the accepted answer. – ibonyun Jun 20 '18 at 22:30
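
A minimal sketch of the lru_cache-based factory approach suggested in the comments above, assuming the constructor arguments are hashable (Entry and make_entry are made-up names for illustration):

    from functools import lru_cache

    class Entry:
        def __init__(self, value):
            self.value = value  # stand-in for the expensive per-entry computation

    @lru_cache(maxsize=None)  # or some reasonable limit
    def make_entry(value):
        # Repeated calls with the same hashable arguments return the cached
        # instance, so Entry.__init__ runs only once per unique entry.
        return Entry(value)

    assert make_entry(1) is make_entry(1)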

1 Answer


Assuming you have a way of identifying duplicate instances, and a mapping of such instances, you have a few viable options:

  1. Use a classmethod to get your instances for you. The classmethod would serve a similar purpose to __call__ in your metaclass (currently type). The main difference is that it would check if an instance with the requested key already exists before calling __new__:

    class QuasiSingleton:
        @classmethod
        def make_key(cls, *args, **kwargs):
            # Create a hashable instance key from the initialization parameters
            ...
    
        @classmethod
        def get_instance(cls, *args, **kwargs):
            key = cls.make_key(*args, **kwargs)
            if 'instances' not in cls.__dict__:  # give each class its own cache
                cls.instances = {}
            if key in cls.instances:
                return cls.instances[key]
            # Only call __init__ as a last resort
            inst = cls(*args, **kwargs)
            cls.instances[key] = inst
            return inst
    

    I would recommend using this base class, especially if your class is mutable in any way. You do not want modifications of one instance to show up in another without making it clear that the two names may refer to the same instance. Calling cls(*args, **kwargs) directly implies that you are getting a different instance every time, or at least that your instances are immutable and you don't care.
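
    For example, a hypothetical subclass might look like this (Entry and its value argument are made-up names; make_key only has to return something hashable derived from the initialization parameters):

    class Entry(QuasiSingleton):
        @classmethod
        def make_key(cls, value):
            return value  # any hashable derived from the init parameters

        def __init__(self, value):
            self.value = value  # stand-in for the expensive computation

    a = Entry.get_instance(42)
    b = Entry.get_instance(42)
    assert a is b  # the second call reuses the cached instance; __init__ ran once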

  2. Redefine __call__ in your metaclass:

    class QuasiSingletonMeta(type):
        def make_key(cls, *args, **kwargs):
            ...
    
        def __call__(cls, *args, **kwargs):
            key = cls.make_key(*args, **kwargs)
            if 'instances' not in cls.__dict__:  # give each class its own cache
                cls.instances = {}
            if key in cls.instances:
                return cls.instances[key]
            inst = super().__call__(*args, **kwargs)
            cls.instances[key] = inst
            return inst
    

    Here, super().__call__ is equivalent to calling __new__ and __init__ for cls.
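
    Used on a concrete class (again with hypothetical names), instantiation then looks completely ordinary; here Entry supplies make_key as a classmethod, which is what cls.make_key inside __call__ resolves to:

    class Entry(metaclass=QuasiSingletonMeta):
        @classmethod
        def make_key(cls, value):
            return value  # any hashable derived from the init parameters

        def __init__(self, value):
            self.value = value  # stand-in for the expensive computation

    a = Entry(42)
    b = Entry(42)
    assert a is b  # __new__ and __init__ ran only for the first call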

In both cases, the basic caching code is the same. The main difference is how to get a new instance from the user's perspective. Using a classmethod like get_instance intuitively informs the user that they may be getting an existing, shared instance. Using a normal call to the class object implies that the instance will always be new, so it should only be done for immutable classes.

Notice that in neither case shown above is there much of a point to calling __new__ without __init__.

  3. A third, hybrid option is possible, though. With this option, you would still be creating a new instance, but copying over the expensive part of __init__'s computation from a class-level cache instead of redoing it all over again. This version would not cause any problems even if implemented through the metaclass, since all the instances would in fact be independent:

    class QuasiSingleton:
        @classmethod
        def make_key(cls, *args, **kwargs):
            ...
    
        def __new__(cls, *args, **kwargs):
            if 'cache' not in cls.__dict__:
                cls.cache = {}
            # object.__new__ does not accept extra arguments, so pass only cls
            return super().__new__(cls)
    
        def __init__(self, *args, **kwargs):
            key = self.make_key(*args, **kwargs)
            if key in self.cache:  # or, more explicitly, type(self).cache
                data = self.cache[key]
            else:
                data = ...  # do the lengthy computation here
                self.cache[key] = data  # remember the result for future instances
            # Initialize self with data object
    

    With this option, remember to call super().__init__() (and super().__new__() if you need it).
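
    For example, a concrete class following this pattern (hypothetical names; it re-implements __init__ with an actual computation in place of the placeholder) hands back distinct instances while doing the expensive work only once per key:

    class Entry(QuasiSingleton):
        @classmethod
        def make_key(cls, value):
            return value

        def __init__(self, value):
            key = self.make_key(value)
            if key in self.cache:
                data = self.cache[key]
            else:
                data = value ** 2  # stand-in for the lengthy computation
                self.cache[key] = data
            self.data = data  # initialize self with the data object

    a = Entry(12)
    b = Entry(12)
    assert a is not b          # still distinct instances...
    assert a.data == b.data    # ...sharing the cached computation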

Mad Physicist
  • Examples helped. Option 2 is effectively what I was looking for. (My class is indeed immutable). Though I had dismissed it earlier, I now see the merits of Option 1 as well. Thanks. – ibonyun Jun 20 '18 at 22:21
  • In Option 3, why did you bother defining `__new__`? Couldn't the cache class attribute be initialized outside of a method definition? – ibonyun Jun 20 '18 at 22:22
  • @ibonyun. Putting the new dict code in `__new__` ensures that inheritance will work properly. I didn't do the check right, so that's fixed now. – Mad Physicist Jun 21 '18 at 02:44