
I need to perform case-insensitive string comparisons in Python, in sets and dictionary keys. Creating set and dict subclasses that are case insensitive proves surprisingly tricky (see: Case insensitive dictionary for ideas; note they all use lower, and there's even a rejected PEP, albeit with a somewhat broader scope). So I went with creating a case-insensitive string class (leveraging this answer by @AlexMartelli):

class CIstr(unicode):
    """Case insensitive with respect to hashes and comparisons string class"""

    #--Hash/Compare
    def __hash__(self):
        return hash(self.lower())
    def __eq__(self, other):
        if isinstance(other, basestring):
            return self.lower() == other.lower()
        return NotImplemented
    def __ne__(self, other): return not (self == other)
    def __lt__(self, other):
        if isinstance(other, basestring):
            return self.lower() < other.lower()
        return NotImplemented
    def __ge__(self, other): return not (self < other)
    def __gt__(self, other):
        if isinstance(other, basestring):
            return self.lower() > other.lower()
        return NotImplemented
    def __le__(self, other): return not (self > other)

I am fully aware that lower is not really enough to cover all cases of string comparison in Unicode, but I am refactoring existing code that used a much clunkier class for string comparisons (memory- and speed-wise), which used lower() anyway, so I can amend this at a later stage. Also, I am on Python 2 (as seen by unicode). My questions are:

  • Did I get the operators right?

  • Is this class enough for my purposes, given that I take care to construct dict keys and set elements as CIstr instances? My purposes are checking equality, containment, set differences and similar operations in a case-insensitive way. Or am I missing something?

  • Is it worth caching the lowercase version of the string (as seen, for instance, in this ancient Python recipe: Case Insensitive Strings)? This comment suggests not, and I want construction to be as fast as possible and size as small as possible, but people seem to include this.

Python 3 compatibility tips are appreciated !

Tiny demo:

d = {CIstr('A'): 1, CIstr('B'): 2}
print 'a' in d # True
s = set(d)
print {'a'} - s # set([])
Mr_and_Mrs_D
  • Are you sure you need a class? Why don't you simply pass a comparison function when needed? Or store stuff `lower`ed? – Karoly Horvath Mar 30 '17 at 15:34
  • @Karoly I need the original strings - passing the comp function would result in less maintainable code – Mr_and_Mrs_D Mar 30 '17 at 15:35
  • I would be worried about having instances of `CIstr` claiming to be equal to normal strings that are not equal to them, and have a different hash. – khelwood Mar 30 '17 at 15:36
  • @khelwood: any example where this would lead to broken behavior? – Mr_and_Mrs_D Mar 30 '17 at 15:39
  • @Mr_and_Mrs_D: I'm not sure what you're doing, but creating wrapper functions for the lookup probably solves the maintainability issue, but that's just my guess. – Karoly Horvath Mar 30 '17 at 15:39
  • @Mr_and_Mrs_D Well given the dictionary you defined in your tiny demo, I wouldn't expect `d['A']` to work (since capital `'A'` has a different hash from `CIstr('A')`), but maybe that's not a requirement for you. – khelwood Mar 30 '17 at 15:50
  • @khelwood: The contract is that those dicts would only have CIstr instances as keys - ideally, the machinery would be inside the dict but that proves tricky as seen in links – Mr_and_Mrs_D Mar 30 '17 at 15:52
  • But in your demo you are using `'a'` to look stuff up in your set. It wouldn't work if you tried to use `'A'`. Also `'A' in d.keys()` would be true, but `'A' in d` would be false. You've essentially created a type that violates the normal contract of all hashes, by claiming to be equal to objects that have different hashes. – khelwood Mar 30 '17 at 15:53
  • @khelwood: valid points - what existing code does is exactly having sets (dicts) of lowercase strings (keys) to compare to each other. Still the question remains - any way to get this right ? – Mr_and_Mrs_D Mar 30 '17 at 16:03
  • You could combine this answer with the answers about creating specialised dicts, and have a dict that converted any possible key into `CIstr` before trying to look it up. Then all your `CIstr` conversions could be hidden away inside the dictionary class. – khelwood Mar 30 '17 at 16:07
  • @khelwood : that code is fit for an answer :P Note that even the constructor of such a dict is hard to get right – Mr_and_Mrs_D Mar 30 '17 at 16:08

2 Answers


The code mostly looks fine. I would eliminate the shortcuts in __ge__, __le__, and __ne__ and expand them to call lower() directly.

The shortcut looks like what is done in `functools.total_ordering()`, but it just slows down the code and makes it harder to test cross-type comparisons, which are tricky to get right when the methods are interdependent.
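A minimal sketch of what that expansion could look like. It is written for Python 3 as an assumption (so `str` stands in for the question's `unicode` and `basestring`); on Python 2 the same shape should apply with those names swapped back. Each operator lowers both operands itself and returns NotImplemented for non-strings, so no method depends on another:

```python
class CIstr(str):
    """Case-insensitive string; each operator lowers both operands itself."""

    def __hash__(self):
        return hash(self.lower())

    def __eq__(self, other):
        if isinstance(other, str):
            return self.lower() == other.lower()
        return NotImplemented

    def __ne__(self, other):
        if isinstance(other, str):
            return self.lower() != other.lower()
        return NotImplemented

    def __lt__(self, other):
        if isinstance(other, str):
            return self.lower() < other.lower()
        return NotImplemented

    def __le__(self, other):
        if isinstance(other, str):
            return self.lower() <= other.lower()
        return NotImplemented

    def __gt__(self, other):
        if isinstance(other, str):
            return self.lower() > other.lower()
        return NotImplemented

    def __ge__(self, other):
        if isinstance(other, str):
            return self.lower() >= other.lower()
        return NotImplemented
```

With this version, a comparison against a non-string hands NotImplemented back to the interpreter instead of negating the result of another operator.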

Raymond Hettinger
  • Unfortunately, as it stands, this leads to wrong behavior (`'A' in d.keys()` vs `'A' in d`), so I accepted @khelwood's answer. I had to go ahead and write the wrapper dict: http://stackoverflow.com/a/43457369/281545. Comments more than welcome :) – Mr_and_Mrs_D Apr 17 '17 at 18:39

In your demo you are using 'a' to look stuff up in your set. It wouldn't work if you tried to use 'A', because 'A' has a different hash. Also 'A' in d.keys() would be true, but 'A' in d would be false. You've essentially created a type that violates the normal contract of all hashes, by claiming to be equal to objects that have different hashes.
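A small sketch of that inconsistency, assuming Python 3 (so `str` replaces `unicode`/`basestring`) and a trimmed-down CIstr with only the hash and equality methods. Note that `d.keys()` is a list in Python 2 but a hash-based view in Python 3, so `list(d)` is used here to reproduce the linear scan:

```python
class CIstr(str):
    """Trimmed-down version of the question's class (hash and == only)."""

    def __hash__(self):
        return hash(self.lower())

    def __eq__(self, other):
        if isinstance(other, str):
            return self.lower() == other.lower()
        return NotImplemented


d = {CIstr('A'): 1}

# A list scan only uses ==, so the plain 'A' is "found".
found_in_key_list = 'A' in list(d)

# A dict lookup compares hashes first; hash('A') differs from hash('a'),
# which is what CIstr('A') was stored under, so the key is missed.
found_in_dict = 'A' in d

# The lowercase plain string happens to share the stored hash.
lower_found_in_dict = 'a' in d
```

Here `found_in_key_list` and `lower_found_in_dict` come out true while `found_in_dict` comes out false, even though all three strings compare equal to the stored key.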

You could combine this answer with the answers about creating specialised dicts, and have a dict that converted any possible key into CIstr before trying to look it up. Then all your CIstr conversions could be hidden away inside the dictionary class.

E.g.

class CaseInsensitiveDict(dict):
    def __setitem__(self, key, value):
        super(CaseInsensitiveDict, self).__setitem__(convert_to_cistr(key), value)
    def __getitem__(self, key):
        return super(CaseInsensitiveDict, self).__getitem__(convert_to_cistr(key))
    # __init__, __contains__ etc.

(Based on https://stackoverflow.com/a/2082169/3890632)
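To fill in the elided methods a little, here is a rough sketch, assuming Python 3 and a trimmed-down CIstr. The body of `convert_to_cistr` is my guess at what such a helper would do (wrap plain strings, pass everything else through), and only a handful of dict methods are covered; `setdefault`, `pop`, `fromkeys`, `copy` and so on would need the same treatment:

```python
class CIstr(str):
    """Trimmed-down case-insensitive string (hash and == only)."""

    def __hash__(self):
        return hash(self.lower())

    def __eq__(self, other):
        if isinstance(other, str):
            return self.lower() == other.lower()
        return NotImplemented


def convert_to_cistr(key):
    """Hypothetical helper: wrap plain strings, leave other keys untouched."""
    if isinstance(key, str) and not isinstance(key, CIstr):
        return CIstr(key)
    return key


class CaseInsensitiveDict(dict):
    def __init__(self, *args, **kwargs):
        super().__init__()
        # Route every initial item through __setitem__ so keys get converted.
        self.update(*args, **kwargs)

    def update(self, *args, **kwargs):
        for key, value in dict(*args, **kwargs).items():
            self[key] = value

    def __setitem__(self, key, value):
        super().__setitem__(convert_to_cistr(key), value)

    def __getitem__(self, key):
        return super().__getitem__(convert_to_cistr(key))

    def __delitem__(self, key):
        super().__delitem__(convert_to_cistr(key))

    def __contains__(self, key):
        return super().__contains__(convert_to_cistr(key))

    def get(self, key, default=None):
        return super().get(convert_to_cistr(key), default)
```

With this, `CaseInsensitiveDict({'A': 1})` ends up with a single CIstr key, lookups with `'a'` or `'A'` both succeed, and non-string keys such as integers are stored unchanged.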

khelwood
  • That's what I initially thought - but there are many complications in creating such a class in a bulletproof way - so even the constructor would be tricky - what would `CaseInsensitiveDict({'A': 1})` yield ? – Mr_and_Mrs_D Mar 30 '17 at 16:14
  • I presume that you would have to iterate through the items in the given dict and convert each key to the form you want. If you want something different from that, it's up to you to figure out what your requirement is. – khelwood Mar 30 '17 at 16:16
  • Think also that such a dict should take care to differentiate between string and non string keys – Mr_and_Mrs_D Mar 30 '17 at 16:17
  • I went for the wrapper dict - please have a look at http://stackoverflow.com/a/43457369/281545 :) – Mr_and_Mrs_D Apr 17 '17 at 18:40