Case-insensitive string startswith in Python

Question

Here is how I check whether mystring begins with some string:

>>> mystring.lower().startswith("he")
True

The problem is that mystring is very long (thousands of characters), so the lower() operation takes a lot of time.

QUESTION: Is there a more efficient way?

My unsuccessful attempt:

>>> import re;
>>> mystring.startswith("he", re.I)
False
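
A likely explanation for the False above, based on the documented signature str.startswith(prefix[, start[, end]]): the second positional argument is a start index, not a flags argument. re.I is an integer-valued flag equal to 2, so the call asks whether the string starts with "he" at index 2. The sample value "Hello" below is an assumption for illustration, since the question does not show mystring.

```python
import re

mystring = "Hello"  # assumed sample value; not given in the question

# re.I is an integer flag with value 2, so the call below means
# "does mystring[2:] start with 'he'?" rather than "ignore case".
print(int(re.I))                        # 2
print(mystring.startswith("he", re.I))  # False: compares against "llo"
print(mystring.startswith("ll", re.I))  # True: "llo" starts with "ll"
```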

Thanks. I benefited from your 'problematic' example. – Jiminion Jan 08 '16 at 16:54

score 60 · Accepted Answer · edited Nov 19 '15 at 11:01

You could use a regular expression as follows:

In [33]: bool(re.match('he', 'Hello', re.I))
Out[33]: True 

In [34]: bool(re.match('el', 'Hello', re.I))
Out[34]: False

On a 2000-character string this is about 20 times faster than calling lower() on the whole string:

In [38]: s = 'A' * 2000

In [39]: %timeit s.lower().startswith('he')
10000 loops, best of 3: 41.3 us per loop

In [40]: %timeit bool(re.match('el', s, re.I))
100000 loops, best of 3: 2.06 us per loop

If you are matching the same prefix repeatedly, pre-compiling the regex can make a large difference:

In [41]: p = re.compile('he', re.I)

In [42]: %timeit p.match(s)
1000000 loops, best of 3: 351 ns per loop

For short prefixes, slicing the prefix out of the string before converting it to lowercase could be even faster:

In [43]: %timeit s[:2].lower() == 'he'
1000000 loops, best of 3: 287 ns per loop

Relative timings of these approaches will of course depend on the length of the prefix. On my machine the breakeven point seems to be about six characters, which is when the pre-compiled regex becomes the fastest method.
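
Since the breakeven point depends on the machine and the Python version, it may be worth measuring it locally. The following is a rough sketch using the standard timeit module; the helper name compare_prefix_checks and the chosen prefix lengths are assumptions for illustration, and the lambda wrappers add a small constant overhead to every measurement.

```python
import re
import timeit

def compare_prefix_checks(s, prefix, number=20_000):
    """Return seconds per call for three case-insensitive prefix checks."""
    lowered = prefix.lower()
    n = len(prefix)
    # re.escape keeps any regex metacharacters in the prefix literal.
    pattern = re.compile(re.escape(prefix), re.IGNORECASE)
    candidates = {
        "lower + startswith": lambda: s.lower().startswith(lowered),
        "compiled regex": lambda: pattern.match(s) is not None,
        "slice + lower": lambda: s[:n].lower() == lowered,
    }
    return {
        name: timeit.timeit(check, number=number) / number
        for name, check in candidates.items()
    }

if __name__ == "__main__":
    s = "A" * 2000
    for length in (2, 6, 20):
        prefix = "he" * (length // 2)
        results = compare_prefix_checks(s, prefix)
        # Report nanoseconds per call for each approach.
        print(length, {k: f"{v * 1e9:.0f} ns" for k, v in results.items()})
```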

In my experiments, checking every character separately could be even faster:

In [44]: %timeit (s[0] == 'h' or s[0] == 'H') and (s[1] == 'e' or s[1] == 'E')
1000000 loops, best of 3: 189 ns per loop

However, this method only works for prefixes that are known when you're writing the code, and doesn't lend itself to longer prefixes.
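
When the prefix is only known at run time, the pre-compiled regex approach can be wrapped in a small factory. This is a minimal sketch, not part of the original answer: the name make_prefix_matcher is hypothetical, and re.escape is added so that characters such as "." in the prefix are matched literally.

```python
import re

def make_prefix_matcher(prefix):
    """Build a reusable case-insensitive prefix test for one fixed prefix."""
    # Compile once; re.escape keeps characters such as "." or "(" literal.
    pattern = re.compile(re.escape(prefix), re.IGNORECASE)

    def matches(text):
        # Pattern.match only attempts a match at the start of text.
        return pattern.match(text) is not None

    return matches

starts_with_he = make_prefix_matcher("he")
print(starts_with_he("Hello"))     # True
print(starts_with_he("A" * 2000))  # False
```

The compile cost is paid once per prefix, which matches the amortisation argument made in the comments below.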

Your test is a bit wrong, as you do not include the time of `re.compile()`. – Zaur Nasibov Nov 27 '12 at 08:42
@BasicWolf: The key is in *"If you are matching the same prefix repeatedly..."*. What it's saying is that the cost of the compile (~900ns) gets amortised across many matches and becomes negligible. — NPE, Nov 27 '12 at 14:07
Note that this really only works for the trivial cases, since Python's regex implementation is sadly not compliant with the actual Unicode standard. To name just one example, in a case insensitive comparison `ß == SS` should be true, but `re.match('ß', 'SS', re.I)` does not match. To be fair the lower() solution is just as incorrect, so no big harm there. — Voo, Mar 22 '17 at 14:19

score 29 · Answer 2 · answered Nov 27 '12 at 07:02

How about this:

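prefix = 'he'
# Lowercase only the first len(prefix) characters, not the whole string.
if myVeryLongStr[:len(prefix)].lower() == prefix.lower():
    ...  # myVeryLongStr starts with prefix, ignoring case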

inspectorG4dget

Interesting idea. – Developer Feb 22 '18 at 14:15

score 7 · Answer 3 · answered Dec 21 '18 at 19:23

Another simple solution is to pass startswith() a tuple containing every case variant you need to match, e.g. .startswith(('case1', 'case2', ...)).

For example:

>>> 'Hello'.startswith(('He', 'HE'))
True
>>> 'HEllo'.startswith(('He', 'HE'))
True

Aziz Alto

That's unlikely to happen in the real world when both strings are typically coming from external source... Also generating all options is extremely inefficient (2^n options for string of length n) – The Godfather Mar 01 '21 at 15:52
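
To make the 2^n growth mentioned in the comment concrete, here is a small sketch that generates the full tuple of case variants for the startswith() tuple approach. The helper name case_variants is hypothetical, and the sketch assumes a prefix of plain ASCII letters and digits.

```python
from itertools import product

def case_variants(prefix):
    """Return every upper/lower combination of prefix as a tuple."""
    # One list of choices per character; the set collapses characters
    # that have no distinct case (digits, punctuation) to a single entry.
    choices = [sorted({ch.lower(), ch.upper()}) for ch in prefix]
    return tuple("".join(combo) for combo in product(*choices))

variants = case_variants("he")
print(variants)                      # ('HE', 'He', 'hE', 'he')
print("hEllo".startswith(variants))  # True
print(len(case_variants("hello")))   # 32, i.e. 2**5
```

The tuple doubles in size with every additional letter, so this is only reasonable for very short prefixes.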

score 3 · Answer 4 · answered Mar 22 '17 at 14:31

None of the given answers is actually correct as soon as you consider anything outside the ASCII range.

For example, in a case-insensitive comparison, ß should be considered equal to SS if you're following Unicode's case mapping rules.

To get correct results, the easiest solution is to install the third-party regex module (pip install regex), which follows the standard:

import re
import regex
# enable new improved engine instead of backwards compatible v0
regex.DEFAULT_VERSION = regex.VERSION1 

print(re.match('ß', 'SS', re.IGNORECASE))  # prints None (no match)
print(regex.match('ß', 'SS', regex.IGNORECASE))  # prints a match object
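
For comparison, the standard library exposes Unicode full case folding on whole strings through str.casefold(). The sketch below only illustrates how folding differs from lower() on the ß example; it is not a drop-in replacement for the regex module.

```python
import re

print("ß".lower())                         # ß (lower() leaves it unchanged)
print("ß".casefold())                      # ss
print("SS".casefold() == "ß".casefold())   # True
print(re.match("ß", "SS", re.IGNORECASE))  # None
```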

Voo


Re: your comment on german language stack about "noob." I just wanted to tell you there's no "statue of limitations." :D it's a statuTe! – Buttle Butkus May 27 '17 at 00:26
@buttle Swype very much disagrees with that. Although I do like the idea of a statue ;-) – Voo May 27 '17 at 05:14

Alex L · Answer 5 · 2012-11-27T07:20:05.400

Depending on the performance of .lower(), if the prefix is short enough it might be faster to compare each character against both of its cases:

s = 'A' * 2000
prefix = 'he'
ch0 = s[0]
ch1 = s[1]
# Parentheses are required because `and` binds tighter than `or`.
substr = (ch0 == 'h' or ch0 == 'H') and (ch1 == 'e' or ch1 == 'E')

Timing (using the same string as NPE):

>>> timeit.timeit("ch0 = s[0]; ch1 = s[1]; ch0 == 'h' or ch0 == 'H' and ch1 == 'e' or ch1 == 'E'", "s = 'A' * 2000")
0.2509511683747405

= 0.25 us per loop

Compared to existing method:

>>> timeit.timeit("s.lower().startswith('he')", "s = 'A' * 2000", number=10000)
0.6162763703208611

= 61.63 us per loop

(This is horrible, of course, but if the code is extremely performance critical then it might be worth it)
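
One thing to watch with a hand-written check like this is operator precedence: `and` binds tighter than `or`, so the grouping has to be spelled out with parentheses. A small sketch contrasting the two forms (the function names ungrouped and grouped are for illustration only):

```python
def ungrouped(s):
    ch0, ch1 = s[0], s[1]
    # Parsed as: ch0 == 'h' or (ch0 == 'H' and ch1 == 'e') or ch1 == 'E'
    return ch0 == 'h' or ch0 == 'H' and ch1 == 'e' or ch1 == 'E'

def grouped(s):
    ch0, ch1 = s[0], s[1]
    # Each character is checked against both of its cases.
    return (ch0 == 'h' or ch0 == 'H') and (ch1 == 'e' or ch1 == 'E')

# "hxllo" and "xEllo" do not start with "he" in any case, yet the
# ungrouped form accepts both of them.
for sample in ("Hello", "hxllo", "xEllo"):
    print(sample, ungrouped(sample), grouped(sample))
```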

Of course using Python is pretty much dumb in the first place if those milliseconds count like that. — Mad Physicist, Apr 12 '16 at 15:34

score 0 · Answer 6 · answered Jul 25 '20 at 01:27

In Python 3.8, the fastest solution involves slicing and comparing the prefix, as suggested in this answer:

def startswith(a_source: str, a_prefix: str) -> bool:
    source_prefix = a_source[:len(a_prefix)]
    return source_prefix.casefold() == a_prefix.casefold()

The second fastest solution uses ctypes to call a C library function such as _wcsicmp. Note: this example is Windows-only.

import ctypes.util

libc_name = ctypes.util.find_library('msvcrt')
libc = ctypes.CDLL(libc_name)

libc._wcsicmp.argtypes = (ctypes.c_wchar_p, ctypes.c_wchar_p)

def startswith(a_source: str, a_prefix: str) -> bool:
    source_prefix = a_source[:len(a_prefix)]
    return libc._wcsicmp(source_prefix, a_prefix) == 0

The compiled re solution is the third fastest solution, including the cost of compilation. That solution is even slower if the regex module is used for full Unicode support, as suggested in this answer. Each successive match costs around the same as each of the ctypes calls.

lower() and casefold() are expensive because these functions create new Unicode strings by iterating over each character in the source strings, regardless of case, and mapping them accordingly. (See: How is the built-in function str.lower() implemented?) The time spent in that loop increases with each character, so if you're dealing with short prefixes and long strings, call these functions on only the prefixes.
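
One caveat with slicing before folding: casefold() can change a string's length (for instance "ß" folds to "ss"), so a slice of len(a_prefix) characters does not always cover the same text as the prefix. The following is a hedged sketch of one way to handle that, using a hypothetical helper startswith_folded that folds the source one character at a time and only accepts a match that ends on a character boundary. It does not apply Unicode normalization, which a fully conforming caseless match would also need.

```python
def startswith_folded(a_source: str, a_prefix: str) -> bool:
    """Case-insensitive prefix test that tolerates length-changing folds."""
    target = a_prefix.casefold()
    pieces = []
    folded_len = 0
    # Fold only as many source characters as are needed to cover target.
    for ch in a_source:
        if folded_len >= len(target):
            break
        piece = ch.casefold()
        pieces.append(piece)
        folded_len += len(piece)
    # Equality (rather than startswith) rejects matches that would end
    # in the middle of a folded character.
    return "".join(pieces) == target

print(startswith_folded("Hello", "he"))  # True
print(startswith_folded("SSx", "ß"))     # True
print(startswith_folded("A" * 2000, "he"))  # False
```

Like the slicing version, this touches only the start of the long string, so the cost still depends on the prefix length rather than the source length.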

score 0 · Answer 7 · answered Aug 13 '20 at 03:12

Another option:

import re
o = re.search('(?i)^we', 'Wednesday')
print(o is not None)

https://docs.python.org/library/re.html#re.I

Steven Penny
