Regex pattern and/or NSRegularExpression a bit too slow searching over very large file, can it be optimized?

Question

In an iOS framework, I am searching through this 3.2 MB file for pronunciations: https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx/model/lm/en_US/cmu07a.dic

I am using NSRegularExpression to search for an arbitrary set of words that are given as an NSArray. The search runs over the contents of the large file, loaded as an NSString. I need to match any word that appears between a newline and a tab character, and then grab the whole line. For example, if I have the word "monday" in my NSArray, I want to match this line within the dictionary file:

monday  M AH N D IY

This line starts with a newline, the string "monday" is followed by a tab character, and then the pronunciation follows. The entire line needs to be matched by the regex for its ultimate output. I also need to find alternate pronunciations of the words which are listed as follows:

monday(2)   M AH N D EY

The alternative pronunciations always begin with (2) and can go as high as (5). So I also search for iterations of the word followed by parentheses containing a single number bracketed by a newline and a tab character.

I have a 100% working NSRegularExpression method as follows:

NSArray *array = [NSArray arrayWithObjects:@"change",@"friday",@"monday",@"saturday",@"sunday", @"thursday",@"tuesday",@"wednesday",nil]; // This array could contain any arbitrary words but they will always be in alphabetical order by the time they get here.

// Use this string to build up the pattern.
NSMutableString *mutablePatternString = [[NSMutableString alloc]initWithString:@"^("]; 

int firstRound = 0;
for(NSString *word in array) {
    if(firstRound == 0) { // this is the first round

        firstRound++;
    } else { // After the first iteration we need an OR operator first.
        [mutablePatternString appendString:[NSString stringWithFormat:@"|"]];
     }
    [mutablePatternString appendString:[NSString stringWithFormat:@"(%@(\\(.\\)|))",word]];
}

[mutablePatternString appendString:@")\\t.*$"];

// This results in this regex pattern:

// ^((change(\(.\)|))|(friday(\(.\)|))|(monday(\(.\)|))|(saturday(\(.\)|))|(sunday(\(.\)|))|(thursday(\(.\)|))|(tuesday(\(.\)|))|(wednesday(\(.\)|)))\t.*$

NSRegularExpression * regularExpression = [NSRegularExpression regularExpressionWithPattern:mutablePatternString
                                                                                     options:NSRegularExpressionAnchorsMatchLines
                                                                                       error:nil];
int rangeLocation = 0;
int rangeLength = [string length];
NSMutableArray * matches = [NSMutableArray array];
[regularExpression enumerateMatchesInString:string
                                     options:0
                                       range:NSMakeRange(rangeLocation, rangeLength)
                                  usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop){
                                      [matches addObject:[string substringWithRange:result.range]];
                                  }];

[mutablePatternString release];

// matches array is returned to the caller.

My issue is that given the big text file, it isn't really fast enough on the iPhone. 8 words take 1.3 seconds on an iPhone 4, which is too long for the application. Given the following known factors:

• The 3.2 MB text file has the words to match listed in alphabetical order

• The array of arbitrary words to look up is always in alphabetical order when it gets to this method

• Alternate pronunciations start with (2) in parens after the word, not (1)

• If there is no (2) there won't be a (3), (4) or more

• The presence of one alternative pronunciation is rare, occurring maybe 1 time in 8 on average. Further alternate pronunciations are even rarer.

Can this method be optimized, either by improving the regex or some aspect of the Objective-C? I'm assuming that NSRegularExpression is already optimized enough that it isn't going to be worthwhile trying to do it with a different Objective-C library or in C, but if I'm wrong here let me know. Otherwise, very grateful for any suggestions on improving the performance. I am hoping to make this generalized to any pronunciation file so I'm trying to stay away from solutions like calculating the alphabetical ranges ahead of time to do more constrained searches.

****EDIT****

Here are the timings on the iPhone 4 for all of the search-related answers given by August 16th 2012:

dasblinkenlight's create NSDictionary approach https://stackoverflow.com/a/11958852/119717: 5.259676 seconds

Ωmega's fastest regex at https://stackoverflow.com/a/11957535/119717: 0.609593 seconds

dasblinkenlight's multiple NSRegularExpression approach at https://stackoverflow.com/a/11969602/119717: 1.255130 seconds

my first hybrid approach at https://stackoverflow.com/a/11970549/119717: 0.372215 seconds

my second hybrid approach at https://stackoverflow.com/a/11970549/119717: 0.337549 seconds

The best time so far is the second version of my answer. I can't mark any of the answers best, since all of the search-related answers informed the approach that I took in my version so they are all very helpful and mine is just based on the others. I learned a lot and my method ended up a quarter of the original time so this was enormously helpful, thank you dasblinkenlight and Ωmega for talking it through with me.

If your file is under your control and has no junk in it, you can optimize your regex a little: `^(sunday|monday|tuesday|...)(\t|\\().*$`. You know that whatever comes in parentheses is a single character followed by a closing parenthesis, so you can skip that portion of the match. Bringing all your strings into a single `OR` block might help as well, but I am not sure if it's going to help much. — Sergey Kalinichenko, Aug 14 '12 at 17:12
I don't know if you're the one who is providing the pronunciation file, but if you are you might want to store it differently. If you had all pronunciations grouped with their word in a single object you could search much more quickly by name (alphabetical array searches can be done in log(n) I believe). — Dustin, Aug 14 '12 at 17:19
^(sunday|monday|tuesday|...)(\t|\\().*$ got it down to .90 seconds, definitely a big help, thank you. How does the single OR block work? — Halle, Aug 14 '12 at 17:27
@Halle It works the same as the one that you provided, but since the individual components are simpler, the regex engine is able to generate a faster state machine to perform the match. — Sergey Kalinichenko, Aug 14 '12 at 17:30
Right, sorry, I misunderstood you to be recommending a further optimization that wasn't in your example. — Halle, Aug 14 '12 at 17:35
@Halle What is the encoding of your file? How do you currently read it? — Sergey Kalinichenko, Aug 14 '12 at 18:29
It's UTF-8 and I read it with [[NSString alloc] initWithContentsOfFile:pathToFileAsString encoding:NSUTF8StringEncoding error:&error]; — Halle, Aug 14 '12 at 18:35

score 4 · Answer 1 · edited Jul 07 '20 at 14:11

Try this one:

^(?:change|monday|tuesday|wednesday|thursday|friday|saturday|sunday)(?:\([2-5]\))?\t.*$

and also this one (using positive lookahead with list of possible first letters):

^(?=[cmtwfs])(?:change|monday|tuesday|wednesday|thursday|friday|saturday|sunday)(?:\([2-5]\))?\t.*$

and at the end, a version with some optimization:

^(?=[cmtwfs])(?:change|monday|t(?:uesday|hursday)|wednesday|friday|s(?:aturday|unday))(?:\([2-5]\))?\t.*$
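Since the words are only known at runtime, the first pattern above has to be assembled in code. A minimal sketch of one way to do that follows. The helper names `PronunciationPattern` and `MatchingLines` are hypothetical, and the sketch assumes a non-empty word list; `escapedPatternForString:` is used so that a word containing a regex metacharacter is still matched literally.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: builds ^(?:w1|w2|...)(?:\([2-5]\))?\t.*$ for the given words.
// Assumes `words` is a non-empty array of NSString objects.
static NSString *PronunciationPattern(NSArray *words) {
    NSMutableArray *escaped = [NSMutableArray arrayWithCapacity:[words count]];
    for (NSString *word in words) {
        // Escape metacharacters such as '.' so they match literally.
        [escaped addObject:[NSRegularExpression escapedPatternForString:word]];
    }
    return [NSString stringWithFormat:@"^(?:%@)(?:\\([2-5]\\))?\\t.*$",
            [escaped componentsJoinedByString:@"|"]];
}

// Hypothetical helper: returns every full line of `contents` whose word is in `words`,
// including alternate pronunciations marked (2) through (5).
static NSArray *MatchingLines(NSString *contents, NSArray *words) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:PronunciationPattern(words)
                                                  options:NSRegularExpressionAnchorsMatchLines
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return [NSArray array];
    }
    NSMutableArray *lines = [NSMutableArray array];
    [regex enumerateMatchesInString:contents
                            options:0
                              range:NSMakeRange(0, [contents length])
                         usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop) {
        [lines addObject:[contents substringWithRange:result.range]];
    }];
    return lines;
}
```

Calling `MatchingLines(fileContents, array)` with the file loaded as an NSString should give the same lines as the hand-written pattern. Only convenience constructors are used, so the sketch needs no explicit `release` calls whether or not ARC is enabled.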

edited Jul 07 '20 at 14:11

Iulian Onofrei

answered Aug 14 '12 at 17:22

Ωmega

Nice, this got it down to .76 seconds. – Halle Aug 14 '12 at 17:32
OK, that one I'll have to try tomorrow since it will require refactoring the word input to group the words by first character since the words aren't known until runtime. – Halle Aug 14 '12 at 18:29
@Halle - The last one may or may not be the best fit, depends on size of input and number of matches. Good luck! – Ωmega Aug 14 '12 at 18:36

Sergey Kalinichenko · Answer 2 · 2012-08-15T00:29:24.517

4

Since you are putting the entire file into memory anyway, you might as well represent it as a structure that is easy to search:

Create a mutable NSDictionary words, with NSString keys and NSMutableArray values
Read the file into memory
Go through the string representing the file line-by-line
For each line, separate out the word part by searching for a '(' or a '\t' character
Get a sub-string for the word (from zero to the index of the '(' or '\t' minus one); this is your key.
Check whether words contains your key; if it does not, add a new NSMutableArray
Add the line to the NSMutableArray that you found/created at the specific key
Once you are finished, throw away the original string representing the file.

With this structure in hand, you should be able to do your searches in time that no regex engine would be able to match, because you replaced a full-text scan, which is linear, with a hash look-up, which is constant-time.

** EDIT: ** I checked the relative speed of this solution vs. regex, timing only the search and not the one-time build of the dictionary: it is about 60 times faster on a simulator. This is not at all surprising, because the odds are stacked heavily against the regex-based solution.

Reading the file:

NSBundle *bdl = [NSBundle bundleWithIdentifier:@"com.poof-poof.TestAnim"];
NSString *path = [NSString stringWithFormat:@"%@/words_pron.dic", [bdl bundlePath]];
data = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:nil];
NSMutableDictionary *tmp = [NSMutableDictionary dictionary];
NSUInteger pos = 0;
NSMutableCharacterSet *terminator = [NSMutableCharacterSet characterSetWithCharactersInString:@"\t("];
while (pos < data.length) { // '<' rather than '!=': pos steps past the end when the last line has no trailing newline
    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
        rangeOfCharacterFromSet:[NSCharacterSet newlineCharacterSet]
        options:NSLiteralSearch
        range:remaining
    ];
    if (next.location != NSNotFound) {
        next.length = next.location - pos;
        next.location = pos;
    } else {
        next = remaining;
    }
    pos += (next.length+1);
    NSString *line = [data substringWithRange:next];
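    NSRange keyRange = [line rangeOfCharacterFromSet:terminator];
    if (keyRange.location == NSNotFound) {
        continue; // Skip blank or malformed lines that contain neither a tab nor a '('.
    }
    keyRange.length = keyRange.location;
    keyRange.location = 0;
    NSString *key = [line substringWithRange:keyRange];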
    NSMutableArray *array = [tmp objectForKey:key];
    if (!array) {
        array = [NSMutableArray array];
        [tmp setObject:array forKey:key];
    }
    [array addObject:line];
}
dict = tmp; // dict is your NSMutableDictionary ivar

Searching:

NSArray *keys = [NSArray arrayWithObjects:@"sunday", @"monday", @"tuesday", @"wednesday", @"thursday", @"friday", @"saturday", nil];
NSMutableArray *all = [NSMutableArray array];
NSLog(@"Starting...");
for (NSString *key in keys) {
    for (NSString *s in [dict objectForKey:key]) {
        [all addObject:s];
    }
}
NSLog(@"Done! %lu", (unsigned long)all.count);

edited Aug 15 '12 at 00:29

answered Aug 14 '12 at 18:51

Sergey Kalinichenko

I like the idea very much but for me using the cmu07a.dic file this takes around .4 seconds on the Simulator and the regex from Ωmega takes 0.07 seconds. Not sure why the difference in our results. I also tried a version of the same approach by converting the string into an array at the start using componentsSeparatedBy:@"\n" and then fast enumerating through them with a second round of componentsSeparatedBy:@"\t" and time was about the same (although memory footprint not). – Halle Aug 15 '12 at 09:42
@Halle That is very surprising - my timing on a simulator is about 0.002 for `NSMutableDictionary` and 0.12 for the regexp. I do not include building of the `NSMutableDictionary` in my timing, though: I build `dict` once in the `viewDidLoad` method, and never touch it again. It's only the "Searching:" portion of the above code that gets timed. – Sergey Kalinichenko Aug 15 '12 at 09:53
Got it, not surprising if you aren't including the dictionary build-up, but the regexes are being timed from reading in the dictionary as a string to completion of the search since time processing the file counts for the overall question of the application latency. This operation is likely to only be run once in an app session and right after the class is initialized so there's no big opportunity to pre-cache the file. – Halle Aug 15 '12 at 09:59
@Halle Then it is not necessary to build a dictionary. Just going through the file once will be sufficient. I'll give it a try and see what I get. – Sergey Kalinichenko Aug 15 '12 at 10:08
Good point! For my part I'm now wondering if it's effective to process the string into XML (given that its formatting is standardized) and serialize it directly to an NSDictionary without intermediary steps. The text processing is probably too slow though. I could try regex ;) . – Halle Aug 15 '12 at 10:13
OK, I tested out taking the final NSDictionary output and writing it out as a binary plist that I could distribute instead of the text .dic file (not ideal but OK as a stopgap -- I'd have to provide methods to convert other pronunciation dictionaries into the same format but that isn't prohibitive) but even the time to pull the binary plist data directly into an NSDictionary was twice as much as the regex time. – Halle Aug 15 '12 at 11:09
@Halle I wouldn't go that route, because the number of objects that you need to create while reading the dictionary from the file is vastly higher. Creating objects clearly dominates the process here, so I'd try avoiding it as much as I can. – Sergey Kalinichenko Aug 15 '12 at 11:15
Yup, it doesn't solve anything but it introduces a lot of new side issues. – Halle Aug 15 '12 at 11:26
I can get it down to half the time if I create an NSSet at the start from the words to match as follows: NSSet *comparisonSet = [NSSet setWithArray:arrayOfWordsToMatch]; and then I remove everything after NSString *key = [line substringWithRange:keyRange]; and check key against the word list instead as follows: if([comparisonSet containsObject:key]) and then just store the hits. Unfortunately this is still almost 3x the time of the regex. – Halle Aug 15 '12 at 11:47
@Halle I do not think you can beat regex without preprocessing. Even a very fast search that goes through the words one-by-one is a bit slower than a regex. You can try beating one regex with multiple ones (I'll see if I could come up with an alternative answer). – Sergey Kalinichenko Aug 15 '12 at 12:21

Halle · Answer 3 · 2012-08-15T16:05:30.707

Here is my hybrid approach of dasblinkenlight's and Ωmega's answers, which I thought I should add as an answer as well at this point. It uses dasblinkenlight's method of doing a forward search through the string and then performs the full regex on a small range in the event of a hit, so it exploits the fact that the dictionary and words to look up are both in alphabetical order and benefits from the optimized regex. Wish I had two best answer checks to give out! This gives the correct results and takes about half of the time of the pure regex approach on the Simulator (I have to test on the device later to see what the time comparison is on the iPhone 4 which is the reference device):

NSMutableArray *mutableArrayOfWordsToMatch = [[NSMutableArray alloc] initWithArray:array];
NSMutableArray *mutableArrayOfUnfoundWords = [[NSMutableArray alloc] init]; // I also need to know the unfound words.

NSUInteger pos = 0;

NSMutableString *mutablePatternString = [[NSMutableString alloc]initWithString:@"^(?:"];
int firstRound = 0;
for(NSString *word in array) {
    if(firstRound == 0) { // this is the first round

        firstRound++;
    } else { // this is all later rounds
        [mutablePatternString appendString:[NSString stringWithFormat:@"|"]];
    }
    [mutablePatternString appendString:[NSString stringWithFormat:@"%@",word]];
}

[mutablePatternString appendString:@")(?:\\([2-5]\\))?\t.*$"];

// This creates a string that reads "^(?:change|friday|model|monday|quidnunc|saturday|sunday|thursday|tuesday|wednesday)(?:\([2-5]\))?\t.*$"

// We don't want to instantiate the NSRegularExpression in the loop so let's use a pattern that matches everything we're interested in.

NSRegularExpression * regularExpression = [NSRegularExpression regularExpressionWithPattern:mutablePatternString
                                                                                    options:NSRegularExpressionAnchorsMatchLines
                                                                                      error:nil];
NSMutableArray * matches = [NSMutableArray array];

while (pos != data.length) {

    if([mutableArrayOfWordsToMatch count] <= 0) { // If we're at the top of the loop without any more words, stop.
        break;
    }  

    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
                    rangeOfString:[NSString stringWithFormat:@"\n%@\t",[mutableArrayOfWordsToMatch objectAtIndex:0]]
                    options:NSLiteralSearch
                    range:remaining
                    ]; // Just search for the first pronunciation.
    if (next.location != NSNotFound) {

        // If we find the first pronunciation, run the whole regex on a range of {position, 500} only.

        int rangeLocation = next.location;
        int searchPadding = 500;
        int rangeLength = searchPadding;

        if(data.length - next.location < searchPadding) { // Only use 500 if there is 500 more length in the data.
            rangeLength = data.length - next.location;
        } 

        [regularExpression enumerateMatchesInString:data 
                                            options:0
                                              range:NSMakeRange(rangeLocation, rangeLength)
                                         usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop){
                                             [matches addObject:[data substringWithRange:result.range]];
                                         }]; // Grab all the hits at once.

        next.length = next.location - pos;
        next.location = pos;
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove the word.
        pos += (next.length+1);
    } else { // No hits.
        [mutableArrayOfUnfoundWords addObject:[mutableArrayOfWordsToMatch objectAtIndex:0]]; // Add to unfound words.
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove from the word list.
    }
}    

[mutablePatternString release];
[mutableArrayOfUnfoundWords release];
[mutableArrayOfWordsToMatch release];

// return matches to caller

EDIT: here is another version which uses no regex and shaves a little bit more time off of the method:

NSMutableArray *mutableArrayOfWordsToMatch = [[NSMutableArray alloc] initWithArray:array];
NSMutableArray *mutableArrayOfUnfoundWords = [[NSMutableArray alloc] init]; // I also need to know the unfound words.

NSUInteger pos = 0;

NSMutableArray * matches = [NSMutableArray array];

while (pos != data.length) {

    if([mutableArrayOfWordsToMatch count] <= 0) { // If we're at the top of the loop without any more words, stop.
        break;
    }  

    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
                    rangeOfString:[NSString stringWithFormat:@"\n%@\t",[mutableArrayOfWordsToMatch objectAtIndex:0]]
                    options:NSLiteralSearch
                    range:remaining
                    ]; // Just search for the first pronunciation.
    if (next.location != NSNotFound) {
        NSRange lineRange = [data lineRangeForRange:NSMakeRange(next.location+1, next.length)];
        [matches addObject:[data substringWithRange:NSMakeRange(lineRange.location, lineRange.length-1)]]; // Grab the whole line of the hit.
        int rangeLocation = next.location;
        int rangeLength = 750;

        if(data.length - next.location < rangeLength) { // Only use the full 750 if there is that much room left in the string.
            rangeLength = data.length - next.location;
        } 
        rangeLength = rangeLength/5;
        int newlocation = rangeLocation;

        for(int i = 2;i < 6; i++) { // We really only need to do this from 2-5.
            NSRange morematches = [data
                            rangeOfString:[NSString stringWithFormat:@"\n%@(%d",[mutableArrayOfWordsToMatch objectAtIndex:0],i]
                            options:NSLiteralSearch
                            range:NSMakeRange(newlocation, rangeLength)
                            ];
            if(morematches.location != NSNotFound) {
                NSRange moreMatchesLineRange = [data lineRangeForRange:NSMakeRange(morematches.location+1, morematches.length)]; // Plus one because I don't actually want the line break at the beginning.
                 [matches addObject:[data substringWithRange:NSMakeRange(moreMatchesLineRange.location, moreMatchesLineRange.length-1)]]; // Minus one because I don't actually want the line break at the end.
                newlocation = morematches.location;

            } else {
                break;   
            }
        }

        next.length = next.location - pos;
        next.location = pos;
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove the word.
        pos += (next.length+1);
    } else { // No hits.
        [mutableArrayOfUnfoundWords addObject:[mutableArrayOfWordsToMatch objectAtIndex:0]]; // Add to unfound words.
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove from the word list.
    }
}    

[mutableArrayOfUnfoundWords release];
[mutableArrayOfWordsToMatch release];

If you go this route, you could replace the code that searches with regexp by a loop that searches for `@"\n%@("`, because you know that there will be no other matches of `@"\n%@\t"`. — Sergey Kalinichenko, Aug 15 '12 at 14:24
That did help, it brought the Simulator time down from .04 (and change) to .035. I'll edit it in below. — Halle, Aug 15 '12 at 16:03

score 1 · Answer 4 · answered Aug 14 '12 at 17:26

Looking at the dictionary file you provided, I'd say that a reasonable strategy could be reading in the data and putting it into any sort of persistent data store.

Read through the file and create objects for each unique word, with n strings of pronunciations (where n is the number of unique pronunciations). The dictionary is already in alphabetical order, so if you parsed it in the order that you're reading it you'd end up with an alphabetical list.

Then you can do a binary search on the data - even with a HUGE number of objects a binary search will find what you're looking for very quickly (assuming alphabetical order).

You could probably even keep the whole thing in memory if you need lightning-fast performance.
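A minimal sketch of the binary-search lookup, assuming the unique words have already been pulled out of the file into an NSArray. The helper name `IndexOfWord` is hypothetical. The array must be sorted with the same comparison the search uses, so the sketch assumes it was sorted with `compare:` rather than relying on the order of the file itself.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: binary search for `word` in an array of NSString objects
// that was sorted with -compare:. Returns NSNotFound if the word is absent.
static NSUInteger IndexOfWord(NSArray *sortedWords, NSString *word) {
    return [sortedWords indexOfObject:word
                        inSortedRange:NSMakeRange(0, [sortedWords count])
                              options:0
                      usingComparator:^NSComparisonResult(id a, id b) {
        return [(NSString *)a compare:(NSString *)b];
    }];
}
```

The returned index could then be used to fetch the pronunciations from a parallel array, or from whatever per-word object is stored alongside the key.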

I think it wouldn't be a great solution for this particular case, since it would be a new dependency for the framework. Thank you for the suggestion though, I'll keep it in mind. — Halle, Aug 14 '12 at 17:33
This is just a suggestion in case it turns out to not be possible to get the performance you need with regex. I can understand not wanting to change code that already works pretty well. — Dustin, Aug 14 '12 at 17:37
