43

Why is this code only spitting out the entire regex match instead of the capture group?

Input

@"A long string containing Name:</td><td>A name here</td> amongst other things"

Output expected

A name here

Actual output

Name:</td><td>A name here</td>

Code

NSString *htmlString = @"A long string containing Name:</td><td>A name here</td> amongst other things";
NSRegularExpression *nameExpression = [NSRegularExpression regularExpressionWithPattern:@"Name:</td>.*\">(.*)</td>" options:NSRegularExpressionSearch error:nil];

NSArray *matches = [nameExpression matchesInString:htmlString
                                  options:0
                                    range:NSMakeRange(0, [htmlString length])];
for (NSTextCheckingResult *match in matches) {
    NSRange matchRange = [match range];
    NSString *matchString = [htmlString substringWithRange:matchRange];
    NSLog(@"%@", matchString);
}

Code taken from Apple docs. I know there are other libraries to do this, but I want to stick with what's built in for this task.

Maciej Swic

4 Answers

74

[match range] returns the range of the entire match. To get the first capture group, use rangeAtIndex:1 (index 0 is the whole match):

for (NSTextCheckingResult *match in matches) {
    //NSRange matchRange = [match range];
    NSRange matchRange = [match rangeAtIndex:1];
    NSString *matchString = [htmlString substringWithRange:matchRange];
    NSLog(@"%@", matchString);
}
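
For comparison, here is a minimal Swift sketch of the same idea. It assumes a simplified pattern written to fit the sample input (not the question's exact pattern), and the helper name firstMatchGroups is hypothetical. It collects every range of the first match so that index 0 (the whole match) and index 1 (the capture group) can be seen side by side.

```swift
import Foundation

// Sample input from the question.
let html = "A long string containing Name:</td><td>A name here</td> amongst other things"

// Assumption: a simplified pattern with one capture group around the cell contents.
let pattern = "Name:</td><td>(.*?)</td>"

// Hypothetical helper: returns the text of every range in the first match.
// Index 0 is the whole match; index 1 and up are the capture groups.
func firstMatchGroups(of pattern: String, in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let whole = NSRange(text.startIndex..., in: text)
    guard let match = regex.firstMatch(in: text, options: [], range: whole) else { return [] }
    return (0..<match.numberOfRanges).map { index in
        // A group that did not participate in the match has no valid range.
        Range(match.range(at: index), in: text).map { String(text[$0]) } ?? ""
    }
}

let groups = firstMatchGroups(of: pattern, in: html)
print(groups)
```

Here groups[0] holds the full match and groups[1] holds only the captured name, which mirrors the difference between range and rangeAtIndex:1 in the Objective-C loop.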
11

Don't parse HTML with regular expressions or NSScanner. Down that path lies madness.

This has been asked many times on SO.

parsing HTML on the iPhone

The data I am picking out is as simple as <td>Name: A name</td>, and I think it's simple enough to just use regular expressions instead of including a full-blown HTML parser in the project.

It's up to you, and I'm a strong advocate of "first to market has huge advantage".

The difference is that a proper HTML parser considers the structure of the document. With regular expressions, you are relying on the document's format never changing, even in ways that are otherwise perfectly valid syntactically.

For example, what if the input were <td class="name">Name: A name</td>? Your regex parser just broke on input that is both valid HTML and, from a tag-contents perspective, identical to the original input.
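
To make that failure concrete, here is a small Swift sketch. The patterns and the helper name firstCapture are assumptions for illustration only. A pattern written against the bare <td> stops matching as soon as an attribute appears, and a loosened pattern recovers this one case while still depending on the surrounding markup.

```swift
import Foundation

// Hypothetical helper: first capture group of the first match, or nil if nothing matches.
func firstCapture(_ pattern: String, in text: String) -> String? {
    guard let regex = try? NSRegularExpression(pattern: pattern),
          let match = regex.firstMatch(in: text, options: [],
                                       range: NSRange(text.startIndex..., in: text)),
          match.numberOfRanges > 1,
          let range = Range(match.range(at: 1), in: text) else { return nil }
    return String(text[range])
}

let plainCell = "<td>Name: A name</td>"
let attributedCell = "<td class=\"name\">Name: A name</td>"

// Assumed pattern that expects a bare <td> with no attributes.
let strictPattern = "<td>Name: (.*?)</td>"
// Assumed looser pattern that tolerates attributes on the opening tag.
let loosePattern = "<td[^>]*>Name: (.*?)</td>"

let fromPlain = firstCapture(strictPattern, in: plainCell)            // matches
let fromAttributed = firstCapture(strictPattern, in: attributedCell)  // no match
let fromLoose = firstCapture(loosePattern, in: attributedCell)        // matches again

print(fromPlain ?? "nil", fromAttributed ?? "nil", fromLoose ?? "nil")
```

The looser pattern handles this particular change, but each further variation (different nesting, extra whitespace, another tag) needs another patch to the pattern, which is the maintenance cost described above.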

Community
bbum
  • Why not if you only need a couple strings? No lists etc. – Maciej Swic Jul 26 '11 at 00:16
  • 1
    I.e., the input will never change in structure, the ordering of tags will never change, the code will never be reused or refactored, the tags will never have attributes added, and the encoding will never change? Sure. Go for it. (Seriously: if it is good enough for now, go for it. Just know that you've got a time suck on your hands that may crop up someday.) – bbum Jul 26 '11 at 11:36
  • Well, if you use a parser, you still have to know where in the document the data is, and a change in the original document will affect your parsing no matter how you do it. The data I am picking out is as simple as Name: A name, and I think it's simple enough to just use regular expressions instead of including a full-blown HTML parser in the project. – Maciej Swic Jul 28 '11 at 14:02
  • Regular expressions aren't sophisticated enough to properly parse HTML because HTML is not a regular language. You'd have to make a lot of assumptions about the input, such as limiting the nesting level. Your program will fail on any input that violates those rigid expectations. Are you willing to take that risk? –  Jul 28 '11 at 17:07
  • Yes, because I'm after simple strings like the one above, and it works perfectly now that I'm able to access the capture groups. – Maciej Swic Jul 29 '11 at 04:13
  • What if the input changes, such as the addition of the tag attribute in the last example in this answer? –  Jul 29 '11 at 16:12
  • @Preston If you're pulling content from an HTML page which you don't control and which you have no guarantees about the format of, you run the risk of your code breaking whether you use an HTML parser *or* a regex. Sure, you reduce the fragility a little by using an HTML parser (extra attributes being added breaks the regex solution but not the HTML parser solution, for instance), but nothing is going to protect you from the page being restructured, or the table content being moved into nested divs instead of a table, or the site going down. Your code is fragile either way. – Mark Amery Aug 05 '13 at 14:58
  • @Preston Now don't get me wrong: pulling content out of HTML with a good parser and XPath is going to be a bit simpler, a bit more readable, and a bit less fragile than doing it with regex. But the advantage, when extracting just a single value as the OP is, is small. For someone unfamiliar with them, learning how to use lxml or XPath from scratch to pull one or two values out of some HTML is overkill for the marginal benefits it offers. – Mark Amery Aug 05 '13 at 15:03
3

In Swift 3:

import Foundation

/// Two groups. 1: [A-Z]+, 2: [0-9]+
let pattern = "([A-Z]+)([0-9]+)"

let regex = try NSRegularExpression(pattern: pattern, options: [.caseInsensitive])

let str = "AA01B2C3DD4"
let nsStr = str as NSString

// NSRange counts UTF-16 code units, so take the length from NSString
// rather than from str.characters.count.
let results = regex.matches(in: str, options: [], range: NSMakeRange(0, nsStr.length))

for a in results {

    let c = a.numberOfRanges  //< Whole match + 2 groups = 3
    print(c)

    // Swift 3 API; Swift 4 and later spell this range(at:).
    let m0 = a.rangeAt(0)  //< Whole match, ex: 'AA01'
    let m1 = a.rangeAt(1)  //< Group 1: letters, ex: 'AA'
    let m2 = a.rangeAt(2)  //< Group 2: digits, ex: '01'
    // let m3 = a.rangeAt(3) //< Runtime exception: there is no group 3

    let s = nsStr.substring(with: m2)
    print(s)
}
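
For newer toolchains, a rough equivalent might look like the sketch below. It assumes Swift 4 or later, where the method is spelled range(at:), and it converts each NSRange back to a Swift string range instead of bridging to NSString. The pairs array is a name introduced here for illustration.

```swift
import Foundation

let text = "AA01B2C3DD4"

// The pattern is a fixed literal known to be valid, so try! is acceptable in a sketch.
let regex = try! NSRegularExpression(pattern: "([A-Z]+)([0-9]+)", options: [.caseInsensitive])

// NSRange(_:in:) measures the string in UTF-16 code units, as NSRegularExpression expects.
let fullRange = NSRange(text.startIndex..., in: text)

// Illustrative container for group 1 (letters) and group 2 (digits) of each match.
var pairs: [(letters: String, digits: String)] = []
for match in regex.matches(in: text, options: [], range: fullRange) {
    guard let lettersRange = Range(match.range(at: 1), in: text),
          let digitsRange = Range(match.range(at: 2), in: text) else { continue }
    pairs.append((letters: String(text[lettersRange]), digits: String(text[digitsRange])))
}

for pair in pairs {
    print(pair.letters, pair.digits)
}
```

Converting with Range(_:in:) returns nil for a group that did not participate in the match, so the guard also covers optional groups.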
AechoLiu
3

HTML isn't a regular language and can't be properly parsed using regular expressions. Here's a classic SO answer explaining this common programmer misconception.

Community
  • 3
    Don't parse HTML with regex because "Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp."? Grow up. – Maciej Swic Jul 29 '11 at 04:14
  • 5
    No, don't parse HTML with regex because you can't parse HTML with regex. –  Jul 29 '11 at 16:04
  • This needs clarification. Regex is a bad tool for parsing (X)HTML, but there is nothing in the nature of (X)HTML or regex that makes it "wrong". A regex applied to (X)HTML will behave as expected, it's just a bad tool for the job. – Balthazar Aug 18 '15 at 16:23