I am trying to parse an HTML page that contains these links:

<a href="somesite.html?id=123">...</a>
<a href="somesite.html?id=456">...</a>
<a href="somesite.html?id=789">...</a>
<a href="anothersite.html">...</a>

How would I parse the HTML string to get back an array that contains only the somesite.html links?

["somesite.html?id=123", "somesite.html?id=456", "somesite.html?id=789"]

Edited

Using Zhiguo Wang's base answer, I can't seem to get only the somesite.html id values... The 3rd item in the array contains excess characters:

let htmlString = "<a href=\"somesite.html?id=123\">...</a>" +
"<a href=\"somesite.html?id=456\">...</a>" +
"<a href=\"somesite.html?id=789\">...</a>" +
"<a href=\"anothersite.html\">...</a>\""
let seperateComponent = "<a href=\"somesite.html?id="

let linkExp = "[\\w\\W]*\">"

Returns this value:

["123", "456", "789\\">...</a><a href=\\"anothersite.html"]

Expected Value: ["123", "456", "789"]

...hmm. Changing linkExp to the below resolves it. What does \W represent in Regex?

let linkExp = "[\\w]*\">"

...The length was wrong. I cast the string to NSString to get the proper length.
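
The mismatch comes from the unit being counted. As a minimal sketch, assuming a current Swift toolchain with Foundation (the sample string `text` is made up for illustration), the UTF-8 byte count and the UTF-16 length that `NSRange` expects diverge as soon as a non-ASCII character appears:

```swift
import Foundation

// Illustrative input, not taken from the page being parsed.
let text = "id=€1"

// NSRange is measured in UTF-16 code units, which is what NSString.length reports.
let utf16Length = (text as NSString).length
// lengthOfBytes(using:) counts UTF-8 bytes; "€" alone takes three of them.
let utf8Bytes = text.lengthOfBytes(using: .utf8)

// A range covering the whole string, safe to hand to NSRegularExpression.
let fullRange = NSRange(location: 0, length: utf16Length)

print(utf16Length, utf8Bytes) // 5 7
```

Passing the byte count as the range length would ask for seven units from a five-unit string, which is why the range has to be built from the NSString length.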

Edited 2

It looks like if the following tag appears before the somesite links, then "origin" is included in the array:

<meta name=\"referrer\" content=\"origin\">
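
One way to keep unrelated tags such as the `<meta>` element out of the result is to let a single pattern describe the whole wanted link and read the id from a capture group, instead of splitting the string first. The following is only a sketch under assumptions: it uses current Swift with Foundation, the helper name `extractIDs(from:)` is hypothetical, and it expects `href` to be the first attribute with a double-quoted value, as in the sample markup.

```swift
import Foundation

// Hypothetical helper: returns the id values of links to somesite.html.
func extractIDs(from html: String) -> [String] {
    // (\w+) captures the id; the rest anchors the match to the wanted link only.
    let pattern = "<a href=\"somesite\\.html\\?id=(\\w+)\">"
    guard let regex = try? NSRegularExpression(pattern: pattern, options: [.caseInsensitive]) else {
        return []
    }
    let source = html as NSString
    // The range is measured in UTF-16 code units, matching NSString.length.
    let fullRange = NSRange(location: 0, length: source.length)
    return regex.matches(in: html, options: [], range: fullRange).map {
        source.substring(with: $0.range(at: 1))
    }
}

let page = "<meta name=\"referrer\" content=\"origin\">" +
    "<a href=\"somesite.html?id=123\">...</a>" +
    "<a href=\"somesite.html?id=456\">...</a>" +
    "<a href=\"somesite.html?id=789\">...</a>" +
    "<a href=\"anothersite.html\">...</a>"

print(extractIDs(from: page)) // ["123", "456", "789"]
```

Because the pattern has to match `<a href="somesite.html?id=` literally, the `content="origin">` attribute never qualifies, whichever tag comes first.
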
Tim Nuwin

3 Answers

talk is cheap, show me the code

    let htmlString = "<a href=\"somesite.html?id=123\">...</a><a href=\"somesite.html?id=456\">...</a><a href=\"somesite.html?id=789\">...</a>"
    let seperateComponent = "<a href=\""

    let linkExp = "[\\w\\W]*\">"
    let linkRegExp = NSRegularExpression(pattern:linkExp, options: NSRegularExpressionOptions.CaseInsensitive, error: nil)
    let seperatedArray = htmlString.componentsSeparatedByString(seperateComponent)
    var resultArray = [String]()

    if seperatedArray.count > 1 {
        for seperatedString in seperatedArray {
            if seperatedString.lengthOfBytesUsingEncoding(NSUTF8StringEncoding) > 3{
                let myRange = linkRegExp!.rangeOfFirstMatchInString(seperatedString, options:NSMatchingOptions.ReportCompletion, range: NSMakeRange(0, seperatedString.lengthOfBytesUsingEncoding(NSUTF8StringEncoding)))
                if myRange.location != NSNotFound {
                    let matchString = (seperatedString as NSString).substringWithRange(myRange)
                    let linkString = (matchString as NSString).substringToIndex(matchString.lengthOfBytesUsingEncoding(NSUTF8StringEncoding) - 2)

                    resultArray.append(linkString)
                }
            }
        }
    }

    println(resultArray)

This code was run on Xcode 6.4 and produces the correct result. Sorry, I need at least 10 reputation to post images, so I can't post a screenshot of the result here.

  • Note that this will crash if the input string contains multiple non-ASCII characters like "ÄÖÜ" or "€". The reason is that counting UTF-8 bytes is *not* the right method to compute an NSRange. Compare e.g. http://stackoverflow.com/questions/27880650/swift-extract-regex-matches. – Martin R Sep 22 '15 at 05:37
  • Thanks Zhiguo, I have updated the question to handle one last test case. – Tim Nuwin Sep 22 '15 at 12:25
  • I can't seem to get it to work if i set the separatorComponent to: let seperateComponent = " – Tim Nuwin Sep 22 '15 at 23:14
  • Thanks Martin R,that's really a serious situation that i've never thought about. – Zhiguo Wang Sep 25 '15 at 08:53
  • And @TimNuwin, I understand your problem now, and I'm glad to help. 1. \W matches any non-word character, that is, anything other than a letter, a digit, or an underscore, so [\w\W] matches any character at all. 2. if you let seperateComponent = " – Zhiguo Wang Sep 25 '15 at 09:16
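
The difference between `\w` and `\W` raised in these comments can be checked directly. This is a small sketch, assuming current Swift with Foundation. The helper `matchesEntirely` is hypothetical, and `segment` reproduces the third component from the question's output.

```swift
import Foundation

// Hypothetical helper: true if the first match of `pattern` covers all of `text`.
func matchesEntirely(_ pattern: String, _ text: String) -> Bool {
    guard let range = text.range(of: pattern, options: .regularExpression) else { return false }
    return range == text.startIndex..<text.endIndex
}

// \w matches a "word" character: essentially a letter, a digit, or an underscore.
print(matchesEntirely("\\w", "a"))  // true
print(matchesEntirely("\\w", "\"")) // false
// \W is the negation: any character that is not a word character.
print(matchesEntirely("\\W", "\"")) // true

let segment = "789\">...</a><a href=\"anothersite.html\">"
// [\w\W]* accepts every character, so it greedily runs to the last `">`.
let greedy = segment.range(of: "[\\w\\W]*\">", options: .regularExpression)
    .map { String(segment[$0]) }
// \w* stops at the first non-word character, here the quote after 789.
let wordsOnly = segment.range(of: "\\w*\">", options: .regularExpression)
    .map { String(segment[$0]) }
```

This is why switching `linkExp` from `[\w\W]*">` to `[\w]*">` removed the excess characters from the third item.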

I think regular expressions tend to break down when parsing HTML. There are better ways of parsing HTML files on iOS. Here is a tutorial on this. TFHpple and NDHpple are your friends here.

Here is a related SO thread.

Abhinav

Here's the improved code:

    import Foundation

    let htmlString = "<a href=\"somesite.html?id=123\">...</a>" +
        "<a href=\"somesite.html?id=456\">...</a>" +
        "<a href=\"somesite.html?id=789\">...</a>" +
        "<a href=\"anothersite.html\">...</a>\""
    let seperateComponent = "<a href=\""

    // Matches everything up to and including the closing "> of the tag.
    let linkExp = "[\\w\\W]*\">"
    let linkRegExp = NSRegularExpression(pattern: linkExp, options: NSRegularExpressionOptions.CaseInsensitive, error: nil)
    let seperatedArray = htmlString.componentsSeparatedByString(seperateComponent)
    var resultArray = [String]()

    // Only links that start with this prefix are kept.
    let linkWished = "somesite.html?id=" as NSString

    if seperatedArray.count > 1 {
        for seperatedString in seperatedArray {
            // NSRange counts UTF-16 code units, so use NSString.length instead of a UTF-8 byte count.
            let nsSeperated = seperatedString as NSString
            if nsSeperated.length > 3 {
                let myRange = linkRegExp!.rangeOfFirstMatchInString(seperatedString, options: NSMatchingOptions.ReportCompletion, range: NSMakeRange(0, nsSeperated.length))
                if myRange.location != NSNotFound {
                    let matchString = nsSeperated.substringWithRange(myRange) as NSString

                    if matchString.hasPrefix(linkWished as String) {
                        // Drop the prefix and the trailing "> so that only the id remains.
                        let idLength = matchString.length - linkWished.length - 2
                        let linkString = matchString.substringWithRange(NSMakeRange(linkWished.length, idLength))

                        resultArray.append(linkString)
                    }
                }
            }
        }
    }

    println(resultArray)
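
For the array of full links that the question originally asked for, the split-then-match steps above can also be collapsed into one pattern with a capture group. This is a hedged sketch rather than a replacement for a real HTML parser: it assumes current Swift with Foundation, the helper name `hrefs(in:startingWith:)` is hypothetical, and it only recognizes anchors whose first attribute is a double-quoted `href`.

```swift
import Foundation

// Hypothetical helper: collects every href value that starts with `prefix`.
func hrefs(in html: String, startingWith prefix: String) -> [String] {
    // [^"]* stops at the closing quote, so a match cannot spill into the next tag.
    let pattern = "<a\\s+href=\"([^\"]*)\""
    guard let regex = try? NSRegularExpression(pattern: pattern, options: [.caseInsensitive]) else {
        return []
    }
    let source = html as NSString
    // The range is measured in UTF-16 code units, matching NSString.length.
    let fullRange = NSRange(location: 0, length: source.length)
    return regex.matches(in: html, options: [], range: fullRange)
        .map { source.substring(with: $0.range(at: 1)) }
        .filter { $0.hasPrefix(prefix) }
}

let sampleHTML = "<a href=\"somesite.html?id=123\">...</a>" +
    "<a href=\"somesite.html?id=456\">...</a>" +
    "<a href=\"somesite.html?id=789\">...</a>" +
    "<a href=\"anothersite.html\">...</a>"

print(hrefs(in: sampleHTML, startingWith: "somesite.html"))
// ["somesite.html?id=123", "somesite.html?id=456", "somesite.html?id=789"]
```

Filtering on the prefix after extraction keeps the pattern generic, so the same helper could collect links to a different page by passing another prefix.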