Retrieving Array of Links with regex - Swift

Question

I am trying to parse an html page that contains these values:

<a href="somesite.html?id=123">...</a>
<a href="somesite.html?id=456">...</a>
<a href="somesite.html?id=789">...</a>
<a href="anothersite.html">...</a>

How would I parse the HTML string to get back an array that contains only the somesite.html links:

["somesite.html?id=123", "somesite.html?id=456", "somesite.html?id=789"]

Edited

Using Zhiguo Wang's base answer, I can't seem to get only the somesite.html id values... The 3rd item in the array contains excess characters:

let htmlString = "<a href=\"somesite.html?id=123\">...</a>" +
"<a href=\"somesite.html?id=456\">...</a>" +
"<a href=\"somesite.html?id=789\">...</a>" +
"<a href=\"anothersite.html\">...</a>\""
let seperateComponent = "<a href=\"somesite.html?id="

let linkExp = "[\\w\\W]*\">"

Returns this value:

["123", "456", "789\\">...</a><a href=\\"anothersite.html"]

Expected Value: ["123", "456", "789"]

...hmm. Changing linkExp to the below resolves it. What does \W represent in Regex?

let linkExp = "[\\w]*\">"

The length was also wrong. I cast the string to NSString to get the proper length.
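On the \W question: in the ICU regex flavor that NSRegularExpression uses, \w matches a word character (letters, digits, underscore) and \W matches anything that is not one. [\w\W] therefore matches every character, and the greedy * runs on to the last "> in the string, which is where the excess characters come from. A minimal sketch in current Swift syntax (not the Swift 1.2 used elsewhere in this thread); firstMatch(of:in:) and piece are hypothetical names introduced only for illustration:

```swift
import Foundation

// Hypothetical helper: returns the text of the first match of `pattern`, if any.
func firstMatch(of pattern: String, in text: String) -> String? {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return nil }
    let ns = text as NSString
    let whole = NSRange(location: 0, length: ns.length) // length in UTF-16 units
    guard let match = regex.firstMatch(in: text, range: whole) else { return nil }
    return ns.substring(with: match.range)
}

// The piece left after splitting on "<a href=\"somesite.html?id=" for the third link.
let piece = "789\">...</a><a href=\"anothersite.html\">...</a>"

// [\w\W]* is greedy and matches any character, so it stops at the LAST "\">".
let greedy = firstMatch(of: "[\\w\\W]*\">", in: piece)
// \w* only matches word characters, so it stops at the first quote.
let wordOnly = firstMatch(of: "\\w*\">", in: piece)
// A lazy quantifier (*?) is another way to stop at the FIRST "\">".
let lazy = firstMatch(of: "[\\w\\W]*?\">", in: piece)

print(greedy ?? "no match")   // 789">...</a><a href="anothersite.html">
print(wordOnly ?? "no match") // 789">
print(lazy ?? "no match")     // 789">
```

Note that \w* would also cut off an id containing characters such as "-" or "%", so the lazy form may be the safer of the two if ids are not purely alphanumeric.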

Edited 2

It looks like if this string comes before the somesite links, then origin is included in the array:

<meta name=\"referrer\" content=\"origin\">

@Wongzigii I feel like there's an easier solution than a 3rd party library. E.g all those a tags contain the same format of "somesite.html?id=". Can't regex do a find on those first characters up until the id=, then stop at the first double quotes? Idk how that would look though — Tim Nuwin, Sep 22 '15 at 04:18
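The idea in this comment (match the fixed prefix up to id=, then stop at the first double quote) can be sketched with a single pattern and a capture group. This is only an illustration in current Swift syntax, assuming every wanted link is written exactly as <a href="somesite.html?id=..."> with double quotes; captures(of:in:) and the sample html string are hypothetical names, not part of any answer here:

```swift
import Foundation

// Hypothetical helper: collects capture group 1 from every match of `pattern`.
func captures(of pattern: String, in html: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let ns = html as NSString
    let whole = NSRange(location: 0, length: ns.length) // UTF-16 length, not UTF-8 bytes
    return regex.matches(in: html, range: whole).map { match in
        ns.substring(with: match.range(at: 1))
    }
}

// Sample input, including the <meta> tag from "Edited 2".
let html = "<meta name=\"referrer\" content=\"origin\">" +
    "<a href=\"somesite.html?id=123\">...</a>" +
    "<a href=\"somesite.html?id=456\">...</a>" +
    "<a href=\"somesite.html?id=789\">...</a>" +
    "<a href=\"anothersite.html\">...</a>"

// Literal prefix, then [^"]* consumes everything up to (not including) the next quote.
let links = captures(of: "<a href=\"(somesite\\.html\\?id=[^\"]*)\"", in: html)
let ids = captures(of: "<a href=\"somesite\\.html\\?id=([^\"]*)\"", in: html)

print(links) // ["somesite.html?id=123", "somesite.html?id=456", "somesite.html?id=789"]
print(ids)   // ["123", "456", "789"]
```

Because the pattern anchors on the literal <a href="somesite.html?id= prefix, the <meta> tag and anothersite.html never match. It will miss links that use single quotes, extra attributes before href, or different spacing, which is the usual argument for a real HTML parser.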

Zhiguo Wang · Accepted Answer · 2015-09-22T05:34:16.537

talk is cheap, show me the code

    let htmlString = "<a href=\"somesite.html?id=123\">...</a><a href=\"somesite.html?id=456\">...</a><a href=\"somesite.html?id=789\">...</a>"
    let seperateComponent = "<a href=\""

    let linkExp = "[\\w\\W]*\">"
    let linkRegExp = NSRegularExpression(pattern:linkExp, options: NSRegularExpressionOptions.CaseInsensitive, error: nil)
    let seperatedArray = htmlString.componentsSeparatedByString(seperateComponent)
    var resultArray = [String]()

    if seperatedArray.count > 1 {
        for seperatedString in seperatedArray {
            if seperatedString.lengthOfBytesUsingEncoding(NSUTF8StringEncoding) > 3{
                let myRange = linkRegExp!.rangeOfFirstMatchInString(seperatedString, options:NSMatchingOptions.ReportCompletion, range: NSMakeRange(0, seperatedString.lengthOfBytesUsingEncoding(NSUTF8StringEncoding)))
                if myRange.location != NSNotFound {
                    let matchString = (seperatedString as NSString).substringWithRange(myRange)
                    let linkString = (matchString as NSString).substringToIndex(matchString.lengthOfBytesUsingEncoding(NSUTF8StringEncoding) - 2)

                    resultArray.append(linkString)
                }
            }
        }
    }

    println(resultArray)

This code was run on Xcode 6.4 and produces the correct result. Sorry, I need at least 10 reputation to post images, so I can't include a screenshot of the output here.

Note that this will crash if the input string contains multiple non-ASCII characters like "ÄÖÜ" or "€". The reason is that counting UTF-8 bytes is *not* the right method to compute an NSRange. Compare e.g. http://stackoverflow.com/questions/27880650/swift-extract-regex-matches. — Martin R, Sep 22 '15 at 05:37
Thanks Zhiguo, I have updated the question to handle one last test case. — Tim Nuwin, Sep 22 '15 at 12:25
I can't seem to get it to work if i set the separatorComponent to: let seperateComponent = " — Tim Nuwin, Sep 22 '15 at 23:14
Thanks Martin R,that's really a serious situation that i've never thought about. — Zhiguo Wang, Sep 25 '15 at 08:53
And @TimNuwin I understand your problem now, and I'm glad to help. 1. \W matches any non-word character, i.e. anything other than letters, digits and the underscore (the opposite of \w), so [\w\W] matches any character at all. 2. if you let seperateComponent = " — Zhiguo Wang, Sep 25 '15 at 09:16
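To illustrate Martin R's point about byte counts: NSRange is measured in UTF-16 code units, so a length taken from the UTF-8 byte count is too large as soon as the string contains non-ASCII characters, and passing a range that runs past the end of the string to NSRegularExpression raises an exception. A minimal sketch in current Swift syntax; the sample text and variable names are made up for the example:

```swift
import Foundation

// "ÄÖÜ€ id=123\"" written with escapes so the exact code points are unambiguous.
let text = "\u{C4}\u{D6}\u{DC}\u{20AC} id=123\""

let utf8Bytes = text.lengthOfBytes(using: .utf8) // what the answer's code uses
let utf16Units = (text as NSString).length       // what NSRange expects

print(utf8Bytes)  // 17
print(utf16Units) // 12

// Building the range from the string itself avoids the mismatch entirely.
let safeRange = NSRange(text.startIndex..<text.endIndex, in: text)

var digits: String? = nil
if let regex = try? NSRegularExpression(pattern: "\\d+"),
   let match = regex.firstMatch(in: text, range: safeRange),
   let swiftRange = Range(match.range, in: text) {
    // Convert the NSRange back to a Swift range before slicing.
    digits = String(text[swiftRange])
}

print(digits ?? "no match") // 123
```

For pure-ASCII input the two counts are equal, which is presumably why the answer's code appears to work on the sample HTML.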

score 0 · Answer 2 · edited May 23 '17 at 11:44

I think regular expressions may break down when parsing HTML files. There are better ways to parse HTML on iOS. Here is a tutorial on this. TFHpple and NDHpple are your friends here.

Here is a related SO thread.

answered Sep 22 '15 at 04:35

Abhinav

score 0 · Answer 3 · answered Sep 25 '15 at 09:29

Here's the improved code:

    let htmlString = "<a href=\"somesite.html?id=123\">...</a>" +
        "<a href=\"somesite.html?id=456\">...</a>" +
        "<a href=\"somesite.html?id=789\">...</a>" +
    "<a href=\"anothersite.html\">...</a>\""
    let seperateComponent = "<a href=\""

    let linkExp = "[\\w\\W]*\">"
    let linkRegExp = NSRegularExpression(pattern:linkExp, options: NSRegularExpressionOptions.CaseInsensitive, error: nil)
    let seperatedArray = htmlString.componentsSeparatedByString(seperateComponent)
    var resultArray = [String]()

    if seperatedArray.count > 1 {
        for seperatedString in seperatedArray {
            if seperatedString.lengthOfBytesUsingEncoding(NSUTF8StringEncoding) > 3{
                let myRange = linkRegExp!.rangeOfFirstMatchInString(seperatedString, options:NSMatchingOptions.ReportCompletion, range: NSMakeRange(0, seperatedString.lengthOfBytesUsingEncoding(NSUTF8StringEncoding)))
                if myRange.location != NSNotFound {
                    let matchString = (seperatedString as NSString).substringWithRange(myRange)

                    let linkWished = "somesite.html?id="

                    if matchString.componentsSeparatedByString(linkWished).count > 1{

                        var linkString = (matchString as NSString).substringFromIndex(linkWished.lengthOfBytesUsingEncoding(NSUTF8StringEncoding))

                        linkString = (linkString as NSString).substringToIndex(linkString.lengthOfBytesUsingEncoding(NSUTF8StringEncoding) - 2)

                        resultArray.append(linkString)
                    }


                }
            }
        }
    }

    println(resultArray)
