I am trying to count the number of <a ... </a> hyperlink tags in the HTML string fetched from any given website's URL, as well as the number of occurrences of a single char in the same string. The latter seems to work, with my code so far being:

let countChars (url:string) (tag: char) =
    let link = fetchUrl (url)
    let rec loop i count =
        if i < link.Length then
            if (link.[i] = tag) then loop (i+1) (count+1)
            else loop (i+1) count
        else count
    loop 0 0

I am using the following to define my fetchUrl function:

open System
open System.Net

let fetchUrl (url:string) : string =
    let req = WebRequest.Create(Uri(url))
    use resp = req.GetResponse()
    use stream = resp.GetResponseStream()
    use reader = new IO.StreamReader(stream)
    reader.ReadToEnd()

However, I am currently stuck: I cannot figure out how to count the tags in the fetched string. In the example above I loop over the string and count the occurrences of a single char, such as 'a', but I can't find a way to apply this when the substring I am searching for is an expression of several characters, namely the <a ... </a> tags.

I have another solution that uses a regular expression to match the tag I am actually looking for. This code runs, but the return value is crazy:

let countTags (url:string) (tag:string) =
    let link = fetchUrl (url)
    let m = Regex.Match(link,tag)
    let rec loop i count =
        if i < link.Length then
            if m.Success then loop (i+1) (count+1)
            else loop (i+1) count
        else count
    loop 0 0

The results I get from calling this function are shown to the right of each call:

printfn "%A" (countTags "https://forum.astronomisk.dk/" "(?s)<a [^>]*?>(?<text>.*?)</a>") --> result: 75640

printfn "%A" (countTags "https://www.ku.dk/" "(?s)<a [^>]*?>(?<text>.*?)</a>") --> result: 57459

printfn "%A" (countTags "https://www.google.com/" "(?s)<a [^>]*?>(?<text>.*?)</a>") --> result: 47120

The results correspond to my definition of "crazy": the calls report around 47-75k <a href=....</a> tags for the 3 simple URLs. Calling the first function on the same URLs, looking only for the char 'a', yields results around 2500-3000, which by my account is pretty reasonable and seems to be working just fine.

Can anyone see what I am missing here? Is my regular expression incorrect, since it returns such a high result? Or is there some other way to count the number of <a ... </a> tags in the string fetched from any given URL? I have tried to find a solution all day without getting the code to work.

Any help fixing what little remains is appreciated!

Mr Lister
NewDev90

1 Answer

Your first problem is that countTags returns the length of the document you are looking at (if the document contains an anchor tag).

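The reason for this is that Regex.Match only searches for the first occurrence of a match, and it is called once before the loop. If that match exists, 'm.Success' is true on every iteration, so the loop adds 1 for every index of the string.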

What you want is Regex.Matches. This gives you a MatchCollection that you can take '.Count' of.
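As a rough illustration of the difference, here is a minimal sketch that runs both calls on a small in-memory string instead of a downloaded page. The `sample` string is a made-up input for illustration only; the pattern is the one from the question.

```fsharp
open System.Text.RegularExpressions

// Hypothetical sample input, not taken from any of the sites in the question.
let sample = "<p><a href=\"/x\">one</a> and <a href=\"/y\">two</a></p>"
let pattern = "(?s)<a [^>]*?>(?<text>.*?)</a>"

// Regex.Match returns only the first match; its Success flag never changes afterwards.
let first = Regex.Match(sample, pattern)

// Regex.Matches returns every non-overlapping match in the input.
let all = Regex.Matches(sample, pattern)

printfn "first match found: %b" first.Success
printfn "document length:   %d" sample.Length
printfn "anchor tags:       %d" all.Count
```

Looping over every index while checking first.Success would report the document length here, whereas all.Count reports the two anchor tags.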

Also take a look at this for a regex that matches anchor tags.

To clarify, you can do

let countTags (url:string) (tag:string) =
    let link = fetchUrl url
    let regex = Regex tag
    regex.Matches(link).Count
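If you would rather keep the recursive loop from the question, one possible variant is to step from match to match with Match.NextMatch instead of stepping through string indices. This is only a sketch: `countMatchesIn` and `sampleHtml` are hypothetical names, and the function takes already-fetched text so that it can be tried without a network call.

```fsharp
open System.Text.RegularExpressions

// Hypothetical helper: counts matches of `pattern` in text that has already been downloaded.
let countMatchesIn (html: string) (pattern: string) : int =
    // Walk from match to match; NextMatch resumes searching where the previous match ended.
    let rec loop (m: Match) count =
        if m.Success then loop (m.NextMatch()) (count + 1)
        else count
    loop (Regex.Match(html, pattern)) 0

// Made-up input with two anchor tags and one non-anchor element.
let sampleHtml = "<a href=\"/a\">A</a><a href=\"/b\">B</a><span>no link</span>"
printfn "%d" (countMatchesIn sampleHtml "(?s)<a [^>]*?>(?<text>.*?)</a>")
```

Assuming fetchUrl from the question, countMatchesIn (fetchUrl url) tag should give the same number as regex.Matches(link).Count.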
nilekirk
  • Hmm, all right, that makes a lot of sense. How would you go about implementing this? I see the idea of using Regex.Matches instead, but then I need some way of using the MatchCollection in my counting if statement: let m = Regex.Matches(link,tag) let rec loop i count = if i < link.Length then if m.Success then loop (i+1) (count+1) // this part needs redoing. – NewDev90 Dec 01 '18 at 16:15
  • I am still learning F# by the day and am still a bit new to it; I have not used Matches or MatchCollection before. – NewDev90 Dec 01 '18 at 16:18
  • Hmm, all right, that makes sense. I did find the answer I was looking for :-) Thanks a lot for taking the time! – NewDev90 Dec 01 '18 at 18:34
  • In this case, @KenLar, I'm sure nilekirk would appreciate it if you upvoted and accepted the answer as the solution ;) – Markus Deibel Dec 03 '18 at 07:08
  • That makes a lot of sense :) tbh I haven't done more than a handful of posts so far, so I wasn't really aware of this way to close my questions. But of course it makes a lot of sense. – NewDev90 Dec 04 '18 at 09:46