I am having trouble counting the number of <a ... </a> hyperlink tags in the HTML string fetched from any given website, as well as counting the number of occurrences of a single char in the same string. The latter seems to work, with my code so far being:
    let countChars (url:string) (tag: char) =
        let link = fetchUrl (url)
        // walk the fetched string index by index, counting chars equal to tag
        let rec loop i count =
            if i < link.Length then
                if (link.[i] = tag) then loop (i+1) (count+1)
                else loop (i+1) count
            else count
        loop 0 0
I am using the following to define my fetchUrl function:
    open System
    open System.Net

    let fetchUrl (url:string) : string =
        let req = WebRequest.Create(Uri(url))
        use resp = req.GetResponse()
        use stream = resp.GetResponseStream()
        use reader = new IO.StreamReader(stream)
        reader.ReadToEnd()
However, I am currently stuck on how to count the tags in the fetched string. In my example above I loop over the string and count only the occurrences of a single char, such as 'a', but I can't find a way to apply this when the substring I am actually searching for is an expression of several characters, namely the <a ... </a> expressions.
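To make the goal concrete, here is a rough sketch of the multi-character version I have in mind, run on a fixed string instead of a fetched page. The name countSubstring and the sample HTML are made up for illustration, and counting the literal text "<a " only approximates counting real anchor tags:

```fsharp
open System

// Hypothetical helper: counts non-overlapping occurrences of pattern in text.
let countSubstring (text: string) (pattern: string) =
    let rec loop (start: int) count =
        // IndexOf returns -1 once there is no further occurrence
        let idx = text.IndexOf(pattern, start, StringComparison.Ordinal)
        if idx < 0 then count
        else loop (idx + pattern.Length) (count + 1)
    // an empty pattern would match everywhere, so treat it as zero hits
    if String.IsNullOrEmpty pattern then 0 else loop 0 0

// Made-up sample input with two anchor tags
let sample = "<p><a href=\"x\">one</a> and <a href=\"y\">two</a></p>"
printfn "%d" (countSubstring sample "<a ")
```

This skips ahead by the pattern length after each hit, so overlapping occurrences are not counted, which should not matter for a pattern like "<a ".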
I have another solution that uses regular expressions to try to match the actual tag expression I am looking for. This code runs, but the return value is crazy:
    open System.Text.RegularExpressions

    let countTags (url:string) (tag:string) =
        let link = fetchUrl (url)
        let m = Regex.Match(link,tag)
        let rec loop i count =
            if i < link.Length then
                if m.Success then loop (i+1) (count+1)
                else loop (i+1) count
            else count
        loop 0 0
The results I get from calling this function are shown to the right of each call:
    printfn "%A" (countTags "https://forum.astronomisk.dk/" "(?s)<a [^>]*?>(?<text>.*?)</a>") // result: 75640
    printfn "%A" (countTags "https://www.ku.dk/" "(?s)<a [^>]*?>(?<text>.*?)</a>") // result: 57459
    printfn "%A" (countTags "https://www.google.com/" "(?s)<a [^>]*?>(?<text>.*?)</a>") // result: 47120
The results match my definition of "crazy": the function reports around 47-75k <a href=....</a> tags for these three simple pages. Calling the first function on the same pages, looking only for the char 'a', yields results around 2500-3000, which by my account is pretty reasonable and seems to work just fine.
Can anyone see what I am missing here? Is my implementation of the regular expression incorrect, since it returns such a high result? Or is there some other way to count the number of <a ... </a> tags in any given fetched page? I have tried to find the solution all day without managing to finish the project with working code.
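One alternative I have come across, but not yet verified against the real pages, is Regex.Matches, which as far as I understand returns every match rather than only the first one. A minimal sketch on a hand-written string follows. The name countMatches and the sample HTML are my own illustration; the pattern is the same one I pass to countTags:

```fsharp
open System.Text.RegularExpressions

// Hypothetical helper: counts every regex match in an already-fetched HTML string.
let countMatches (html: string) (pattern: string) =
    Regex.Matches(html, pattern).Count

// Made-up sample input: two list items with links, one without
let sampleHtml = "<ul><li><a href=\"/a\">A</a></li><li><a href=\"/b\">B</a></li><li>no link</li></ul>"
let anchorPattern = "(?s)<a [^>]*?>(?<text>.*?)</a>"
printfn "%d" (countMatches sampleHtml anchorPattern)
```

If this is the right direction, I assume countTags could pass the result of fetchUrl to something like this instead of looping over every index, but I would appreciate confirmation.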
Any help fixing what little remains is appreciated!