17

Cyclomatic complexity measures the number of linearly independent paths through a function, i.e. roughly how many branches it contains. Is there an existing function/tool to calculate it for R functions? If not, suggestions are appreciated for the best way to write one.

A cheap start towards this would be to count all the occurrences of if, ifelse or switch within your function. To get a real answer though, you need to understand where branches start and end, which is much harder. Maybe some R parsing tools would get us started?

Richie Cotton
  • Related to this, it can be useful to see how many branches actually are taken. For that, various code coverage metrics can help. I don't yet know of any code coverage tools for R, though. – Iterator Aug 12 '11 at 14:19
  • Could [this question](http://stackoverflow.com/questions/125898/tool-for-calculating-cyclomatic-complexity) be related? The [metrics plugin](http://eclipse-metrics.sourceforge.net/) for Eclipse might just make me use Eclipse again, if this works for R. – Iterator Aug 12 '11 at 14:20
  • Sorry for the abundance of comments, but I prefer to post answers with answers. :) Can you clarify why you'd like the cyclomatic complexity? In practice, I find that handling of outliers and bad data requires lots of separate if-statements in my code, which will run up the cyc. comp. a lot. This isn't bad, but the actual coverage of executed (both in testing and deployment) code is important for being sure that outliers and anomalies are handled. – Iterator Aug 12 '11 at 14:24
  • I erred - the metrics plugin is [just for Java](http://metrics.sourceforge.net/). What is the sound of one hope crushing? – Iterator Aug 12 '11 at 14:29
  • High cyclomatic complexity can be used as a proxy for "maybe I should refactor this code" or at least "Richie, you've gotten yourself muddled and overcomplicated things". It just gives another way of weeding out cr*ppy code. – Richie Cotton Aug 12 '11 at 14:43
  • You might really consider looking at code coverage relative to lines of code. If you have a lot of code that isn't executed in testing, then there is a problem. Also, code that is regularly used tends to be more correct (because usage is a form of testing :)) than unused code. – Iterator Aug 12 '11 at 14:48
  • @Brandon: `testthat` doesn't seem to have a function to measure code complexity. – Richie Cotton Aug 12 '11 at 19:55
  • I don't have a proof (and if I did, it would be marvelous, yet not fit in the margins afforded by a comment box), but I suspect it is possible to create a program with infinite cyclomatic complexity, where checking it would not be decidable. In short, I suspect that checking cyclomatic complexity could be a form of the halting problem. Then again, I just found that Wikipedia already mentions this: search for halting problem in [this entry](http://en.wikipedia.org/wiki/Code_coverage). It's not a proof, but I suspect it could be done. – Iterator Sep 05 '11 at 13:19
  • I think that `eval(parse(text = nefariousArbitraryInputStringWithLotsOfStochasticBranching))` could be a problem. – Iterator Sep 05 '11 at 13:24

2 Answers

7

You can use `codetools::walkCode` to walk the code tree. Unfortunately, the documentation for `codetools` is pretty sparse. Here's an explanation and a sample to get you started.

`walkCode` takes an expression and a code walker. A code walker is a list that you create, which must contain three callback functions: `handler`, `call`, and `leaf`. (You can use the helper function `makeCodeWalker` to provide sensible default implementations of each.) `walkCode` walks over the code tree and makes calls into the code walker as it goes.

`call(e, w)` is called when a compound expression is encountered. `e` is the expression and `w` is the code walker itself. The default implementation simply recurses into the expression's child nodes: `for (ee in as.list(e)) if (!missing(ee)) walkCode(ee, w)`.

`leaf(e, w)` is called when a leaf node in the tree is encountered. Again, `e` is the leaf node expression and `w` is the code walker. The default implementation is simply `print(e)`.

`handler(v, w)` is called for each compound expression and can be used to easily provide an alternative behavior to `call` for certain types of expressions. `v` is the character string naming the function at the head of the compound expression (`<-` if it's an assignment, `{` if it's the start of a block, `if` if it's an if-statement, etc.). If the handler returns `NULL`, then `call` is invoked as usual; if it returns a function instead, that function is called in place of `call`.
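To make the handler contract described above concrete, here is a minimal sketch. The function name `calledNames` and the choice to skip nested function definitions are illustrative assumptions, not something `codetools` prescribes. The handler records the name at the head of every call it sees, and returns a replacement function for `function` nodes so that the walker does not recurse into them.

```r
library(codetools)

# Hypothetical helper: collect the names of the functions called in `expr`,
# without descending into nested function definitions.
calledNames <- function(expr) {
  seen <- character(0)
  walker <- makeCodeWalker(
    handler = function(v, w) {
      seen <<- c(seen, v)
      if (v == "function")
        function(e, w) NULL  # used in place of `call`: do not recurse
      else
        NULL                 # fall through to the default `call`
    },
    leaf = function(e, w) NULL)  # suppress the default print(e)
  walkCode(expr, walker)
  unique(seen)
}

# The `while` loop sits inside a nested function, so it is not reported.
calledNames(quote({
  x <- if (a) 1 else 2
  f <- function(y) while (y) y <- y - 1
}))
```

Returning `NULL` from the handler keeps the default recursion, so only the node types you care about need special treatment.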

Here's an extremely simplistic example that counts occurrences of `if` and `ifelse` in a function. Hopefully this can at least get you started!

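library(codetools)

# Count calls to `if` and `ifelse` in the body of `func`.
countBranches <- function(func) {
  count <- 0
  walkCode(body(func),
           makeCodeWalker(
             handler = function(v, w) {
               # v is the name of the function being called at this node
               if (v %in% c("if", "ifelse"))
                 count <<- count + 1
               NULL  # allow normal recursion
             },
             leaf = function(e, w) NULL))  # suppress the default print(e)
  count
}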
Joe Cheng
  • Thanks. `codetools` looks very useful, though I see what you mean about the sparse documentation. I'm gonna have to play with this a little before I can get my head around what's going on. – Richie Cotton Aug 19 '11 at 14:01
  • Nice! I also have a few draft functions at https://github.com/hadley/devtools/wiki/Computing-on-the-language. Personally, I don't think the codeWalker stuff is mature enough to have much advantage over doing it by hand. – hadley Aug 20 '11 at 15:19
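For comparison, walking the tree by hand might look like the base-R sketch below. The function name `countDecisions` and the keyword list are assumptions made for illustration. The result is a count of decision-point calls, which is only a rough proxy for cyclomatic complexity (it ignores `&&`, `||`, early returns, and default arguments).

```r
# Hypothetical helper: count calls to branching/looping constructs by
# recursing over the language object directly, using base R only.
countDecisions <- function(expr,
                           keywords = c("if", "ifelse", "switch",
                                        "for", "while", "repeat")) {
  if (is.function(expr)) expr <- body(expr)
  if (!is.call(expr)) return(0L)  # symbols, constants, pairlists, NULL

  fn <- expr[[1]]
  n <- if (is.symbol(fn) && as.character(fn) %in% keywords) 1L else 0L

  # Visit the head and every argument; skip empty arguments such as m[1, ]
  for (child in as.list(expr)) {
    if (!missing(child)) n <- n + countDecisions(child, keywords)
  }
  n
}

# Example function (also hypothetical) with two ifs, a for, and a switch.
f <- function(x, m) {
  if (x > 0) {
    for (i in 1:x) print(i)
  } else if (x < 0) "neg" else switch(x, a = 1)
  m[1, ]
}
countDecisions(f)
```

Adding 1 to the result gives the usual "decision points plus one" approximation for a single-entry, single-exit function.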
4

Also, I just found a new package called cyclocomp (released 2016). Check it out!
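A minimal usage sketch, assuming the `cyclocomp` package is installed from CRAN (`install.packages("cyclocomp")`); the example function `signOf` is hypothetical. As far as I know, the package's main entry point is `cyclocomp()`, which accepts a function or an expression:

```r
# Guarded so the snippet is a no-op when the package is not installed.
if (requireNamespace("cyclocomp", quietly = TRUE)) {
  # Hypothetical function with two decision points
  signOf <- function(x) {
    if (x > 0) "positive" else if (x < 0) "negative" else "zero"
  }
  print(cyclocomp::cyclocomp(signOf))
}
```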

areyoujokingme