6

A while back I wrote a function to fill time series matrices that had NA values up according to the specifications needed and it's had occational uses on a few matrices that are about 50000 rows, 350 columns. The matrix can contain either numeric or character values. The main problem is fixing the matrix is slow and I thought I'd gauge some experts on how to do this faster.

I guess going to rcpp or paralleling it might help but I think it's might be my design rather than R itself that's inefficient. I generally vecotrize everything in R but since the missing values follow no pattern I've found no other way than to work with the matrix on a per row basis.

The function needs to be called so it can carry forwards missing values and also be called to quickly just fill the latest values with the last known one.

Here is an example matrix:

testMatrix <- structure(c(NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, 
 4.03, NA, NA, NA, 29.98, 66.89, NA, -12.78, -11.65, NA, NA, NA, 
 NA, NA, 29.98, 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, NA, 
 66.89, NA, -12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA, 
 -12.78, NA, NA, 4.76, NA, NA, NA, 29.98, 66.89, NA, -12.78, NA, 
 NA, 4.39, NA, NA, NA, 29.98, 66.89, NA, -10.72, -11.65, NA, 4.39, 
 NA, NA, NA, 29.98, 50.65, NA, -10.72, -11.65, NA, 4.39, NA, NA, 
 4.72, NA, 50.65, NA, -10.72, -38.61, 45.3, NA), .Dim = c(10L, 
 9L), .Dimnames = list(c("ID_a", "ID_b", "ID_c", "ID_d", "ID_e", 
 "ID_f", "ID_g", "ID_h", "ID_i", "ID_j"), c("2010-09-30", "2010-10-31", 
 "2010-11-30", "2010-12-31", "2011-01-31", "2011-02-28", "2011-03-31", 
 "2011-04-30", "2011-05-31")))

print(testMatrix)
     2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_b         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_c         NA         NA         NA         NA         NA         NA         NA         NA       4.72
ID_d      29.98      29.98      29.98         NA      29.98      29.98      29.98      29.98         NA
ID_e      66.89      66.89      66.89      66.89      66.89      66.89      66.89      50.65      50.65
ID_f         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_g     -12.78     -12.78     -12.78     -12.78     -12.78     -12.78     -10.72     -10.72     -10.72
ID_h     -11.65     -11.65         NA         NA         NA         NA     -11.65     -11.65     -38.61
ID_i         NA         NA         NA         NA         NA         NA         NA         NA      45.30
ID_j       4.03         NA       4.76       4.76       4.76       4.39       4.39       4.39         NA

This is the function I currently use:

# ----------------------------------------------------------------------------
# GetMatrixWithBlanksFilled
# ----------------------------------------------------------------------------
#
# Arguments:
# inputMatrix --- A matrix with gaps in the time series rows
# fillGapMax  --- The max number of columns to carry a number
#                 forward if there are no more values in the
#                 time series row.
#
# Returns:
# A matrix with gaps filled.

GetMatrixWithBlanksFilled <- function(inputMatrix, fillGapMax = 6, forwardLooking = TRUE) {

    if("DEBUG_ON" %in% ls(globalenv())){browser()}

    cntRow <- nrow(inputMatrix)
    cntCol <- ncol(inputMatrix)

    # 
    if (forwardLooking) {
        for (i in 1:cntRow) {
            # Store the location of the first non NA element in the row
            firstValueCol <- (1:cntCol)[!is.na(inputMatrix[i,])][1]
            if (!(is.na(firstValueCol))) {
                if (!(firstValueCol == cntCol)) {
                    nextValueCol <- firstValueCol
                    # If there is a a value number in the row and it's not at the end of the time
                    # series, start iterating through the row while there are more NA values and
                    # more data values and not at the end of the row continue.
                    while ((sum(as.numeric(is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && (sum(as.numeric(!is.na(inputMatrix[i,nextValueCol:cntCol]))))>0 && !(nextValueCol == cntCol)) {
                        # Find the next NA element
                        nextNaCol <- (nextValueCol:cntCol)[is.na(inputMatrix[i,nextValueCol:cntCol])][1]
                        # Find the next value element
                        nextValueCol <- (nextNaCol:cntCol)[!is.na(inputMatrix[i,nextNaCol:cntCol])][1]
                        # If there is another value element then fill up all NA elements in between with the last known value
                        if (!is.na(nextValueCol)) {
                            inputMatrix[i,nextNaCol:(nextValueCol-1)] <- inputMatrix[i,(nextNaCol-1)]
                        } else {
                            # If there is no other value element then fill up all NA elements up to the max number supplied
                            # with the last known value unless it's close to the end of the row then just fill up to the end.
                            inputMatrix[i,nextNaCol:min(nextNaCol+fillGapMax,cntCol)] <- inputMatrix[i,(nextNaCol-1)]
                            nextValueCol <- cntCol
                        }
                    }
                }
            }
        }
    } else {
        for (i in 1:cntRow) {
            if (is.na(inputMatrix[i,ncol(inputMatrix)])) {
                tempRow <- inputMatrix[i,max(1,length(inputMatrix[i,])-fillGapMax):length(inputMatrix[i,])]
                if (length(tempRow[!is.na(tempRow)])>0) {
                    lastNonNaLocation <- (length(tempRow):1)[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])]
                    inputMatrix[i,(ncol(inputMatrix)-lastNonNaLocation+2):ncol(inputMatrix)] <- tempRow[!is.na(tempRow)][length(tempRow[!is.na(tempRow)])]
                }
            }
        }
    }

    return(inputMatrix)
}

I'm then calling it with something like:

> fixedMatrix1 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=12,forwardLooking=TRUE)
> print(fixedMatrix1)
     2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_b         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_c         NA         NA         NA         NA         NA         NA         NA         NA       4.72
ID_d      29.98      29.98      29.98      29.98      29.98      29.98      29.98      29.98      29.98
ID_e      66.89      66.89      66.89      66.89      66.89      66.89      66.89      50.65      50.65
ID_f         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_g     -12.78     -12.78     -12.78     -12.78     -12.78     -12.78     -10.72     -10.72     -10.72
ID_h     -11.65     -11.65     -11.65     -11.65     -11.65     -11.65     -11.65     -11.65     -38.61
ID_i         NA         NA         NA         NA         NA         NA         NA         NA      45.30
ID_j       4.03       4.03       4.76       4.76       4.76       4.39       4.39       4.39       4.39

or

> fixedMatrix2 <- GetMatrixWithBlanksFilled(testMatrix,fillGapMax=1,forwardLooking=FALSE)
> print(fixedMatrix2)
     2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_b         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_c         NA         NA         NA         NA         NA         NA         NA         NA       4.72
ID_d      29.98      29.98      29.98         NA      29.98      29.98      29.98      29.98      29.98
ID_e      66.89      66.89      66.89      66.89      66.89      66.89      66.89      50.65      50.65
ID_f         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_g     -12.78     -12.78     -12.78     -12.78     -12.78     -12.78     -10.72     -10.72     -10.72
ID_h     -11.65     -11.65         NA         NA         NA         NA     -11.65     -11.65     -38.61
ID_i         NA         NA         NA         NA         NA         NA         NA         NA      45.30
ID_j       4.03         NA       4.76       4.76       4.76       4.39       4.39       4.39       4.39

This example runs quickly but is there any way to make it faster for large matrices?

> n <- 38
> m <- 5000
> bigM <- matrix(rep(testMatrix,n*m),m*nrow(testMatrix),n*ncol(testMatrix),FALSE)
> system.time(output <- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE))
   user  system elapsed 
  86.47    0.06   87.24

This dummy one has a lot of NA only rows and completely filled ones but the normal ones can take about 15-20 min.

UPDATE

Regarding Charles' comment about na.locf not completely mirroring the logic of the above: Below is a simplified version of how the final function is excluding checks for inputs etc:

FillGaps <- function( dataMatrix, fillGapMax ) {

    require("zoo")

    numRow <- nrow(dataMatrix) 
    numCol <- ncol(dataMatrix) 

    iteration <- (numCol-fillGapMax)

    if(length(iteration)>0) {
        for (i in iteration:1) {
            tempMatrix <- dataMatrix[,i:(i+fillGapMax),drop=FALSE]
            tempMatrix <- t(zoo::na.locf(t(tempMatrix), na.rm=FALSE, maxgap=fillGapMax))
            dataMatrix[,i:(i+fillGapMax)] <- tempMatrix
        }
    }

    return(dataMatrix)
}
Hansi
  • 2,176
  • 1
  • 13
  • 19
  • This seems like a case when you need native code accelerator -- if this method has a name, try finding it in CRAN; if it is not there or not, you should implement it in C or Fortran and run from R (although it is easy -- see R-ext). Or use Rcpp for even easier conversion. – mbq Jun 16 '11 at 10:44

1 Answers1

6

I might be wrong but I think this is implemented in the zoo package: use the na.locf function.

With your given example matrix, first we should transpose it, and after calling the na function we 'retranspose' the result matrix. E.g.:

> t(na.locf(t(testMatrix), na.rm=FALSE, maxgap=12))
     2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_b         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_c         NA         NA         NA         NA         NA         NA         NA         NA       4.72
ID_d      29.98      29.98      29.98      29.98      29.98      29.98      29.98      29.98      29.98
ID_e      66.89      66.89      66.89      66.89      66.89      66.89      66.89      50.65      50.65
ID_f         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_g     -12.78     -12.78     -12.78     -12.78     -12.78     -12.78     -10.72     -10.72     -10.72
ID_h     -11.65     -11.65     -11.65     -11.65     -11.65     -11.65     -11.65     -11.65     -38.61
ID_i         NA         NA         NA         NA         NA         NA         NA         NA      45.30
ID_j       4.03       4.03       4.76       4.76       4.76       4.39       4.39       4.39       4.39

And with small maxgap:

> t(na.locf(t(testMatrix), na.rm=FALSE, maxgap=0))
     2010-09-30 2010-10-31 2010-11-30 2010-12-31 2011-01-31 2011-02-28 2011-03-31 2011-04-30 2011-05-31
ID_a         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_b         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_c         NA         NA         NA         NA         NA         NA         NA         NA       4.72
ID_d      29.98      29.98      29.98         NA      29.98      29.98      29.98      29.98         NA
ID_e      66.89      66.89      66.89      66.89      66.89      66.89      66.89      50.65      50.65
ID_f         NA         NA         NA         NA         NA         NA         NA         NA         NA
ID_g     -12.78     -12.78     -12.78     -12.78     -12.78     -12.78     -10.72     -10.72     -10.72
ID_h     -11.65     -11.65         NA         NA         NA         NA     -11.65     -11.65     -38.61
ID_i         NA         NA         NA         NA         NA         NA         NA         NA      45.30
ID_j       4.03         NA       4.76       4.76       4.76       4.39       4.39       4.39         NA

The performance gained using na.locf could be seen:

>  system.time(output <- GetMatrixWithBlanksFilled(bigM,fillGapMax=12,forwardLooking=TRUE))
   user  system elapsed 
 79.238   0.540  80.398 
> system.time(output <- t(na.locf(t(bigM), na.rm=FALSE, maxgap=12)))
   user  system elapsed 
 17.129   0.267  17.513 
daroczig
  • 26,353
  • 6
  • 84
  • 122
  • Thanks for that, I'd forgotten about na.locf despite having used it a few times before. Even after writing this function a while back. Works great now. Might be interesting to see how a rcpp version would fair against na.locf but I assume not much better. – Hansi Jun 16 '11 at 16:05
  • Note that `na.locf`'s maxgap has a slightly different interpretation from my understanding of your fillGapMax. na.locf won't fill a gap at all if its length is beyond maxgap, whereas I think your fillGapMax is the maximum amount of a gap that will be filled. – Charles Jun 16 '11 at 18:25
  • Yes Charles, I noticed that. It stills serves the main purpose if I set maxgap to a high enough number to link the time series. Although a few series may drop off by switching methods it's fine since at least at the end it will fill the end if the maxgap is further out than the NA's from the last value. And the speed improvement is more important than attempts at keeping time series viable. – Hansi Jun 16 '11 at 18:58