6

I have about 90 variables stored in data[2-90]. I suspect about 4 of them will have a parabola-like correlation with data[1]. I want to identify which ones have the correlation. Is there an easy and quick way to do this?

I have tried building a model like this (which I could do in a loop for each variable i = 2:90):

y <- data$AvgRating
x <- data$Hamming.distance
x2 <- x^2

quadratic.model = lm(y ~ x + x2)

Then I look at the R^2 and the coefficients to get an idea of the correlation. Is there a better way of doing this?
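That loop might look like this (a sketch with simulated stand-in data, since the real data.frame isn't shown; the idea is that the columns with the highest quadratic R^2 are the parabola-like candidates):

```r
set.seed(1)
# Simulated stand-in for the real data: column 1 is the response,
# column 2 has a parabolic relationship with it, column 3 is pure noise
x1 <- runif(100, -2, 2)
data <- data.frame(AvgRating = x1^2 + rnorm(100, sd = 0.1),
                   Hamming.distance = x1,
                   Noise = rnorm(100))

# Fit y ~ x + x^2 for every candidate column and record the R^2
y <- data[[1]]
r2 <- sapply(2:ncol(data), function(i) {
  x <- data[[i]]
  summary(lm(y ~ x + I(x^2)))$r.squared
})
names(r2) <- names(data)[-1]
sort(r2, decreasing = TRUE)  # highest values = strongest quadratic fits
```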

Maybe R could build a regression model with the 90 variables and choose the significant ones itself? Would that be in any way possible? I can do this in JMP for linear regression, but I'm not sure I could do non-linear regression in R for all the variables at once. That's why I was manually trying to see which ones are correlated in advance. It would be helpful if there was a function for that.

dorien
  • What is the structure of `data`? Is it a list of vectors? are all the vectors the same length? – Keith Hughitt Aug 01 '16 at 10:06
  • They are all the same length. I read them in with data = read.csv("file", header = TRUE). I think it had to do with the headers, I changed the question to reflect the working code. – dorien Aug 01 '16 at 10:12
  • 1
    If `data` is a `data.frame` then `data[1]` gives you a one column `data.frame` while `lm` expects a vector. Use `data[[1]]` to get the vector. – snaut Aug 01 '16 at 10:14
  • Please define what you understand as "correlation". The Spearman correlation coefficient tests for monotonic relationships. – Roland Aug 01 '16 at 10:16
  • Indeed, if you can just combine all of the variables into a single matrix, then you can get all of the pairwise spearman correlations using `cor(dat, method='spearman')`. – Keith Hughitt Aug 01 '16 at 10:18
  • The problem is that they will have a non-monotonic relationship (so parabola like). I was wondering if I could capture that type of correlation in some way. The goal in the end is to find about 4 variables which are significant to build a non-linear lm model. – dorien Aug 01 '16 at 10:21
  • "The goal in the end is to find about 4 variables which are significant to build a non-linear lm model." Then you are not approaching this in a good way. – Roland Aug 01 '16 at 10:23
  • @Roland. I was wondering if there is a function to see an overview of the highly correlated (non-linear) variables, in order to be more informed when building an lm. If there is a better way to approach this I would love to know... – dorien Aug 01 '16 at 10:26
  • Thanks snaut, that really helps to make a loop :) – dorien Aug 01 '16 at 10:32
  • Is there a way to do stepwise regression with non-linear formulas perhaps? – dorien Aug 01 '16 at 10:33
  • 1
    I don't know why you want to model this, but if the relationships are not linear a Generalized Additive Model is probably preferable. The implementation in package mgcv can remove variables. – Roland Aug 01 '16 at 10:33
  • If you want to build ``lm`` with a quadratic term in x, you can use ``lm(y ~ x + I(x)^2)`` – Phann Aug 01 '16 at 11:13
  • Thanks Phann, but would that only be for 1 variable x, or all 90? Basically, when I build it for 90 I will lose degrees of freedom, so I want to see in advance which ones would potentially be correlated – dorien Aug 01 '16 at 11:14
  • Maybe applying it to every column would help? ``lapply(df, function(x) lm(y ~ x + I(x^2)))``. Note that it is ``I(x^2)``, not ``I(x)^2`` as in my comment before. With ``sapply(df, function(x) lm(y ~ x + I(x^2))[[1]][3])`` or similar you could get the important parameters of the model. – Phann Aug 01 '16 at 11:19

3 Answers

5

You can use the nlcor package in R. This package finds the nonlinear correlation between two data vectors. There are other approaches to estimating a nonlinear correlation, such as the infotheo package; however, a nonlinear relationship between two variables can take any shape.

nlcor is robust to most nonlinear shapes and works well in different scenarios.

At a high level, nlcor works by adaptively segmenting the data into linearly correlated segments. The segment correlations are aggregated to yield the nonlinear correlation. The output is a number between 0 and 1, with values close to 1 meaning high correlation. Unlike a Pearson correlation, negative values are not returned, because a sign has no meaning for nonlinear relationships.

More details about this package here

To install nlcor, follow these steps:

install.packages("devtools") 
library(devtools)
install_github("ProcessMiner/nlcor")
library(nlcor)

After you install it,

# Implementation 
x <- seq(0,3*pi,length.out=100)
y <- sin(x)
plot(x,y,type="l")

sin(x) plot

# linear correlation is small
cor(x,y)
# [1] 6.488616e-17
# nonlinear correlation is more representative
nlcor(x,y, plt = T)
# $cor.estimate
# [1] 0.9774
# $adjusted.p.value
# [1] 1.586302e-09
# $cor.plot

using nlcor for sin(x)

As shown in the example, the linear correlation was close to zero, although there was a clear relationship between the variables that nlcor could detect.

Note: The order of x and y inside the nlcor is important. nlcor(x,y) is different from nlcor(y,x). The x and y here represent 'independent' and 'dependent' variables, respectively.
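To apply this to the original question's setup, you could loop nlcor over the candidate columns (a sketch, assuming nlcor is installed as above and `data` is the question's data.frame with the response in column 1; `scores` is a hypothetical name):

```r
# Sketch: rank the candidate columns by nonlinear correlation with data[[1]]
library(nlcor)
scores <- sapply(2:ncol(data), function(i) {
  # x = candidate (independent), y = response (dependent)
  nlcor(data[[i]], data[[1]])$cor.estimate
})
names(scores) <- names(data)[-1]
head(sort(scores, decreasing = TRUE), 4)  # the four strongest candidates
```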

vahab najari
2

Fitting a generalized additive model will help you identify curvature in the relationships between the explanatory variables. Read the example on page 22 here.
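A sketch of that idea with mgcv, on simulated stand-in data (the real data isn't shown): with `select = TRUE`, smooth terms can be penalized down to zero, which effectively drops uninformative variables.

```r
library(mgcv)
set.seed(1)
# Simulated stand-in: response y, one parabolic predictor (v1), two noise columns
x1 <- runif(200, -2, 2)
d <- data.frame(y  = x1^2 + rnorm(200, sd = 0.2),
                v1 = x1, v2 = rnorm(200), v3 = rnorm(200))

# Build y ~ s(v1) + s(v2) + s(v3) programmatically, as you would for 90 columns
rhs <- paste0("s(", names(d)[-1], ")", collapse = " + ")
fit <- gam(as.formula(paste("y ~", rhs)), data = d, select = TRUE)
summary(fit)  # shrunk-away terms show near-zero edf and large p-values
```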

George Dontas
  • Thank you. I tried this using the gam function. With too many variables I get an error about too few degrees of freedom though. So I'm thinking I should do this per variable first to see which ones are most suited. Or am I missing a function with which gam can identify the variables itself? – dorien Aug 01 '16 at 11:50
1

Another option would be to compute the mutual information score between each pair of variables. For example, using the mutinformation function from the infotheo package, you could do:

set.seed(1)

library(infotheo)

# correlated vars (x & y correlated, z is noise)
x <- seq(-10,10, by=0.5)
y <- x^2
z <- rnorm(length(x))

# list of vectors
raw_dat <- list(x, y, z)


# combine into a matrix and discretize for mutual information
dat <- matrix(unlist(raw_dat), ncol=length(raw_dat))
dat <- discretize(dat)

mutinformation(dat)

Result:

|   |        V1|        V2|        V3|
|:--|---------:|---------:|---------:|
|V1 | 1.0980124| 0.4809822| 0.0553146|
|V2 | 0.4809822| 1.0943907| 0.0413265|
|V3 | 0.0553146| 0.0413265| 1.0980124|

By default, mutinformation() computes the discrete empirical mutual information score between two or more variables. The discretize() function is needed when you are working with continuous data, to transform the data into discrete values.

This might be helpful at least as a first stab for looking for nonlinear relationships between variables, such as that described above.

Keith Hughitt
  • Can I use this on an arbitrary data set, i.e. not necessarily stationary? thanks – python novice Jan 24 '17 at 04:14
  • Hi @pythonnovice, I haven't worked with that type of data before, so I can't really say for sure. Probably the easiest thing to do would be to simulate some simple non-stationary data and try it out. – Keith Hughitt Jan 24 '17 at 13:23