Questions tagged [correlation]

For questions regarding interdependence of variable quantities.

Correlation is a measure of the relationship between two or more mathematical variables or measured data values. It refers to any of a broad class of statistical relationships involving dependence, where dependence means any situation in which random variables do not satisfy probabilistic independence. Though correlation can refer to any departure of two or more random variables from independence, technically it refers to any of several more specialized types of relationship between mean values. There are several correlation coefficients, often denoted ρ or r, measuring the degree of correlation.

The most common of these is the Pearson correlation coefficient, which is commonly called simply "the correlation coefficient". It is obtained by dividing the covariance of the two variables by the product of their standard deviations.

$$\operatorname{cor}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

This parameter is sensitive only to a linear relationship between two variables (which may exist even if one is a nonlinear function of the other). The numerator of cor(X,Y) is known as the covariance between X and Y.
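As a minimal sketch of the definition above, the coefficient can be computed directly as the covariance divided by the product of the standard deviations. The function name `pearson` and the sample data are illustrative assumptions, not part of any library; in practice a library routine such as `numpy.corrcoef` would normally be used instead.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance over the product of standard deviations.

    Hypothetical helper written for illustration only.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and sample standard deviations (n - 1 denominator).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


x = [1.0, 2.0, 3.0, 4.0, 5.0]

# Exactly linear relationship: coefficient is 1 up to rounding error.
r_linear = pearson(x, [2.0 * v + 1.0 for v in x])

# y = x^2 on positive x is nonlinear, yet has a strong linear component.
r_square = pearson(x, [v * v for v in x])

# y = x^2 on a range symmetric around zero: fully dependent, but no linear component.
x_sym = [-2.0, -1.0, 0.0, 1.0, 2.0]
r_symmetric = pearson(x_sym, [v * v for v in x_sym])

print(r_linear, r_square, r_symmetric)
```

The last case shows why a coefficient near zero does not by itself imply independence: the second variable is a deterministic function of the first, but the relationship has no linear component.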

The Pearson correlation is +1 in the case of a perfect positive (increasing) linear relationship (correlation) and −1 in the case of a perfect negative (decreasing) linear relationship (anticorrelation). The closer the coefficient is to either −1 or 1, the stronger the correlation between the variables. As it approaches zero there is less of a linear relationship, and at zero we say that the variables are uncorrelated.

Other correlation coefficients, such as Spearman's rank correlation and Kendall's tau, have been developed to be more robust than the Pearson correlation or more sensitive to nonlinear relationships. Mutual information can also be applied to measure dependence between two variables.
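One way to see the difference is to compare the Pearson coefficient with a rank-based coefficient on data that is monotone but nonlinear. The sketch below uses Spearman's rank correlation as the example (chosen here for illustration), computed as the Pearson coefficient of the ranks. The helpers `pearson`, `ranks` and `spearman` are hypothetical names, and the ranking assumes there are no tied values; a library routine such as `scipy.stats.spearmanr` handles ties properly.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 (smallest) to n (largest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = float(position)
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [math.exp(v) for v in x]  # strictly increasing, but far from linear

pearson_r = pearson(x, y)
spearman_r = spearman(x, y)

print(pearson_r, spearman_r)
```

Because `y` increases whenever `x` does, the ranks agree perfectly and the rank-based coefficient is 1, while the Pearson coefficient is noticeably below 1.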

Tag usage

Questions with this tag should be about implementation and programming problems, not about the statistical or theoretical properties of the technique. Consider whether your question might be better suited to Cross Validated, the Stack Exchange site for statistics, machine learning and data analysis.

3058 questions
149
votes
9 answers

Use .corr to get the correlation between two columns

I have the following pandas dataframe Top15: I create a column that estimates the number of citable documents per person: Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita'] Top15['Citable docs per Capita'] =…
tong zhu
  • 1,625
  • 2
  • 7
  • 6
114
votes
14 answers

List Highest Correlation Pairs from a Large Correlation Matrix in Pandas?

How do you find the top correlations in a correlation matrix with Pandas? There are many answers on how to do this with R (Show correlations as an ordered list, not as a large matrix or Efficient way to get highly correlated pairs from large data…
Kyle Brandt
  • 23,178
  • 32
  • 115
  • 158
93
votes
11 answers

Plot correlation matrix into a graph

I have a matrix with some correlation values. Now I want to plot that in a graph that looks more or less like that: How can I achieve that?
anon
89
votes
5 answers

How can I create a correlation matrix in R?

I have 92 sets of data of the same type. I want to make a correlation matrix for every possible pair, i.e. I want a 92 x 92 matrix such that element (ci, cj) is the correlation between ci and cj. How do I do that?
Swapnil 'Tux' Takle
  • 1,177
  • 2
  • 9
  • 9
58
votes
6 answers

Correlation heatmap

I want to represent correlation matrix using a heatmap. There is something called correlogram in R, but I don't think there's such a thing in Python. How can I do this? The values go from -1 to 1, for example: [[ 1. 0.00279981 0.95173379 …
Kobe-Wan Kenobi
  • 3,246
  • 2
  • 31
  • 60
56
votes
18 answers

How to calculate correlation between all columns and remove highly correlated ones using pandas?

I have a huge data set, and prior to machine learning modeling it is always suggested that you first remove highly correlated descriptors (columns). How can I calculate the column-wise correlation and remove the columns with a threshold value, say…
jax
  • 2,972
  • 4
  • 29
  • 57
51
votes
4 answers

Calculate correlation with cor(), only for numerical columns

I have a dataframe and would like to calculate the correlation (with Spearman, data is categorical and ranked) but only for a subset of columns. I tried with all, but R's cor() function only accepts numerical data (x must be numeric, says the error…
wishihadabettername
  • 12,253
  • 20
  • 58
  • 83
46
votes
3 answers

Cross-correlation (time-lag-correlation) with pandas?

I have various time series, that I want to correlate - or rather, cross-correlate - with each other, to find out at which time lag the correlation factor is the greatest. I found various questions and answers/links discussing how to do it with…
JC_CL
  • 1,568
  • 3
  • 15
  • 29
43
votes
6 answers

cor shows only NA or 1 for correlations - Why?

I'm running cor() on a data.frame with all numeric values and I'm getting this as the result: price exprice... price 1 NA exprice NA 1 ... So it's either 1 or NA for each value in the resulting table. Why are the NAs…
Dave
  • 5,055
  • 11
  • 43
  • 72
42
votes
5 answers

How to visualize correlation matrix as a schemaball in Matlab

I have 42 variables and I have calculated the correlation matrix for them in Matlab. Now I would like to visualize it with a schemaball. Does anyone have any suggestions / experiences how this could be done in Matlab? The following pictures will…
jjepsuomi
  • 3,601
  • 6
  • 37
  • 68
40
votes
3 answers

Dealing with missing values for correlations calculation

I have a huge matrix with a lot of missing values. I want to get the correlation between variables. 1. Is the solution cor(na.omit(matrix)) better than below? cor(matrix, use = "pairwise.complete.obs") I already have selected only variables…
Delphine
  • 1,023
  • 5
  • 15
  • 22
38
votes
2 answers

Correlated features and classification accuracy

I'd like to ask everyone a question about how correlated features (variables) affect the classification accuracy of machine learning algorithms. With correlated features I mean a correlation between them and not with the target class (i.e the…
35
votes
3 answers

Remove highly correlated variables

I have a huge dataframe, 5600 X 6592, and I want to remove any variables that are correlated with each other more than 0.99. I do know how to do this the long way, step by step, i.e. forming a correlation matrix, rounding the values, removing similar ones…
Error404
  • 6,329
  • 13
  • 43
  • 57
34
votes
4 answers

How to interpret the values returned by numpy.correlate and numpy.corrcoef?

I have two 1D arrays and I want to see their inter-relationships. What procedure should I use in numpy? I am using numpy.corrcoef(arrayA, arrayB) and numpy.correlate(arrayA, arrayB) and both are giving some results that I am not able to comprehend…
khan
  • 5,557
  • 11
  • 44
  • 64
30
votes
3 answers

Computing the correlation coefficient between two multi-dimensional arrays

I have two arrays that have the shapes N X T and M X T. I'd like to compute the correlation coefficient across T between every possible pair of rows n and m (from N and M, respectively). What's the fastest, most pythonic way to do this? (Looping…
dbliss
  • 8,060
  • 13
  • 43
  • 74