I've read the proof for why $\int_0^\infty P(X >x)dx=E[X]$ for nonnegative random variables (located here) and understand its mechanics, but I'm having trouble understanding the intuition behind this formula or why it should be the case at all. Does anyone have any insight on this? I bet I'm missing something obvious.

This post itself could have served as a "mother post" to duplicates. The current choice of [mother post](https://math.stackexchange.com/questions/843845) has existing duplicate links which this post doesn't. Please see the meta post on [(abstract) duplicates](https://math.meta.stackexchange.com/a/29382/356647). – Lee David Chung Lin Nov 13 '18 at 13:28
4 Answers
For the discrete case, and if $X$ is nonnegative, $E[X] = \sum_{x=0}^\infty x P(X = x)$. That means we're adding up $P(X = 0)$ zero times, $P(X = 1)$ once, $P(X = 2)$ twice, etc. This can be represented in array form, where we're adding column by column:
$$\begin{matrix} P(X=1) & P(X = 2) & P(X = 3) & P(X = 4) & P(X = 5) & \cdots \\ & P(X = 2) & P(X = 3) & P(X = 4) & P(X = 5) & \cdots \\ & & P(X = 3) & P(X = 4) & P(X = 5) & \cdots \\ & & & P(X = 4) & P(X = 5) & \cdots \\ & & & & P(X = 5) & \cdots\end{matrix}.$$
We could also add up these numbers row by row, though, and get the same result. The first row has everything but $P(X = 0)$ and so sums to $P(X > 0)$. The second row has everything but $P(X = 0)$ and $P(X = 1)$ and so sums to $P(X > 1)$. In general, the sum of row $x+1$ is $P(X > x)$, and so adding the numbers row by row gives us $\sum_{x = 0}^{\infty} P(X > x)$, which thus must also be equal to $\sum_{x=0}^\infty x P(X = x) = E[X].$
The continuous case is analogous.
In general, switching the order of summation (as in the proof the OP links to) can always be interpreted as adding row by row vs. column by column.
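The row-by-row vs. column-by-column bookkeeping above is easy to check numerically. Here is a small Python sketch (the Poisson distribution and the truncation point are my own arbitrary choices, not part of the answer) comparing $\sum_x x\,P(X=x)$ with $\sum_x P(X>x)$:

```python
import math

lam = 3.0
N = 60  # truncation point; the Poisson(3) mass beyond 60 is negligible

# Poisson pmf: p(x) = e^{-lam} * lam^x / x!
pmf = [math.exp(-lam) * lam**x / math.factorial(x) for x in range(N)]

mean_direct = sum(x * p for x, p in enumerate(pmf))  # sum_x x * P(X = x)
tail_sum = sum(sum(pmf[x + 1:]) for x in range(N))   # sum_x P(X > x)

print(mean_direct, tail_sum)  # both ≈ 3.0, the Poisson mean
```

The two sums contain exactly the same terms, grouped by column and by row respectively, so they agree up to floating-point rounding.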
A hint and a proof.
Hint: if $X=x$ with full probability, the integral is the integral of $1$ on $(0,x)$, hence the LHS and the RHS are both $x$.
Proof: apply the Tonelli-Fubini theorem to the function $(\omega,x)\mapsto\mathbf 1_{X(\omega)>x}$ and to the sigma-finite measure $P\otimes\mathrm{Leb}$ on $\Omega\times\mathbb R_+$. One gets $$ \int_\Omega\int_{\mathbb R_+}\mathbf 1_{X(\omega)>x}\mathrm dx\mathrm dP(\omega)=\int_\Omega\int_0^{X(\omega)}\mathrm dx\mathrm dP(\omega)=\int_\Omega X(\omega)\mathrm dP(\omega)=E(X), $$ while, using the shorthand $A_x=\{\omega\in\Omega\mid X(\omega)>x\}$, $$ \int_{\mathbb R_+}\int_\Omega\mathbf 1_{X(\omega)>x}\mathrm dP(\omega)\mathrm dx=\int_{\mathbb R_+}\int_\Omega\mathbf 1_{\omega\in A_x}\mathrm dP(\omega)\mathrm dx=\int_{\mathbb R_+}P(A_x)\mathrm dx=\int_{\mathbb R_+}P(X>x)\mathrm dx. $$
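As a quick numerical illustration of the identity (my own addition, not part of the answer): for an exponential variable with rate $2$ we have $P(X>x)=e^{-2x}$, and the tail integral should come out to $1/2 = E[X]$. A minimal Python sketch, with the rate, step size, and integration range chosen arbitrarily:

```python
import math

rate = 2.0
dx = 1e-4
# Left Riemann sum of the tail probability P(X > x) = exp(-rate * x)
# over [0, 20]; the contribution beyond x = 20 is exp(-40), negligible.
tail_integral = sum(math.exp(-rate * k * dx) * dx for k in range(200_000))

print(tail_integral)  # ≈ 0.5 = 1/rate = E[X]
```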

I spent much time but I couldn't prove that the two-variable function $(\omega,t)\mapsto \mathbf 1_{\{X(\omega)>t\}}$ is measurable with respect to $\mathcal{F}\times\mathcal{B}$. It's only obvious that it's measurable when one of the variables is fixed. – Fardad Pouran Apr 25 '15 at 16:19

@FardadPouran Let $E=\{(\omega,t)\in\Omega\times\mathbb R_+ :X(\omega)>t\}$ and define the $\omega$ and $t$ sections of $E$ by \begin{align} E_\omega &= \{t\in\mathbb R^+ : (\omega,t)\in E\}\\ E^t &= \{\omega\in \Omega : (\omega,t)\in E\}. \end{align} The monotone class lemma implies that the maps $\omega\mapsto \lambda(E_\omega)$ and $t\mapsto \mathbb P(E^t)$ are $\mathcal F$ and $\mathcal B$ measurable, respectively. – Math1000 Jun 19 '16 at 14:29

@Math1000 $\mathcal F\times\mathcal B(E)$ is to be replaced by $(P\otimes\mathrm{Leb})(E)$. – Did Jun 19 '16 at 17:39

@Did Thanks for pointing out the typo. $$\mathbb P\times\lambda(E) = \int_\Omega \lambda(E_\omega)\,\mathsf d\mathbb P(\omega) = \int_{\mathbb R_+} \mathbb P(E^t)\,\mathsf d\lambda(t).$$ So the map $(\omega,t)\mapsto \mathsf 1_E$ is indeed $\mathcal F\otimes\mathcal B$-measurable. – Math1000 Jun 19 '16 at 17:57

@Math1000, thank you, but how did you find that $E$ is $\mathcal{F}\times\mathcal B$-measurable before writing $P\times\lambda(E)$? Indeed, we only need its measurability. – Fardad Pouran Jun 19 '16 at 20:37

"A hint and a proof" but the OP asks about intuition (how intuitive can the sigma-finite measure on the tensor product of $P$ and Leb be??). The math in this answer is unnecessarily complex, compared to the question: hence the downvote from me. – Jimmy R. Aug 18 '17 at 07:39
Since the intuition behind the result is requested, let us consider a simple case of a discrete nonnegative random variable taking on the three values $x_0 = 0$, $x_1$, and $x_2$ with probabilities $p_0$, $p_1$, and $p_2$. The cumulative distribution function (CDF) $F(x)$ is thus a staircase function $$F(x) = \begin{cases} 0, & x < 0, \\ p_0, & 0 \leq x < x_1,\\ p_0 + p_1, & x_1 \leq x < x_2,\\ 1, & x \geq x_2, \end{cases}$$ with jumps of $p_0$, $p_1$, and $p_2$ at $0$, $x_1$, and $x_2$ respectively. Note also that $$ E[X]= \sum_{i=0}^2 p_ix_i = p_1x_1 + p_2x_2. $$

Now, notice that $$\int_0^\infty P\{X > x\}\mathrm dx = \int_0^\infty [1 - F(x)]\mathrm dx$$ is the area of the region bounded by the curve $F(x)$, the vertical axis, and the line at height $1$ above the horizontal axis. Standard Riemann integration techniques say that we should divide the region into narrow vertical strips, compute the area of each, take the sum, take limits, etc. In our example, of course, all this can be bypassed since the region in question is the union of two adjoining nonoverlapping rectangles: one of base $x_1$ and height $(1-p_0)$, and the other of base $x_2 - x_1$ and height $(1-p_0-p_1)$.

BUT, suppose we divide the region under consideration into two different adjoining nonoverlapping rectangles, with the second lying above the first. The first rectangle has base $x_1$ and height $p_1$, while the second (lying above the first) has broader base $x_2$ and height $p_2$. The total area that we seek is easily seen to be $p_1x_1 + p_2x_2 = E[X]$.
Thus, for a nonnegative random variable, $E[X]$ can be interpreted as the area of the region lying above its CDF $F(x)$ and below the line at height $1$ to the right of the origin. The standard formula $$E[X] = \int_0^\infty x\,\mathrm dF(x)$$ can be thought of as computing this area by dividing it into thin horizontal strips of length $x$ and height $\mathrm dF(x)$, while $$\int_0^\infty P\{X > x\}\mathrm dx = \int_0^\infty [1 - F(x)]\mathrm dx$$ (in the Riemann integral sense) can be thought of as computing the area by dividing it into thin vertical strips.
More generally, if $X$ takes on both positive and negative values, $$E[X] = \int_0^\infty [1 - F(x)]\mathrm dx - \int_{-\infty}^0 F(x)\,\mathrm dx$$ with similar interpretations.
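The two-sided formula can be sanity-checked numerically. The Python sketch below (my own illustration; the Normal$(1,1)$ example, step size, and cutoffs are arbitrary choices) approximates both integrals with Riemann sums and recovers the mean:

```python
import math

mu, sigma = 1.0, 1.0  # hypothetical example: X ~ Normal(1, 1), so E[X] = 1

def F(x):
    """CDF of Normal(mu, sigma), expressed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

dx = 1e-3
pos = sum((1.0 - F(k * dx)) * dx for k in range(20_000))  # ~ ∫_0^∞ [1 - F(x)] dx
neg = sum(F(-k * dx) * dx for k in range(1, 20_000))      # ~ ∫_{-∞}^0 F(x) dx

print(pos - neg)  # ≈ 1.0 = E[X]
```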
Perhaps by considering this question with a concrete physical example, it will provide some intuition.
Consider a beam of length $L = 10$ (you can pick your favorite units) attached to a wall. Now, at positions $1, 2, \ldots, 9$ hang weights $w_1,w_2,\ldots,w_9$. For simplicity, let's assume $\sum_{n=1}^9 w_n = 1$.
Then the center of mass of the beam is $c = \sum_{n=1}^9 n w_n$. (The original post included an example picture here, with the weights in blue, heights proportional to weight, and the center of mass in red.)
In a probabilistic setting, our weights correspond to probabilities and $c = \mathbb E X$ where $X$ takes on the values $1,2,\ldots,9$ with probabilities $w_1,w_2,\ldots,w_9$, respectively.
Now, to explain how $c = \mathbb E X = \sum_{n=0}^9 \mathbb P(X > n) = \sum_{n = 0}^9 \sum_{k=n+1}^9 w_k$ comes about, expanding out the latter sum we have $$ c = (w_1 + \cdots + w_9) + (w_2 + \cdots + w_9) + \cdots + (w_9) \>, $$ so, $w_1$ appears once, $w_2$ appears twice, $w_3$ appears three times, etc. Hence $c = \sum_{n=1}^9 n w_n$.
In terms of the beam, we can think of the expression $\sum_{n=0}^9 \mathbb P(X > n)$ in the following way. Standing at zero, look out to the right and count up all the weights in front of you. Now, move one step to the right and repeat this process, adding the result to your initial sum. Continue this process until you get out to position 9, at which point there are no more weights in front of you.
The resulting sum is the center of mass, or, in probabilistic terms, the expectation $\mathbb E X$.
Extending this intuition to discrete random variables taking on noninteger values is straightforward. The extension to continuous variables is also not difficult.
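The walk-and-count procedure can be sketched in a few lines of Python (the particular weights are my own arbitrary symmetric choice, not from the answer), comparing it with the direct center-of-mass sum:

```python
# Hypothetical weights hung at positions 1..9, normalized to sum to 1.
raw = [1, 2, 3, 4, 5, 4, 3, 2, 1]
w = [r / sum(raw) for r in raw]  # w[n-1] is the weight at position n

# Direct center of mass: c = sum_n n * w_n
c_direct = sum((i + 1) * wi for i, wi in enumerate(w))

# Walk and count: standing at n = 0, 1, 2, ..., add up all the weight
# strictly ahead of you, i.e. sum_n P(X > n).
c_walk = sum(sum(w[n:]) for n in range(len(w) + 1))

print(c_direct, c_walk)  # both ≈ 5.0 for this symmetric choice
```

Each weight $w_n$ is counted once for every standing position $0, 1, \ldots, n-1$ it lies ahead of, i.e. $n$ times, which is exactly the direct sum.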