I've read many times that the derivative of a function $f(x)$ for a certain $x$ is the best linear approximation of the function for values near $x$.

I always thought it was meant in a hand-waving approximate way, but I've recently read that:

"Some people call the derivative the “best linear approximator” because of how accurate this approximation is for $x$ near $0$ (as seen in the picture below). In fact, the derivative actually is the “best” in this sense – you can’t do better." (from http://davidlowryduda.com/?p=1520, where $0$ is a special case in the context of Taylor Series).

This seems to make it clear that the idea of "best linear approximation" is meant in a literal, mathematically rigorous way.

I'm confused because I believe that for a differentiable function, no matter how small you make the interval $(x-\epsilon, x+\epsilon)$ around $x$, for any point $a$ in that interval there will always be a line through $(x, f(x))$ that approximates $f(a)$ at least as well as the tangent does: either exactly as well (in case the function is actually linear over that interval), or better (the secant line through $(x, f(x))$ and $(a, f(a))$, as well as any line between that secant and the tangent at $x$).

What am I missing?

jeremy radcliff
    You're right that, for any *finite* interval $[x-\epsilon,x+\epsilon]$, the "best linear approximation" may be different from the line with slope $f'(x)$. The idea is that as $\epsilon\to0$, the limit of the best linear approximation is given by $f'(x)$. –  May 13 '16 at 21:11
    Try graphing the function MINUS the tangent line. You'll usually get a parabola with vertex at the point of tangency (thus, you'll get a curve that approaches the $x$-axis with a zero angle in the limit), and sometimes a cubic or something that is even flatter near the point of tangency. Now try graphing the function MINUS a line with nearly the same slope (and going through the same point as the tangent line does), and you'll see that you'll get a line that approaches the $x$-axis with a nonzero slope in the limit, which means that the error is at the level of a nonzero slope. – Dave L. Renfro May 13 '16 at 21:20
    For more details, see my [21 May 2009 ap-calculus post](http://mathforum.org/kb/message.jspa?messageID=6720467) archived at Math Forum. – Dave L. Renfro May 13 '16 at 21:24
  • Your quote says "the derivative is the best linear approximation **in this sense**" (emphasis mine). That implies that some specific "sense" had already been given. What was that? – user247327 May 13 '16 at 23:06
    http://math.stackexchange.com/questions/1783140/proving-fracddxx2-2x-by-definition/1783194#1783194 –  May 14 '16 at 00:01
    By the title question's logic, then the second derivative is the best *circular* approximation near a point. – J. M. ain't a mathematician May 15 '16 at 02:48
    I think you are missing that it is the best linear approximation to $f(x)$ near $x$ **which passes through the point** $(x, f(x))$. Without that restriction, there are certainly "better" approximations using well-established measures of "better" - for example a least-squares or minimax approximation in a finite interval around $x$. And those approximations might *not* converge to the derivative - inventing your own counterexamples is an instructive exercise. – alephzero May 16 '16 at 02:38

11 Answers


As some people on this site might be aware I don't always take downvotes well. So here's my attempt to provide more context to my answer for whoever decided to downvote.

Note that I will confine my discussion to functions $f: D\subseteq \Bbb R \to \Bbb R$ and to ideas that should be simple enough for anyone who's taken a course in scalar calculus to understand. Let me know if I haven't succeeded in some way.

First, it'll be convenient for us to define a new notation. It's called "little oh" notation.

Definition: A function $f$ is called little oh of $g$ as $x\to a$, denoted $f\in o(g)$ as $x\to a$, if

$$\lim_{x\to a}\frac {f(x)}{g(x)}=0$$

Intuitively this means that $f(x)\to 0$ as $x\to a$ "faster" than $g$ does.

Here are some examples:

  • $x\in o(1)$ as $x\to 0$
  • $x^2 \in o(x)$ as $x\to 0$
  • $x\in o(x^2)$ as $x\to \infty$
  • $x-\sin(x)\in o(x)$ as $x\to 0$
  • $x-\sin(x)\in o(x^2)$ as $x\to 0$
  • $x-\sin(x)\not\in o(x^3)$ as $x\to 0$
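These limits can be sanity-checked numerically (a rough check, not a proof; the sample point $x = 10^{-3}$ is an arbitrary choice):

```python
import math

# Ratios from the examples above, evaluated at a small x. A tiny ratio is
# consistent with little-oh; (x - sin x)/x^3 instead settles near 1/6,
# consistent with x - sin(x) NOT being o(x^3).
x = 1e-3
r1 = (x - math.sin(x)) / x       # o(x): tiny
r2 = (x - math.sin(x)) / x**2    # o(x^2): tiny
r3 = (x - math.sin(x)) / x**3    # not o(x^3): about 1/6
print(r1, r2, r3)
```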

Now what is an affine approximation? (Note: I prefer to call it affine rather than linear -- if you've taken linear algebra then you'll know why.) It is simply a function $T(x) = A + Bx$ that approximates the function in question.

Intuitively it should be clear which affine function should best approximate the function $f$ very near $a$. It should be $$L(x) = f(a) + f'(a)(x-a).$$ Why? Well consider that any affine function really only carries two pieces of information: slope and some point on the line. The function $L$ as I've defined it has the properties $L(a)=f(a)$ and $L'(a)=f'(a)$. Thus $L$ is the unique line which passes through the point $(a,f(a))$ and has the slope $f'(a)$.

But we can be a little more rigorous. Below I give a lemma and a theorem that tell us that $L(x) = f(a) + f'(a)(x-a)$ is the best affine approximation of the function $f$ at $a$.

Lemma: If a differentiable function $f$ can be written, for all $x$ in some neighborhood of $a$, as $$f(x) = A + B\cdot(x-a) + R(x-a)$$ where $A, B$ are constants and $R\in o(x-a)$, then $A=f(a)$ and $B=f'(a)$.

Proof: First notice that because $f$, $A$, and $B\cdot(x-a)$ are continuous at $x=a$, $R$ must be too. Then setting $x=a$ we immediately see that $f(a)=A$.

Then, rearranging the equation we get (for all $x\ne a$)

$$\frac{f(x)-f(a)}{x-a} = \frac{f(x)-A}{x-a} = \frac{B\cdot (x-a)+R(x-a)}{x-a} = B + \frac{R(x-a)}{x-a}$$

Then taking the limit as $x\to a$ we see that $B=f'(a)$. $\ \ \ \square$

Theorem: A function $f$ is differentiable at $a$ iff, for all $x$ in some neighborhood of $a$, $f(x)$ can be written as $$f(x) = f(a) + B\cdot(x-a) + R(x-a)$$ where $B \in \Bbb R$ and $R\in o(x-a)$.

Proof: "$\implies$": If $f$ is differentiable then $f'(a) = \lim_{x\to a} \frac{f(x)-f(a)}{x-a}$ exists. This can alternatively be written $$f'(a) = \frac{f(x)-f(a)}{x-a} + r(x-a)$$ where the "remainder function" $r$ has the property $\lim_{x \to a} r(x-a)=0$. Rearranging this equation we get $$f(x) = f(a) + f'(a)(x-a) -r(x-a)(x-a).$$ Let $R(x-a):= -r(x-a)(x-a)$. Then clearly $R\in o(x-a)$ (confirm this for yourself). So $$f(x) = f(a) + f'(a)(x-a) + R(x-a)$$ as required.

"$\impliedby$": Simple rearrangement of this equation yields

$$B + \frac{R(x-a)}{x-a}= \frac{f(x)-f(a)}{x-a}.$$ The limit as $x\to a$ of the LHS exists and thus the limit also exists for the RHS. This implies $f$ is differentiable by the standard definition of differentiability. $\ \ \ \square$

Taken together, the above lemma and theorem tell us not only that $L(x) = f(a) + f'(a)(x-a)$ is the only affine function whose remainder tends to $0$ as $x\to a$ faster than $x-a$ itself (this is the sense in which this approximation is the best), but also that we can define the concept of differentiability by the existence of this best affine approximation.
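A quick numerical illustration of the remainder condition ($f=\sin$ and $a=0.5$ are arbitrary choices): the tangent's remainder is $o(x-a)$, while any other slope leaves a remainder of order $x-a$.

```python
import math

# R(h) = f(a+h) - [f(a) + slope*h] for a line through (a, f(a)).
# For the tangent slope f'(a), R(h)/h -> 0 (so R is o(h)); for any other
# slope B, R(h)/h -> f'(a) - B, which is nonzero.
a = 0.5

def remainder(slope, h):
    return math.sin(a + h) - (math.sin(a) + slope * h)

for h in (1e-1, 1e-2, 1e-3, 1e-4):
    print(h,
          remainder(math.cos(a), h) / h,         # tends to 0
          remainder(math.cos(a) + 0.1, h) / h)   # tends to -0.1
```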

    I had always assumed that the meaning of "best" was that it produces the least deviation from the function when points are chosen close to $a$. That is, $\Big| \int_{a-\epsilon}^{a+\epsilon}(f(x)-g(x))dx\Big|$ is at a minimum for some sufficiently small $\epsilon$ if $g(x)=L(x)$. – Addem May 13 '16 at 21:28
  • **FWIW** "The derivative of a function is the best linear approximation to the function near a point" is definition (6) in Thurston's [**On Proof and Progress in Mathematics**](http://arxiv.org/pdf/math/9404236.pdf) (p. 3). (Definition (37) on the next page is a classic...) – Benjamin Dickman May 14 '16 at 07:48
  • I suppose you mean $R \in \mathcal{O}((x-a)^2)$ there? – leftaroundabout May 14 '16 at 15:53
  • @leftaroundabout No. That would imply that $f$ is twice differentiable. Little oh of $x-a$ is a less strict assumption. –  May 14 '16 at 15:55
  • Ok, true, this appears to be sufficient. But the order of quantifiers still seems funny... for each $x$ there exists _a function of $x$_? – leftaroundabout May 14 '16 at 16:06
    Now I'm tempted to downvote all of your answers XD – Oriol May 14 '16 at 23:17
    Remove the first paragraph, please. It has nothing to do with the answer and doesn't belong there. – Polygnome May 15 '16 at 18:07
    That paragraph is just meant to be a little bit of humorous self-deprecation. Get over it. ;P –  May 15 '16 at 18:21
    I only learned of Bye_World's fantastic post on linear/geometric algebra *because* of the first paragraph; definitely a keeper (and, funnily enough, I was the last person to edit the linked question. Small e-world!). – pjs36 May 15 '16 at 18:53
  • Hey, I'm late to the game -- but could you extend this logic to prove in the multivariate case that the Jacobian is the unique linear map (if one exists, which implies differentiability) that approximates f(x, y) in a way such that the error tends to zero faster than (dx, dy)? – mrmagicfluffyman Mar 14 '22 at 13:29

There is a sense in which the derivative is the best linear approximation. You just have to define "best" approximation in a proper way, taking into account that the derivative is a very local property. In particular, suppose we are trying to approximate $f$ at $x_0$. Then, we make the following definition:

A function $g$ is at least as good of an approximation as $h$ if there is some $\varepsilon>0$ such that for any $x$ with $|x-x_0|<\varepsilon$ we have that $|g(x)-f(x)|\leq |h(x)-f(x)|$.

This is to say that, when we compare two functions, we only look at arbitrarily small neighborhoods of the point at which we are approximating. This defeats your strategy - if you take the tangent line and compare it to a secant line passing through $(a,f(a))$, this approximation will exclude $a$ from consideration by making $\varepsilon$ small enough. Essentially, the important thing is that you have to fix $\varepsilon$ after you fix the two functions which you want to compare. This is only a partial order (well, and not quite that), so sometimes there is no best approximation.

However, we have two theorems:

  • $f$ is differentiable at $x_0$ if and only if there is a linear function $g$ which is at least as good of an approximation as any other linear $h$.

  • If $f$ is differentiable at $x_0$, then $g(x)=f(x_0)+(x-x_0)f'(x_0)$ is the best linear approximation of $f$.

meaning this definition is equivalent to the usual one. Interestingly, we get the condition of continuity at $x_0$ if we ask for the best constant approximation to exist.
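A concrete numerical instance of this definition (the function, point, and secant are arbitrary choices): the secant beats the tangent at a fixed far point, but the tangent wins on every small enough neighborhood.

```python
import numpy as np

# f(x) = x^2 at x0 = 0, with the tangent T(x) = 0 and the secant S through
# (0, 0) and (0.1, f(0.1)).
f = lambda x: x**2
T = lambda x: 0.0 * x
S = lambda x: 0.1 * x

# At x = 0.09 the secant happens to be the better line...
print(abs(f(0.09) - S(0.09)), abs(f(0.09) - T(0.09)))

# ...but on every small enough neighborhood of 0 the tangent is at least as
# good at every point, which is exactly what the definition asks for.
for eps in (0.05, 0.01, 0.001):
    xs = np.linspace(-eps, eps, 1001)
    print(eps, np.abs(f(xs) - T(xs)).max(), np.abs(f(xs) - S(xs)).max())
```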

Gabriel Romon
Milo Brandt
  • do you have references for two theorems you have mentioned? – Daniels Krimans Jul 30 '20 at 01:15
  • @DanielsKrimans Alas, I do not! They are easy enough to prove from definition, but this line of thinking seems rarely expressed in literature. – Milo Brandt Jul 30 '20 at 02:09
  • thanks for quick reply! If results are sufficiently easy, could you please look at this [question](https://math.stackexchange.com/questions/3772876/if-there-is-a-linear-function-g-which-is-at-least-as-good-of-an-approximation?noredirect=1#comment7765508_3772876)? – Daniels Krimans Jul 30 '20 at 02:32
  • @DanielsKrimans Oh, I had intended this post in the context of discussing functions of a single variable - the proofs I had in mind rely on some one-dimensional properties, but it seems plausible that the first bulleted theorem translates into multiple dimensions, but the analysis looks too tricky to do right now. (The second theorem mentioned here doesn't seem likely to hold for maps $\mathbb R^2\rightarrow\mathbb R$). I'll think some more and see if I can resolve that question. – Milo Brandt Jul 30 '20 at 03:00
  • Thanks, I really appreciate your time – Daniels Krimans Jul 30 '20 at 03:01
  • can you please also explain single variable case? – Daniels Krimans Aug 01 '20 at 04:20

I'll first give an intuitive answer, then an analytic answer.

Intuitively, the tangent goes in the same direction as the function, following it as closely as possible for a line. Any other line immediately starts to diverge from the function.


Consider the Taylor approximation at $x$: $f(x+h) =f(x)+hf'(x)+h^2f''(x)/2+... $.

This means that, for small $h$, $f(x+h) \approx f(x)+hf'(x)+h^2f''(x)/2 $, so that the error $E(x, h) =f(x+h)- (f(x)+hf'(x)) $ is about $ h^2f''(x)/2 $.

Now consider any other line through $(x, f(x))$ with slope $s$, where $s \ne f'(x)$. At $x+h$, its value is $f(x)+sh$, so its error is $e(x, h, s) =f(x+h)-(f(x)+sh) $.

Since $f(x+h)-f(x) \approx hf'(x)+h^2f''(x)/2 $,

$\begin{array}\\ e(x, h, s) &=f(x+h)-(f(x)+sh)\\ &\approx hf'(x)+h^2f''(x)/2-sh\\ &= h(f'(x)-s)+h^2f''(x)/2\\ \end{array} $

so that $\dfrac{E(x, h)}{e(x, h, s)} \approx \dfrac{h^2f''(x)/2}{h(f'(x)-s)+h^2f''(x)/2} = \dfrac{hf''(x)/2}{f'(x)-s+hf''(x)/2} $.

Since $s \ne f'(x)$, as $h \to 0$, the numerator of this ratio of errors goes to zero, while the denominator stays bounded away from zero.

Therefore the error of the tangent goes to zero faster than the error in any other line through the point.

That is why the tangent is the best linear approximation to the curve.
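The ratio of errors above can be checked numerically; a small sketch ($f = \exp$ at $x = 0$ and the slope $s = 1.2$ are arbitrary choices):

```python
import math

# Ratio of errors E/e from the argument above, for f = exp at x = 0,
# where the tangent slope is f'(0) = 1 and s = 1.2 is some other slope.
def E(h):            # error of the tangent line 1 + h
    return math.exp(h) - (1 + h)

def e(h, s=1.2):     # error of the line 1 + s*h
    return math.exp(h) - (1 + s * h)

for h in (1e-1, 1e-2, 1e-3, 1e-4):
    print(h, E(h) / e(h))   # the ratio tends to 0
```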

marty cohen
  • Great, so the key point of this proof is observing that in the denominator we have $f′(x)−s$ = 0 when we have the derivative. And because we take the ratio of the numerator (true derivative) and bottom (any other derivative), we can make an intuitive argument that when this ratio is < 1, then our error in the true derivative is smaller. But in theory we could minimize the denominator? For instance, pick s such that the denominator d is 0 – information_interchange Mar 18 '20 at 19:14
    I always need to add some absolute value signs. But, perhaps I purposely leave my answers imperfect so that others will examine them carefully and therefore understand the problem and my answer better. Or, perhaps, well, something else. Perhaps. – marty cohen Mar 18 '20 at 19:37

Think about the derivative in this sense: if you zoom in very close to any differentiable (smooth) curve, you'll see a straight line. The slope of that line is the derivative, and it is the best linear approximation for the function near that point. If any other linear approximation fit better when zoomed in that closely, then by definition its slope would be closer to the slope of the function at that point than the derivative of the function at that point. This is impossible.
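This zooming can be made quantitative (a sketch; $f = \sqrt{x}$ at $a = 1$ is an arbitrary choice): compare the curve's maximum deviation from its tangent with the width of the viewing window.

```python
import numpy as np

# On the window [a-d, a+d], measure the curve's worst deviation from the
# tangent line relative to the window width. The ratio shrinks with d,
# i.e. the zoomed-in curve looks straight.
f = np.sqrt
a, fa, slope = 1.0, 1.0, 0.5    # f(1) = 1, f'(1) = 1/2

for d in (0.5, 0.05, 0.005):
    xs = np.linspace(a - d, a + d, 1001)
    deviation = np.abs(f(xs) - (fa + slope * (xs - a))).max()
    print(d, deviation / d)     # tends to 0 as the window shrinks
```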

shai horowitz

I think you might be confusing the derivative as a linear operator vs. the derivative evaluated at a point (a linear functional). The derivative itself does not approximate anything, it just gives you a function that tells you the rate of change of the original function for every value of x in the domain. Now, when you evaluate the derivative at a single point $x=a$, you are still one step removed from an approximation of your original function $f$ in the small neighborhood of $a$. This is because evaluating the derivative only gives you the rate of change for the function in the small neighborhood of $a$. Then you have to perform an affine transformation (i.e. a translation) of that value to arrive at your approximation.

So, when you think derivative, think of the linear operator $D:C^{k} \to C^{k-1}$ given by

$$D[f] = f'$$
for some $f \in C^{k}$ and then for the derivative of a function evaluated at a specific point, think about a linear functional $E:C^{k} \to \mathbb{R}$ given by $$E[f']_{a}=f'(a)$$
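A toy sketch of this operator/functional distinction (the central-difference derivative below is a numerical stand-in for $D$, not the real operator):

```python
# D maps a function to a function (here approximated by central differences),
# while evaluation at a point maps a function to a number.
def D(f, h=1e-6):
    """Operator view: return (an approximation of) the derivative function."""
    return lambda x: (f(x + h) - f(x - h)) / (2 * h)

def evaluate_at(a):
    """Functional view: return the map g -> g(a)."""
    return lambda g: g(a)

f = lambda x: x**3
fprime = D(f)            # still a function, approximately x -> 3x^2
E = evaluate_at(2.0)
print(E(fprime))         # a number, approximately f'(2) = 12
```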


This depends a lot on how we measure error. So we could turn the question around and ask: for what definition of error will a first-order Taylor approximation give the least error? You have already gotten good explanations from others on this, considering what happens as we take limits close to the point. So maybe I can contribute something new. Say we want to find $p$ to minimize a norm of the difference of the functions:

$$\min_{p\ \text{affine}} \|f(x)-p(x)\|$$

However, this $\|\cdot\|$ can be defined in many ways! One popular family is the weighted $L^k$ norms:

$$ \|f(x)-p(x)\|_k = \sqrt[k]{\int_{-\infty}^\infty w(x)\left|f(x)-p(x)\right|^kdx}$$

We would get a solution close to gnasher729's answer for $f(x) = x^2$, for example, if we pick $w(x)$ to be a box function and let $k$ grow large, approximating the max-norm, which is simply the maximum absolute value on an interval.

I wonder what choices of $w(x)$ and $k$ will give us the first order Taylor approximation as the solution!

In fact in engineering, how to measure the error in a useful way can often be one of the toughest considerations.
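As an illustration of one such choice, here is a sketch of the least-squares case ($k = 2$, box weight): the best affine fit of $f(x) = x^2$ on $[-\varepsilon, \varepsilon]$ converges to the tangent line $y = 0$ as the interval shrinks.

```python
import numpy as np

# Least-squares (k = 2) affine fit of f(x) = x^2 on [-eps, eps] with a box
# weight. The continuous optimum is p(x) = eps^2/3 (slope 0), which
# converges to the tangent line y = 0 at the origin as eps -> 0.
f = lambda x: x**2
for eps in (1.0, 0.1, 0.01):
    xs = np.linspace(-eps, eps, 2001)
    slope, intercept = np.polyfit(xs, f(xs), 1)
    print(eps, slope, intercept)   # intercept ~ eps^2/3, shrinking to 0
```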

    To get the best approximation, I believe you take any $k$, and look at $w$ to be $1/2\epsilon$ between $[-\epsilon, \epsilon]$ and $0$ outside. Then perform the integral and then take $\epsilon$ to $0$. If you then minimize over all functions of the form $a_0+a_1x$ it should work. In other words, we're looking for any measure of how good the approximation is (so any $k$ is fine) as long as you restrict your attention to the limit as the interval shrinks to nothing. – Joel May 16 '16 at 17:47
  • Yes the shrinkage procedure is the central point in our job here, isn't it..? – mathreadler May 16 '16 at 19:15

Let's try to find the best fitting line to the parabola $$\text{$y = f(x) = x^2$ at the point $(1,1)$ of $f$.}$$ We require that $$f(1) = L(1).$$ So the line must look like $$L(x) = m(x-1) + 1 = mx - (m-1).$$ The difference between the two curves will be $$E(x) = f(x) - L(x) = x^2 - mx + (m - 1).$$ In order to emphasize that we are interested in the behaviour of $E(x)$ near $x = 1$ we consider the function $$E(1 + h) = (1+h)^2 - m(1+h) + (m - 1) = (2-m)h + h^2.$$

The term $(2-m)h \;$ is an $``\text{order of}\, h"\,$ error and is expressed as $O(h)$, pronounced big $O$ of $h$.

The term $h^2 \,$ is an $``\text{order of}\; h^2"$ error and is expressed as $O(h^2)$, pronounced big $O$ of $h^2$.

The basic idea is that, if $h$ is small, then $h^2$ is an order of magnitude smaller.

We see when $m \ne 2$ that the error is $O(h)$ and, if $m=2$, then the error is $O(h^2)$. It is in this sense that we say the line $L(x) = 2x - 1$ is the "best linear fit line" to $y = x^2$ at the point $(1,1)$.

Hence the best linear fit line, $y = mx + b$, to the curve $y = f(x)$ at the point $(x_0, f(x_0))$ must have these two properties:

  1. $f(x_0) = L(x_0)$
  2. $f(x_0 + h) = L(x_0 + h) + O(h^2)$

To have $L(x_0) = f(x_0)$, we need $L(x) = m(x - x_0) + f(x_0)$. If we define $m = f'(x_0)$, then we get $L(x) = f'(x_0)(x - x_0) + f(x_0)$. Hence conditions $(1.)$ and $(2.)$ can be combined into

  1. $f(x_0 + h) = f(x_0) + h f'(x_0) + O(h^2)$.

and that is the sense by which $L(x) = f'(x_0)(x - x_0) + f(x_0)$ is the best linear approximation to the curve $y = f(x)$ at the point $(x_0, f(x_0))$.
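The two error orders from the worked example can be checked numerically (a sketch; the alternative slope $m = 1.5$ is an arbitrary choice):

```python
# With L(x) = m(x-1) + 1 against f(x) = x^2, the error at x = 1 + h is
# E = (2-m)h + h^2, as derived above.
def E(m, h):
    return (1 + h)**2 - (1 + m * h)

for h in (1e-2, 1e-4):
    print(E(2.0, h) / h**2)   # m = 2: O(h^2), the ratio tends to 1
    print(E(1.5, h) / h)      # m != 2: O(h), the ratio tends to 2 - m = 0.5
```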

Steven Alexis Gregory

Take the function $f(x) = x^2$. At $x = 0$, the derivative gives you the function $g(x) = 0$. On the interval $[-a, +a]$ the constant function $g(x) = a^2/2$ gives a better approximation, with a maximum error of $a^2/2$ instead of $a^2$.

However, that approximation becomes worse on any interval smaller than $[-a/\sqrt{2}, +a/\sqrt{2}\,]$. $g(x) = 0$ will beat any other approximation on any small enough interval.
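A quick numerical check of these numbers for $f(x) = x^2$ with $a = 1$:

```python
import numpy as np

# Compare the two constant approximations of f(x) = x^2 on [-b, b]:
# g0 = 0 (the tangent value at 0) vs g1 = a^2/2, taking a = 1.
a = 1.0
f = lambda x: x**2

def max_err(c, b):
    """Worst-case error of the constant approximation c on [-b, b]."""
    xs = np.linspace(-b, b, 10001)
    return np.abs(f(xs) - c).max()

print(max_err(0.0, a), max_err(a**2 / 2, a))      # 1.0 vs 0.5 on [-a, a]
# The tangent value 0 wins once b drops below a/sqrt(2) ~ 0.707:
for b in (0.8, 0.7, 0.5, 0.1):
    print(b, max_err(0.0, b) < max_err(a**2 / 2, b))
```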


Here's a simple explanation of what's wrong with your argument. You're not understanding what is meant by "near".

The claim isn't that

for a given small interval it is the best approximation.

But this is what you are arguing --- given a $\delta>0$, it is true that you can find a better (or at least as good) approximation in $(x_0-\delta,x_0+\delta)$ with a different line. But what happens if $\delta$ shrinks? After all, what you call near, someone with a different perspective would call far. So maybe you found a good approximation on the solar system scale, but I'm a geologist, so we need to find a good one on my planetary scale (yours will fail now), but then we talk to a microbiologist and my approximation is no good now, (and of course a string theorist is next).

Really the claim is

you cannot find a better approximation near $x_0$

and here "near $x_0$" is a key part of the definition. We say approximation $A$ is better than approximation $B$ near $x_0$ if I can find a small enough $\delta$ such that for any $\epsilon<\delta$ approximation $A$ is always at least as good as $B$ in $(x_0-\epsilon, x_0+\epsilon)$.

If you take any one interval and the approximation you described, you'll find that for a small enough interval that approximation is not as good as the tangent.


Let $y=f(x)$ be a (differentiable) function that we are trying to approximate around the point $P(a,f(a))$. The simplest way to approximate its behaviour around that point is by fitting a linear function $g(x)=mx+c$ to it. We can then define 'the best linear approximation' to $f$ as the function $g$ that has the following property: $$ \lim_{x \to a}\frac{f(x)-g(x)}{x-a}=0 \, . $$ What this criterion tries to capture is the 'relative error' of $g$: if $x$ is very close to $a$, then $g(x)$ should be closer still to $f(x)$. Simple algebraic manipulation shows that the only function $g$ that satisfies this property is the tangent at $P$. Let $$ h(x)=\frac{f(x)-g(x)}{x-a} \, . $$ Then, $f(x)-g(x)=h(x)(x-a)$. This means that $$ \lim_{x \to a}\big(f(x)-g(x)\big) = \lim_{x \to a}h(x) \cdot \lim_{x \to a}(x-a)=0 \, . $$ Therefore, $$ \lim_{x \to a}f(x)=\lim_{x \to a}g(x) \, , $$ which implies $f(a)=g(a)$ since both $f$ and $g$ are differentiable and hence continuous. Unsurprisingly, the 'best linear approximation' of a function around the point $x=a$ should be exactly equal to the function at the point $x=a$. Using the point-slope form of the equation of a line, we find that $$ g(x) = m(x-a) + g(a) = m(x-a) + f(a) \, . $$ We are now tasked with proving that $m=f'(a)$. Luckily, this is not too difficult: \begin{align} & \lim_{x \to a}\frac{f(x)-g(x)}{x-a}=0 \\[4pt] \implies & \lim_{x \to a}\frac{f(x)-f(a)+f(a)-g(x)}{x-a}=0 \\[4pt] \implies & \lim_{x \to a}\frac{f(x)-f(a)}{x-a} + \lim_{x \to a}\frac{f(a)-g(x)}{x-a}=0 \\[4pt] \implies & \lim_{x \to a}\frac{f(x)-f(a)}{x-a} + \lim_{x \to a}\frac{g(a)-g(x)}{x-a}=0 \\[4pt] \implies & f'(a) - g'(a) = 0 \\[4pt] \implies & f'(a) = g'(a) \end{align} Since $g'(a)=m$, we find that $g$ must have the equation $$ g(x) = f'(a)(x-a) + f(a) \, . 
$$ But this is the equation of the tangent to $P$, and so, in this sense, the derivative gives the best linear approximation of $f(x)$ around a certain point.
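The defining limit can also be checked numerically (a sketch; $f = \exp$ and $a = 0$ are arbitrary choices): only the tangent $g(x) = 1 + x$ drives the relative error to $0$.

```python
import math

# Relative error (f(x) - g(x)) / (x - a) at a = 0 for f = exp, for the
# tangent line 1 + x versus another line 1 + 0.9x through the same point.
for x in (1e-1, 1e-2, 1e-3):
    tangent_ratio = (math.exp(x) - (1 + x)) / x        # tends to 0
    other_ratio = (math.exp(x) - (1 + 0.9 * x)) / x    # tends to 0.1
    print(x, tangent_ratio, other_ratio)
```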


First off, I'm surprised that of all the answers here, only one seems to point out an error (or at best an abuse of language): the claim that the derivative of a function $f$ approximates it. This is flat out false. That is not what the derivative does. Perhaps this is where the main confusion comes from.

But what is the derivative, what does it do? Simply put, the derivative of a real-valued function $f$ of a real variable $x$ at a point $x_0\in\operatorname{Dom}(f)$, if it exists, is a number that quantifies how that function changes with respect to the variable $x$ at that point $x_0$. You are no doubt aware of the many applications of this concept.

However, in trying to compute instantaneous rates of change, we come up against the difficulty that the only functions we can directly do this for are linear functions, given that they have the same rate of change at every point (their graphs are nonvertical straight lines in the Cartesian plane). As usual then, we seek to understand any function in terms of linear functions, cumbersome as it might initially seem. However, this method is not applicable to all possible functions, but only to those that possess a remarkable property, namely that they behave like a linear function as you zoom in closer and closer on them at some point $x_0$ in their domain (obviously, it follows that such functions must necessarily be continuous); in other words, the graphs of such functions look like linear functions as you zoom in closer and closer on them about some fixed point. If this behaviour occurs at every point $x$ where these functions are defined, we say that they are differentiable or smooth.

So for these differentiable functions, how should we understand their derivatives in terms of those of linear functions? Well, we simply define the derivative of $f$ at the point $x_0$ to be the derivative of the linear function that $f$ resembles as $x\to x_0$. It is easy to see that this linear function, if it exists, must be unique, since there obviously cannot be a linear function that behaves like $f$ better than the limiting one at $x_0$. Now this is where the idea of linear approximation comes from -- so it is a linear function that is the approximation, and not its derivative.

Once again therefore, the derivative is not an approximation at all. We say that the derivative of $f$ at $x_0$ is the derivative of the best approximating linear function to $f$ at $x_0$. So that the relationship is between two derivatives at a point, two numbers; not between a derivative at a point, a number, and some linear function. Thus at the point $x_0$, the two derivatives are identified whereas the two functions are also identified (not function/derivative or derivative/function).
