
Although I have read many posts about fitting distributions in Python, I am still confused about the usage of the floc and fscale parameters. For general background I mainly used this, this and this source.

I know that a given distribution, say with density f(x), becomes a more general distribution when the loc and scale parameters are used; the transformed density is g(x) = f((x - loc)/scale) / scale, and the corresponding CDF is G(x) = F((x - loc)/scale).
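As a quick numerical sanity check (using scipy's norm purely as an example distribution), note that the loc/scale transformation of a density carries an extra 1/scale factor:

```python
import numpy as np
from scipy.stats import norm

# pdf(x; loc, scale) = pdf_std((x - loc) / scale) / scale
x = np.linspace(-3.0, 7.0, 11)
loc, scale = 2.0, 1.5

shifted = norm.pdf(x, loc=loc, scale=scale)
manual = norm.pdf((x - loc) / scale) / scale

print(np.allclose(shifted, manual))  # True
```

For the CDF the factor disappears: cdf(x; loc, scale) = cdf_std((x - loc)/scale).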

In scipy, we have a choice. When fitting a distribution using distr.fit(x), the initial guess of the loc parameter is 0 and the initial guess of the scale parameter is 1 (so we assume that the parametrized distribution is close to the non-parametrized one). We can also force scipy to fit the 'original' distribution f(x) using distr.fit(x, floc=0, fscale=1).
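To illustrate the difference (a minimal sketch with synthetic gamma data; the true parameters are made up for the demo): parameters passed as floc/fscale are not estimated at all — they are held fixed and returned exactly as given, while only the remaining parameters are optimized.

```python
import numpy as np
from scipy.stats import gamma

# Synthetic data from a known gamma distribution (shape 2, scale 3).
rng = np.random.default_rng(0)
data = gamma.rvs(a=2.0, scale=3.0, size=500, random_state=rng)

# Unconstrained fit: shape, loc and scale are all free parameters.
a_free, loc_free, scale_free = gamma.fit(data)

# Constrained fit: floc=0 pins loc, only shape and scale are estimated.
a_fixed, loc_fixed, scale_fixed = gamma.fit(data, floc=0)

# A pinned parameter comes back exactly as given, not as an estimate.
print(loc_fixed)
```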

My question is: is there any general advice when to force scipy to fit 'original distribution' besides the 'parametrized one'?

Here is the example:

# generate some data
from scipy.stats import lognorm, fisk, gamma
from statsmodels.distributions.empirical_distribution import ECDF
import numpy as np
import matplotlib.pyplot as plt

x1 = [18. for i in range(36)]
x2 = [19. for i in range(17)]
x3 = [22. for i in range(44)]
x4 = [27. for i in range(63)]
x5 = [28.2 for i in range(8)]
x6 = [32. for i in range(104)]
x7 = [32.6 for i in range(29)]
x8 = [33. for i in range(85)]
x9 = [33.4 for i in range(27)]
x10 = [34.2 for i in range(49)]
x11 = [36. for i in range(99)]
x12 = [36.2 for i in range(35)]
x13 = [37. for i in range(98)]
x14 = [38. for i in range(25)]
x15 = [38.4 for i in range(39)]
x16 = [39. for i in range(25)]
x17 = [42. for i in range(54)]

# empirical distribution function
xp = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17
yp = ECDF(xp) 

# fit lognormal distribution with parametrization
pars1 = lognorm.fit(xp)
# fit lognormal distribution with floc = 0
pars2 = lognorm.fit(xp, floc = 0)
#plot the result
X = np.linspace(min(xp), max(xp), 10000)
plt.plot(yp.x, yp.y, 'ro')
plt.plot(X, lognorm.cdf(X, pars1[0], pars1[1], pars1[2]), 'b-')
plt.plot(X, lognorm.cdf(X, pars2[0], pars2[1], pars2[2]), 'g-')
plt.show()

[image 1: empirical CDF (red) with lognormal fits — unconstrained (blue) and floc=0 (green)]

#fit the gamma distribution
pars1 = gamma.fit(xp)
pars2 = gamma.fit(xp, floc = 0)
#plot the result
X = np.linspace(min(xp), max(xp), 10000)
plt.plot(yp.x, yp.y, 'ro')
plt.plot(X, gamma.cdf(X, pars1[0], pars1[1], pars1[2]), 'b-')
plt.plot(X, gamma.cdf(X, pars2[0], pars2[1], pars2[2]), 'g-')
plt.show()      

[image 2: empirical CDF (red) with gamma fits — unconstrained (blue) and floc=0 (green)]

As you can see, floc=0 greatly improved the fit in the lognorm case, while in the gamma case it didn't change the fit at all.

Sorry for the long demonstration; here is my question again: is there any general advice on when to specify floc=0 and fscale=1, and when to use custom starting values loc=0 and scale=1 instead?

Bobesh

1 Answer


Short answer

Provide guess-estimates for loc and scale whenever you are able to. Provide floc and fscale only when you actually need them for your subsequent use of the model; that is, when an answer with, say, a distribution mean different from 0 is simply not acceptable to you.

For example, if you model elastic force by Hooke's law F = k*x and want to find k from experimental force F and deformation x, there is no use in fitting a general linear model k*x + b; we know that zero deformation produces zero force. Any nonzero value of b may achieve a better fit, but only because it follows experimental errors more closely, which isn't the goal. So this is a situation where we want to force a certain parameter to be zero.
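That situation can be sketched numerically (synthetic data; the spring constant k = 5.0 and the noise level are made-up values for the demo). Both models recover k well here, but the intercept b in the free model is fitting nothing but measurement noise:

```python
import numpy as np

# Hypothetical experiment: deformation x, measured force F with noise,
# true spring constant k = 5.0.
rng = np.random.default_rng(1)
x = np.linspace(0.1, 2.0, 20)
F = 5.0 * x + rng.normal(0.0, 0.2, size=x.size)

# Free model F = k*x + b: the intercept b only soaks up noise.
k_free, b = np.polyfit(x, F, 1)

# Model forced through the origin, F = k*x: least squares gives
# k = sum(x*F) / sum(x*x).
k_origin = np.dot(x, F) / np.dot(x, x)

print(k_free, b, k_origin)
```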

Never use floc or fscale if you just want to improve the fit; use loc and scale instead.

Explanation

Fitting a distribution to data is a multivariable optimization problem. Such problems are difficult, and solvers frequently fail when the starting point is far from the optimum. If floc gives a better result than the unconstrained fit, that only means the unconstrained fit failed.

To improve the outcome, you should provide tentative loc and scale parameters whenever you are able to come up with something reasonable.

In your lognormal example, you compare not giving any hint to imposing the restriction floc=0. But the best strategy is just to give a hint with loc=0:

pars1 = lognorm.fit(xp, loc=0)

The resulting blue curve is better than the green one with floc=0.

[lognormal: empirical CDF with fits — loc=0 as a hint (blue) vs floc=0 (green)]

Of course it is better. loc=0 points the optimizer to a pretty good place to start, and lets it work from there. floc=0 points the optimizer to a pretty good place to start, but then tells it to stay there.

  • I thought that if you don't specify loc, then loc = 0 is automatically given to the fit function, isn't it? – Bobesh Mar 27 '18 at 19:57
  • Yep, on its official site the fit function is described as fit(data, s, loc = 0, scale = 1), so why is your result different? – Bobesh Mar 27 '18 at 20:22
  • That's what the documentation says, but the source reads differently. The class `lognorm_gen` inherits the method `fit` from `rv_continuous`. The method picks default starting parameters from `_fitstart`, which by default (unless overridden in a subclass) are 1.0 for every parameter. [Ref](https://github.com/scipy/scipy/blob/v1.0.0/scipy/stats/_distn_infrastructure.py#L2023). lognorm does not override this default. –  Mar 27 '18 at 21:09
  • Indeed, the bad fit that you see is also produced by `pars1 = lognorm.fit(xp, 1.0)`, whereas `pars1 = lognorm.fit(xp, 0.0)` gives a good fit. The handling of the shape argument and the loc/scale arguments in lognorm is complicated, because there is a redundancy ([ref](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.lognorm.html#scipy.stats.lognorm)). It seems that the mere fact that `loc` was provided yields a different outcome than if it were not, given the way the logic of the `fit` method goes. –  Mar 27 '18 at 21:23