I'm trying really hard to do a Gaussian Mixture with sklearn, but I think I'm missing something because it definitely doesn't work.

My original data looks like this:

Genotype LogRatio  Strength
AB       0.392805  10.625016
AA       1.922468  10.765716
AB       0.22074   10.405445
BB       -0.059783 10.625016

I want to do a Gaussian Mixture with 3 components = 3 genotypes (AA|AB|BB). I know the weight of each genotype, the mean of Log Ratio for each genotype and the mean of Strength for each genotype.

wgts = [0.8,0.19,0.01]  # weight of AA,AB,BB
means = [[-0.5,9],[0.5,9],[1.5,9]] # mean(LogRatio), mean(Strength) for AA,AB,BB

I keep columns LogRatio and Strength and create a NumPy array.

import numpy as np

datas = np.array([[ 0.392805, 10.625016],
                  [ 1.922468, 10.765716],
                  [ 0.22074 , 10.405445],
                  [-0.059783,  9.798655]])

Then I tested the GaussianMixture class from sklearn.mixture in sklearn v0.18, and also tried the GMM class from sklearn v0.17 (I still don't see the difference and don't know which one to use).

from sklearn import mixture
import matplotlib.pyplot as plt

gmm = mixture.GMM(n_components=3)  # sklearn v0.17
# or
gmm = mixture.GaussianMixture(n_components=3)  # sklearn v0.18

gmm.fit(datas)

colors = ['r' if i==0 else 'b' if i==1 else 'g' for i in gmm.predict(datas)]
ax = plt.gca()
ax.scatter(datas[:,0], datas[:,1], c=colors, alpha=0.8)
plt.show()

This is what I obtain. It is a good result, but it changes on every run because the initial parameters are computed differently each time.

[image: scatter plot of datas, colored by predicted component]

I would like to initialize my parameters in the GaussianMixture or GMM function, but I don't understand how I have to format my data :(

Gabriel
Elysire

1 Answer

You can make the results reproducible by explicitly seeding the pseudo-random number generator through the random_state parameter.

Instead of:

gmm = mixture.GaussianMixture(n_components=3)

Do:

gmm = mixture.GaussianMixture(n_components=3, random_state=3)

random_state can be an int (a RandomState instance is also accepted). I've arbitrarily set it to 3, but you can choose any other integer.

When running multiple times with the same random_state, you will get the same results.
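
To check this for yourself, here is a minimal sketch. The data is synthetic (three made-up clusters standing in for the real LogRatio/Strength values), and the names X, centers, gmm_a and gmm_b are only for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (an assumption, not the asker's real data):
# three 2-D clusters of (LogRatio, Strength)-like points.
rng = np.random.RandomState(42)
centers = np.array([[-0.5, 9.0], [0.5, 9.0], [1.5, 9.0]])
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(100, 2)) for c in centers])

# Two independent fits that share the same seed.
gmm_a = GaussianMixture(n_components=3, random_state=3).fit(X)
gmm_b = GaussianMixture(n_components=3, random_state=3).fit(X)

# Compare the fitted means and the predicted component of every point.
same_means = np.allclose(gmm_a.means_, gmm_b.means_)
same_labels = np.array_equal(gmm_a.predict(X), gmm_b.predict(X))
print(same_means, same_labels)
```

Note that the seed only fixes the initialization; it does not let you choose which component ends up with which label.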

MMF
  • Yes, you are right, I get the same results with random_state, but I still want to fix my initial parameters with "weights_init" and "means_init", and I can't use my list of weights [0.8,0.19,0.01] directly, even as an array. I don't know how to format my weights and means. – Elysire Nov 08 '16 at 15:35
  • I don't understand why you can't pass your lists of `weights` and `means`. According to the documentation, it should work if passed like that. What error is thrown? – MMF Nov 08 '16 at 15:37
  • The error is "ValueError: The parameter 'weights' should be normalized, but got sum(weights) = 0.98377" – Elysire Nov 08 '16 at 15:44
  • Yes, sorry, that was not the real one in my post. I use this one: [ 0.5194072 0.38038109 0.08398024]. I checked the weights computed automatically by the GaussianMixture function, and they are not very far from mine, but not the same. – Elysire Nov 08 '16 at 16:03
  • `np.sum([ 0.5194072, 0.38038109, 0.08398024]) = 0.98376852999999997` : the sum must be `1.0`. – MMF Nov 08 '16 at 16:14
  • Right! It was a bug in my data. It works with a proper array whose sum is 1, and I can use it in my function! Thank you! – Elysire Nov 08 '16 at 16:22
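
To sum up the comment thread, here is a rough sketch of passing initial weights and means to GaussianMixture. The weights and means are the ones quoted above. Dividing by the sum is just one possible way to normalize them (an assumption, since the thread does not say how the array was fixed), and the data X is synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (an assumption): three clusters around the
# means given in the question, 100 points each.
rng = np.random.RandomState(0)
true_means = np.array([[-0.5, 9.0], [0.5, 9.0], [1.5, 9.0]])
X = np.vstack([rng.normal(loc=m, scale=0.2, size=(100, 2)) for m in true_means])

# Weights from the comments: they sum to about 0.98377, which triggers
# the ValueError, so rescale them to sum to 1.0.
raw_weights = np.array([0.5194072, 0.38038109, 0.08398024])
weights_init = raw_weights / raw_weights.sum()

# One row per component, one column per feature: shape (3, 2).
means_init = np.array([[-0.5, 9.0], [0.5, 9.0], [1.5, 9.0]])

gmm = GaussianMixture(
    n_components=3,
    weights_init=weights_init,
    means_init=means_init,
    random_state=0,
)
gmm.fit(X)
labels = gmm.predict(X)
print(gmm.weights_)
print(gmm.means_)
```

weights_init must have shape (n_components,) and means_init shape (n_components, n_features). These are only starting values, so the fitted gmm.weights_ and gmm.means_ will generally move away from them during EM.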