
I am trying to incorporate the prior settings of my dependent variable in my logistic regression using the glm function. The data set I am using was created to predict churn.

So far I am using the function below:

V1_log <- glm(CH1 ~ RET + ORD + LVB + REV3, data = trainingset, family = 
              binomial(link='logit'))

What I am looking for is how the weights argument works and how to include it in the function, or whether there is another way to incorporate this. The dependent variable is a nominal variable with the values 0 or 1. The data set is imbalanced in such a way that only 10% of observations have a value of 1 on the dependent variable CH1 and the other 90% have a value of 0. Therefore the weights are (0.1, 0.9).

My dataset is built up in the following manner:

Dataset preview

The independent variables vary in data type between continuous and class (categorical) variables.

    You should provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – M-- Feb 25 '20 at 16:12

2 Answers


In your dataset trainingset, create a column called weights_col that contains your weights (0.1, 0.9) and then run:

V1_log <- glm(CH1 ~ RET + ORD + LVB + REV3, data = trainingset,
              family = binomial(link = 'logit'), weights = weights_col)
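
For completeness, here is a minimal sketch of my own (not from the question) showing how that column could be built, assuming CH1 is coded 0/1 and the rare churners should carry the larger weight:

# Assumption: upweight the minority class (CH1 == 1) and downweight the rest.
# Note: non-integer weights make glm() warn about "non-integer #successes";
# the model still fits.
trainingset$weights_col <- ifelse(trainingset$CH1 == 1, 0.9, 0.1)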

Although the ratio of 1s to 0s is 1:9, it does not mean the weights should be 0.1 and 0.9. The weights decide how much emphasis you want to give each observation relative to the others.
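
To make that concrete, here is a small toy check of my own (the data frame d is purely illustrative): for a 0/1 response, giving a row an integer weight w is equivalent to entering that row w times in the data.

# Toy illustration: weight 2 on row 3 acts like duplicating row 3.
d  <- data.frame(y = c(0, 1, 0, 1), x = c(1, 2, 3, 4))
m1 <- glm(y ~ x, data = d, family = binomial, weights = c(1, 1, 2, 1))
m2 <- glm(y ~ x, data = rbind(d, d[3, ]), family = binomial)
all.equal(coef(m1), coef(m2), tolerance = 1e-6)  # the two fits agree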

And in your case, if you want to predict something, it is essential that you split your data into train and test sets and see what influence the weights have on prediction.

Below I use the Pima Indian diabetes data from the MASS package as an example, subsampling the Yes type so that the training set has a 1:9 ratio.

set.seed(111)
library(MASS)
# we sample 10 from Yes and 90 from No
idx = unlist(mapply(sample,split(1:nrow(Pima.tr),Pima.tr$type),c(90,10)))
Data = Pima.tr
trn = Data[idx,]
test = Data[-idx,]

 table(trn$type)

 No Yes 
 90  10 

Let's try fitting the regression with weight 9 if positive and 1 if negative:

library(caret)
W = 9
lvl = levels(trn$type)
#if positive we give it the defined weight, otherwise set it to 1
fit_wts = ifelse(trn$type==lvl[2],W,1)
fit = glm(type ~ .,data=trn,weights=fit_wts,family=binomial)
# we test it on the test set
pred = ifelse(predict(fit,test,type="response")>0.5,lvl[2],lvl[1])
pred = factor(pred,levels=lvl)
confusionMatrix(pred,test$type,positive=lvl[2])

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  34  26
       Yes  8  32

You can see from the above that it's doing OK, but you are missing out on 26 positives and also falsely labeling 8 negatives as positive. Let's say we try W = 3:

W = 3
lvl = levels(trn$type)
fit_wts = ifelse(trn$type==lvl[2],W,1)
fit = glm(type ~ .,data=trn,weights=fit_wts,family=binomial)
pred = ifelse(predict(fit,test,type="response")>0.5,lvl[2],lvl[1])
pred = factor(pred,levels=lvl)
confusionMatrix(pred,test$type,positive=lvl[2])

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  39  30
       Yes  3  28

Now we manage to get almost all of the positive calls correct, but we still miss out on a lot of potential "Yes" cases. The bottom line is that the code above might work, but you need to do some checks to figure out the right weight for your data.

You can also look at the other statistics provided by confusionMatrix in caret to guide your choice.
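
As a rough sketch of such a check (my own addition, reusing trn, test, and lvl from above), you could scan a few candidate weights and compare, for example, balanced accuracy on the test set:

# Scan candidate weights and report balanced accuracy on the test set.
for (W in c(1, 3, 5, 9)) {
  fit_wts <- ifelse(trn$type == lvl[2], W, 1)
  fit <- glm(type ~ ., data = trn, weights = fit_wts, family = binomial)
  pred <- factor(ifelse(predict(fit, test, type = "response") > 0.5,
                        lvl[2], lvl[1]), levels = lvl)
  cm <- confusionMatrix(pred, test$type, positive = lvl[2])
  cat("W =", W, "balanced accuracy =",
      round(cm$byClass[["Balanced Accuracy"]], 3), "\n")
}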
