
I am trying to build a model in Spark ML with Zeppelin. I am new to this area and would like some help. I think I need to set the correct data types on the columns and set the first column as the label. Any help would be greatly appreciated, thank you.

val training = sc.textFile("hdfs:///ford/fordTrain.csv")
val header = training.first
val inferSchema = true  
val df = training.toDF

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(df)

// Print the coefficients and intercept for multinomial logistic regression
println(s"Coefficients: \n${lrModel.coefficientMatrix}")
println(s"Intercepts: ${lrModel.interceptVector}")

A snippet of the CSV file I am using is:

IsAlert,P1,P2,P3,P4,P5,P6,P7,P8,E1,E2
0,34.7406,9.84593,1400,42.8571,0.290601,572,104.895,0,0,0,
MichaelChirico
Young4844
  • What version of Spark are you using? – koiralo Jul 06 '17 at 14:06
  • Hi, I'm working in the Hortonworks sandbox. Multiple versions of Spark are installed, but SPARK_MAJOR_VERSION is not set, so Spark1 will be picked by default. The default is 1.6.3. – Young4844 Jul 06 '17 at 15:09
  • I'd suggest taking a look at the [example provided in the documentation](https://spark.apache.org/docs/1.6.1/ml-classification-regression.html#logistic-regression) for your version of Spark. It states that you can "Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala" in the Spark repo." – Jeremy Jul 06 '17 at 17:02

1 Answer


As you have mentioned, you are missing the features column. It is a vector containing all predictor variables. You have to create it using VectorAssembler.

IsAlert is the label and all other variables (P1, P2, ...) are predictor variables. You can create the features column (you can actually give it any name you like instead of features) by:

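import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Assemble the predictor columns into a single vector column named "features"
val assembler = new VectorAssembler()
  .setInputCols(Array("P1","P2","P3","P4","P5","P6","P7","P8","E1","E2"))
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setFeaturesCol("features")   // setting features column
  .setLabelCol("IsAlert")       // setting label column

// Chain the assembler and the classifier into one pipeline
val pipeline = new Pipeline().setStages(Array(assembler, lr))

// Fit the pipeline; df must already contain these columns as numeric types.
// The result is a PipelineModel whose last stage is the fitted LogisticRegressionModel.
val lrModel = pipeline.fit(df)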
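
The pipeline above expects `df` to have one named, numeric column per CSV field, which `sc.textFile(...).toDF` does not provide (it yields a single string column). Below is a minimal sketch of one way to load the file. It assumes Spark 2.x and the `spark` session that Zeppelin's Spark 2 interpreter or spark-shell provides; casting every column to Double is my own assumption about the data, not something stated in the question. On Spark 1.6 the built-in `csv` reader is not available, and the separate spark-csv package (`sqlContext.read.format("com.databricks.spark.csv")`) would be needed instead.

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Assumption: `spark` is the SparkSession already provided by Zeppelin / spark-shell (Spark 2.x).
// Read the CSV using its first line as column names (IsAlert, P1, ...).
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///ford/fordTrain.csv")

// Assumption: every column is numeric. Casting all of them to Double gives the
// label and the predictors one uniform numeric type.
val df = raw.select(raw.columns.map(c => col(c).cast(DoubleType).as(c)): _*)

// Inspect the resulting column names and types before fitting
df.printSchema()
```

If a column turns out not to be numeric, the cast produces nulls rather than an error, so it is worth checking the schema and a few rows before calling `pipeline.fit(df)`.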

Refer: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler.
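
Note that `pipeline.fit` returns a `PipelineModel`, not a `LogisticRegressionModel`, so the `coefficientMatrix` and `interceptVector` calls from the question have to be made on the last stage of the fitted pipeline. Here is a small self-contained sketch of that step. It assumes Spark 2.1 or later (where `coefficientMatrix` exists) and an existing `spark` session; the `toy` DataFrame and its values are made up purely for illustration and only use two of the predictor columns.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._ // assumption: `spark` is the session provided by Zeppelin / spark-shell

// Hypothetical toy data standing in for the real CSV (values are invented)
val toy = Seq(
  (0.0, 34.7, 9.8),
  (1.0, 30.1, 12.4),
  (0.0, 35.2, 9.1),
  (1.0, 29.8, 13.0)
).toDF("IsAlert", "P1", "P2")

val toyAssembler = new VectorAssembler()
  .setInputCols(Array("P1", "P2"))
  .setOutputCol("features")

val toyLr = new LogisticRegression()
  .setMaxIter(10)
  .setFeaturesCol("features")
  .setLabelCol("IsAlert")

// Fitting a Pipeline yields a PipelineModel with one fitted stage per input stage
val pipelineModel = new Pipeline().setStages(Array(toyAssembler, toyLr)).fit(toy)

// The logistic regression is the last stage, so cast it to reach its coefficients
val fitted = pipelineModel.stages.last.asInstanceOf[LogisticRegressionModel]

println(s"Coefficients: \n${fitted.coefficientMatrix}")
println(s"Intercepts: ${fitted.interceptVector}")
```

The same cast should work on the `lrModel` from the answer, since its pipeline also ends with the `LogisticRegression` stage.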

vdep