Field "features" does not exist. SparkML

Question

I am trying to build a model in Spark ML with Zeppelin. I am new to this area and would like some help. I think I need to set the correct data types on the columns and set the first column as the label. Any help would be greatly appreciated, thank you.

val training = sc.textFile("hdfs:///ford/fordTrain.csv")
val header = training.first
val inferSchema = true  
val df = training.toDF

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(df)

// Print the coefficients and intercept for multinomial logistic regression
println(s"Coefficients: \n${lrModel.coefficientMatrix}")
println(s"Intercepts: ${lrModel.interceptVector}")

A snippet of the CSV file I am using is:

IsAlert,P1,P2,P3,P4,P5,P6,P7,P8,E1,E2
0,34.7406,9.84593,1400,42.8571,0.290601,572,104.895,0,0,0,

Hi, I'm working in the Hortonworks sandbox. Multiple versions of Spark are installed but SPARK_MAJOR_VERSION is not set, so Spark1 will be picked by default. The default is 1.6.3. — Young4844, Jul 06 '17 at 15:09
I'd suggest taking a look at the [example provided in the documentation](https://spark.apache.org/docs/1.6.1/ml-classification-regression.html#logistic-regression) for your version of Spark. In there it states that you can "Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala" in the Spark repo." — Jeremy, Jul 06 '17 at 17:02

score 14 · Accepted Answer · answered Jul 07 '17 at 06:06

As you have mentioned, you are missing the features column. It is a vector containing all predictor variables. You have to create it using VectorAssembler.
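To make the idea of a features vector concrete, here is a minimal sketch that runs VectorAssembler on the single row from the question's CSV snippet. It assumes Spark 2.x (SparkSession and the DataFrame-based ml API); the local session, the app name, and the names sample, predictorCols and assembled are only illustrative and not part of the original code.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

// Local session for illustration only (assumption); Zeppelin with Spark 2.x
// usually provides a ready-made `spark` session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("features-sketch")
  .getOrCreate()
import spark.implicits._

// The one data row from the question, with every column typed as Double.
val sample = Seq(
  (0.0, 34.7406, 9.84593, 1400.0, 42.8571, 0.290601, 572.0, 104.895, 0.0, 0.0, 0.0)
).toDF("IsAlert", "P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "E1", "E2")

// Every column except the label is treated as a predictor.
val predictorCols = sample.columns.filter(_ != "IsAlert")

// VectorAssembler packs the predictor columns into one vector column.
val assembled = new VectorAssembler()
  .setInputCols(predictorCols)
  .setOutputCol("features")
  .transform(sample)

// Show the label next to the assembled vector.
assembled.select("IsAlert", "features").show(false)
```

The features column holds one vector per row with the ten predictor values, which is the input shape LogisticRegression looks for by default.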

IsAlert is the label and all other variables (P1, P2, ...) are predictor variables. You can create the features column (you can give it any name you like instead of features) with:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

//creating features column
val assembler = new VectorAssembler()
  .setInputCols(Array("P1","P2","P3","P4","P5","P6","P7","P8","E1","E2"))
  .setOutputCol("features")


val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setFeaturesCol("features")   // setting features column
  .setLabelCol("IsAlert")       // setting label column

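//creating pipeline
val pipeline = new Pipeline().setStages(Array(assembler, lr))

//fitting the pipeline; the result is a PipelineModel whose last stage is the fitted LogisticRegressionModel
//df must already contain IsAlert and the predictor columns as numeric columns
val lrModel = pipeline.fit(df)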

Reference: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler
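The pipeline above only works if df already has one numeric column per CSV field, which sc.textFile(...).toDF does not give you (it yields a single string column). Below is a rough sketch of one way to load the file with named, typed columns, assuming Spark 2.x, where DataFrameReader has a built-in csv source. On Spark 1.6 you would likely need the external com.databricks:spark-csv package instead. The temporary stand-in file, the app name, and the cast of IsAlert to double are my own assumptions for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only (assumption).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("csv-load-sketch")
  .getOrCreate()

// Stand-in file with the header and row from the question (trailing comma
// omitted) so the sketch runs without HDFS.
// With the real data, set path to "hdfs:///ford/fordTrain.csv" instead.
val tmp = Files.createTempFile("fordTrain", ".csv")
Files.write(
  tmp,
  ("IsAlert,P1,P2,P3,P4,P5,P6,P7,P8,E1,E2\n" +
   "0,34.7406,9.84593,1400,42.8571,0.290601,572,104.895,0,0,0\n").getBytes("UTF-8"))
val path = tmp.toUri.toString

val df = spark.read
  .option("header", "true")       // first line supplies the column names
  .option("inferSchema", "true")  // let Spark guess numeric types
  .csv(path)
  // Conservative choice: make the label an explicit double.
  .withColumn("IsAlert", col("IsAlert").cast("double"))

df.printSchema()
```

A DataFrame loaded this way can be passed to pipeline.fit(df) as in the answer, since the columns P1 ... E2 and IsAlert then exist under those names with numeric types.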
