I'm a newbie with Spark and need parallelizePairs()
(working in Java).
First, I've started my driver with:
SparkSession spark = SparkSession
.builder()
.appName("My App")
.config("driver", "org.postgresql.Driver")
.getOrCreate();
But spark
doesn't have the function I need, only parallelize()
through spark.sparkContext()
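To make it concrete, this is the kind of call I'm trying to write (the pair values here are just a made-up example, not my real data):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class PairsExample {
    public static void run(JavaSparkContext context) {
        // parallelizePairs() exists on JavaSparkContext, not on the
        // Scala SparkContext that spark.sparkContext() returns.
        List<Tuple2<Integer, String>> pairs = Arrays.asList(
                new Tuple2<>(1, "a"),
                new Tuple2<>(2, "b"));
        JavaPairRDD<Integer, String> pairRdd = context.parallelizePairs(pairs);
        System.out.println(pairRdd.count());
    }
}
```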
Now I'm tempted to add
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("My App");
JavaSparkContext context = new JavaSparkContext(sparkConf);
This way, context has the function I need, but I'm very confused here.
First, I never needed JavaSparkContext
because I'm running using spark-submit
and setting the master address there.
Second, why is spark.sparkContext()
not the same as JavaSparkContext,
and how do I get one from the SparkSession
?
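While searching I came across JavaSparkContext.fromSparkContext(), so I've been experimenting with wrapping the session's context like this instead of building a second one (I'm not sure this is the right approach):

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class WrapExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("My App")
                .config("driver", "org.postgresql.Driver")
                .getOrCreate();

        // Wrap the session's Scala SparkContext in the Java-friendly API
        // instead of creating a second context from a new SparkConf.
        JavaSparkContext context =
                JavaSparkContext.fromSparkContext(spark.sparkContext());
    }
}
```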
If I'm passing the master on the command line, must I also set sparkConf.setMaster("<master-address-again>")
?
I already read this: How to create SparkSession from existing SparkContext and understood the problem, but I really need the builder way because I need to pass .config("driver", "org.postgresql.Driver")
to it.
Can someone shed some light here?
EDIT
Dataset<Row> graphDatabaseTable = spark.read()
.format("jdbc")
.option("url", "jdbc:postgresql://192.168.25.103:5432/graphx")
.option("dbtable", "public.select_graphs")
.option("user", "postgres")
.option("password", "admin")
.option("driver", "org.postgresql.Driver")
.load();
SQLContext graphDatabaseContext = graphDatabaseTable.sqlContext();
graphDatabaseTable.createOrReplaceTempView("select_graphs");
String sql = "select * from select_graphs where parameter_id = " + indexParameter;
Dataset<Row> graphs = graphDatabaseContext.sql(sql);
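In case it matters, I guess the same filter could be written through the Dataset API instead of concatenating the SQL string myself (indexParameter is my own variable):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

public class FilterExample {
    public static Dataset<Row> filterGraphs(Dataset<Row> graphDatabaseTable,
                                            int indexParameter) {
        // Equivalent of: select * from select_graphs where parameter_id = ?
        // without building the SQL string by hand.
        return graphDatabaseTable
                .filter(functions.col("parameter_id").equalTo(indexParameter));
    }
}
```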