31

I have a Spark application which uses the new Spark 2.0 API with SparkSession. I am building this application on top of another application which uses SparkContext. I would like to pass the SparkContext to my application and initialize the SparkSession from the existing SparkContext.

However, I could not find a way to do that. I found that the SparkSession constructor that takes a SparkContext is private, so I can't initialize it that way, and the builder does not offer any setSparkContext method. Do you think there is some workaround?
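
For illustration, this is roughly the shape of what I'd like to write (just a sketch of my setup, not working code):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// `sc` is the SparkContext created by the host application and handed to mine
def initSession(sc: SparkContext): SparkSession = {
  // what I would like to do, but it does not compile because the constructor is private[sql]:
  // new SparkSession(sc)
  ???  // and SparkSession.builder has no setSparkContext(...) method
}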

Stefan Repcek
  • I'm not very sure, but to my knowledge there is no workaround – Balaji Reddy Mar 21 '17 at 18:22
  • yea :( so if there is no workaround there are two options left: using SparkContext in my application, or adding support for SparkSession to the application I am building on top of (it is spark-jobserver; I am using their branch spark-2.0-preview, however they still use SparkContext) – Stefan Repcek Mar 21 '17 at 18:30
  • You only need to add support for an external SparkContext to the application and access the session.sparkContext. Shouldn't be a big issue. – matfax Mar 21 '17 at 22:21
  • can you explain more what you mean by "add support for an external SparkContext"? I read that you should use just one instance of SparkContext – Stefan Repcek Mar 21 '17 at 23:31
  • I suppose the application creates its own SparkContext. Since you only want one SparkContext (for good reasons), you need to add a parameter to the application's constructor or builder that accepts the external SparkContext that you already created using the session builder. – matfax Mar 22 '17 at 01:10
  • the problem is the application I am using (spark-jobserver) doesn't allow passing my SparkContext; it creates its own – Stefan Repcek Mar 22 '17 at 11:03
  • That's why you need to edit the code of spark-jobserver (the application) not to create its own. Fork it, make your modifications, and publish it (e.g., with Jitpack). As Balaji said, there is no workaround. The only alternative is to edit Spark itself, which I wouldn't recommend. – matfax Mar 22 '17 at 12:12
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/138731/discussion-between-matthias-fax-and-stevesk). – matfax Mar 22 '17 at 12:27

6 Answers

22

Deriving the SparkSession object from a SparkContext or even a SparkConf is easy; it's just that you might find the API slightly convoluted. Here's an example (I'm using Spark 2.4, but this should work in the older 2.x releases as well):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

// If you already have a SparkContext stored in `sc`
val spark = SparkSession.builder.config(sc.getConf).getOrCreate()

// Another example which builds a SparkConf, SparkContext and SparkSession from scratch
val conf = new SparkConf().setAppName("spark-test").setMaster("local[2]")
val sc = new SparkContext(conf)
val spark = SparkSession.builder.config(sc.getConf).getOrCreate()
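
If I'm not mistaken, getOrCreate() goes through SparkContext.getOrCreate under the hood, so the resulting session wraps the SparkContext that is already running instead of starting a second one. A quick sanity check (using the `sc` and `spark` from above):

// the session's underlying context should be the very same instance
assert(spark.sparkContext eq sc)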

Hope that helps!

Rishabh
21

As noted in the answer above, you cannot create the SparkSession directly because its constructor is private. Instead you can create a SQLContext using the SparkContext, and later get the SparkSession from the SQLContext, like this:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sparkContext)
val spark = sqlContext.sparkSession

Hope this helps

philantrovert
Partha Sarathy
    When I do this in Spark 2.2, it says SQLContext is deprecated and to use SparkSession.Builder() instead – covfefe Mar 14 '18 at 22:35
  • Correct. In Spark 2, SQLContext is deprecated because everything is consolidated to the SparkSession, which is why you'd just use `SparkSession.sql()` to execute your Spark SQL, `SparkSession.sparkContext` to get the context if you need it, etc. If you're looking for Hive support (previously HiveContext), you do something like `val spark = SparkSession.builder().enableHiveSupport()` – Anthony May 22 '18 at 19:23
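
To flesh out the comment above: a Spark 2.x sketch that skips the deprecated SQLContext entirely, assuming you still start from an existing SparkContext `sc` (enableHiveSupport only if you need what HiveContext used to provide):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config(sc.getConf)     // carry over the existing context's configuration
  .enableHiveSupport()    // optional, replaces the old HiveContext
  .getOrCreate()

spark.sql("SHOW TABLES").show()   // what you would previously have done with sqlContext.sql(...)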
14

Apparently there is no way to initialize a SparkSession from an existing SparkContext.

Stefan Repcek
6
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

// Create the JavaSparkContext the application runs on
public JavaSparkContext getSparkContext()
{
        SparkConf conf = new SparkConf()
                    .setAppName("appName")
                    .setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        return jsc;
}


// Build a SparkSession directly on top of the underlying SparkContext
public SparkSession getSparkSession()
{
        SparkSession sparkSession = new SparkSession(getSparkContext().sc());
        return sparkSession;
}


You can also try using the builder:

public SparkSession getSparkSession()
{
        SparkConf conf = new SparkConf()
                        .setAppName("appName")
                        .setMaster("local");

        SparkSession sparkSession = SparkSession
                                    .builder()
                                    .config(conf)
                                    .getOrCreate();
        return sparkSession;
}
Mostwanted Mani
  • 1
    in your second method you don't use any SparkContext; in Scala I can't construct a SparkSession like in your getSparkSession() – Stefan Repcek May 10 '17 at 20:30
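
As far as I can tell, the `new SparkSession(...)` call in getSparkSession() only compiles from Java because Scala's private[sql] restriction is not enforced in the generated bytecode; from Scala you have to go through the builder instead, essentially what the next answer shows (sketch, with `jsc` being the JavaSparkContext from getSparkContext() above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .config(jsc.sc.getConf)  // reuse the existing context's configuration
  .getOrCreate()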
4
val sparkSession = SparkSession.builder.config(sc.getConf).getOrCreate()
lostsoul29
1

You would have noticed that we are using SparkSession and SparkContext, and this is not an error. Let's revisit the annals of Spark history for a perspective. It is important to understand where we came from, as you will hear about these connection objects for some time to come.

Prior to Spark 2.0.0, the three main connection objects were SparkContext, SQLContext, and HiveContext. The SparkContext object was the connection to the Spark execution environment and was used to create RDDs and other low-level resources, SQLContext worked with Spark SQL on top of the SparkContext, and HiveContext interacted with Hive stores.

Spark 2.0.0 introduced Datasets/DataFrames as the main distributed data abstraction interface and the SparkSession object as the entry point to a Spark execution environment. Appropriately, the SparkSession object is found in the namespace org.apache.spark.sql.SparkSession (Scala) or pyspark.sql.SparkSession (Python). A few points to note are as follows:

In Scala and Java, Datasets form the main data abstraction as typed data; however, for Python and R (which do not have compile time type checking), the data...

https://www.packtpub.com/mapt/book/big_data_and_business_intelligence/9781785889271/4/ch04lvl1sec31/sparksession-versus-sparkcontext
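
Tying the excerpt back to the original question, a minimal sketch of how the old and new entry points relate (plain stock Spark APIs, nothing specific to the book):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("entry-points-demo").setMaster("local[*]")

// Spark 1.x style, for reference: new SparkContext(conf), new SQLContext(sc), new HiveContext(sc)

// Spark 2.x style: one SparkSession, with the older objects reachable from it
val spark = SparkSession.builder.config(conf).getOrCreate()
val sc = spark.sparkContext      // the SparkContext lives inside the session
val sqlCtx = spark.sqlContext    // kept for backwards compatibility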