3

My application is built on MongoDB. One collection in the database holds a massive volume of data, and I have opted for Apache Spark to retrieve it and generate analytical data through calculations. I have configured the Spark Connector for MongoDB to communicate with MongoDB. I need to query the MongoDB collection using PySpark and build a dataframe consisting of the result set of the MongoDB query. Please suggest an appropriate solution.

Rubin Porwal
  • 3,218
  • 1
  • 17
  • 25
  • Why don't you try the [Stratio connector](https://github.com/Stratio/Spark-MongoDB)? this connector will return you a dataframe directly – John Zeng Aug 09 '16 at 10:01
  • @JohnZeng Following is code snippet which I have implemented using stratio connector . df = sqlContext.read.format('com.stratio.datasource.mongodb').options(host='localhost:27017', database='mydb', collection='mycoll').load() – Rubin Porwal Aug 09 '16 at 10:04
  • Could you please edit your question and paste the snippet? I think you have already gotten a dataframe after you call this. I am confused about what you want now, because your question links to MongoDB's connector. – John Zeng Aug 09 '16 at 10:11
  • If you do need to create a dataframe from an RDD, you can refer to this document: [inter operating with rdd](http://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds) (a minimal sketch follows after these comments) – John Zeng Aug 09 '16 at 10:16
  • @JohnZeng Thank you – Rubin Porwal Aug 09 '16 at 10:28
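
Picking up on the RDD comment above, here is a minimal sketch of turning an RDD into a dataframe with PySpark. It assumes an existing sc and sqlContext, and the field names are purely illustrative:

# Build a DataFrame from an RDD of dicts by converting each dict to a Row
# and letting Spark infer the schema (illustrative field names only).
from pyspark.sql import Row

rdd = sc.parallelize([{"name": "Alice", "age": 34}, {"name": "Bob", "age": 28}])
rows = rdd.map(lambda d: Row(**d))
df = sqlContext.createDataFrame(rows)
df.printSchema()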

2 Answers

7

You can load the data directly into a dataframe like so:

# Create the dataframe
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", "mongodb://127.0.0.1/mydb.mycoll").load()

# Filter the data via the DataFrame API
df.filter(df.age > 30)

# Filter via sql
df.registerTempTable("people")
over_thirty = sqlContext.sql("SELECT name, age FROM people WHERE age > 30")

For more information see the Mongo Spark connector Python API section or the introduction.py. The SQL queries and filters are translated and passed back to the connector so that the data can be queried in MongoDB before being sent to the Spark cluster.
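
If you want to confirm that a filter is actually pushed down to MongoDB rather than applied in Spark, one way (a sketch, assuming the df and the people temp table created above) is to inspect the physical plan, which lists the predicates pushed to the data source scan:

# Inspect the physical plan; pushed-down predicates appear in the scan node
# (e.g. as PushedFilters), confirming the filtering happens at the source.
df.filter(df.age > 30).explain()
sqlContext.sql("SELECT name, age FROM people WHERE age > 30").explain()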

You can also provide your own aggregation pipeline to apply to the collection before returning results into Spark:

dfr = sqlContext.read.option("pipeline", "[{ $match: { name: { $exists: true } } }]")
df = dfr.option("uri", ...).format("com.mongodb.spark.sql.DefaultSource").load()
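
Putting those option calls together, a complete pipeline-backed read might look like the sketch below; the URI, database and collection are placeholders:

# Sketch of a single read that applies an aggregation pipeline in MongoDB
# before the results reach Spark (placeholder URI, database and collection).
pipeline = "[{ $match: { name: { $exists: true } } }]"
df = (sqlContext.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .option("uri", "mongodb://127.0.0.1/mydb.mycoll")
      .option("pipeline", pipeline)
      .load())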
Ross
  • 16,827
  • 2
  • 47
  • 69
  • Thank you @Ross .But I directly need to apply filter into database query instead of applying filter across data frame – Rubin Porwal Aug 09 '16 at 10:29
  • That will translate it into a query on the collection; the connector will then return the filtered results. – Ross Aug 09 '16 at 12:59
  • Will you please elaborate it through a code snippet – Rubin Porwal Aug 09 '16 at 13:01
  • I've clarified what happens when using filters and SQL on dataframes. I've also added an example of providing an aggregation pipeline. – Ross Aug 09 '16 at 13:18
  • @Ross , I think this code first reads the data from mongo collection and then applies filter on dataframe. I ran this code on my mongodb collection which is more than 10gb in size. and it took a lot of time to load data and still running. Do you know any other way to apply filtering directly on spark.read() process? – Joseph D Mar 22 '21 at 10:27
1

In my case filtering did not give the expected performance, as all of the filtering happened in Spark and not in MongoDB. To improve performance, I had to pass a manual aggregation pipeline when loading the data. This can be a bit difficult to find, since the official documentation only covers how to do it with RDDs.

After a lot of trials I managed to do this with Scala dataframes:

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.DataFrame

val pipeLine = "[{ $match: { 'data.account.status': 'ACTIVE', " +
  "'data.account.activationDate': { $gte: '2020-10-11', $lte: '2020-10-13' } } }]"

val readConfig: ReadConfig = ReadConfig(
  Map(
    "uri" -> getMongoURI(),
    "database" -> dataBaseName,
    "collection" -> collection,
    "pipeline" -> pipeLine
  )
)

// This one took 260 seconds
val df: DataFrame = MongoSpark.load(sparkSession, readConfig)
df.count()
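
Since the question itself is about PySpark, the same pipeline-on-read idea can be expressed in Python as well. This is only a sketch: the connection string, database, collection and field names mirror the Scala example and are placeholders to adapt:

# PySpark sketch of the same approach: the pipeline option makes the $match
# run inside MongoDB before any documents are shipped to Spark.
pipeline = ("[{ $match: { 'data.account.status': 'ACTIVE', "
            "'data.account.activationDate': { $gte: '2020-10-11', $lte: '2020-10-13' } } }]")
df = (spark.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .option("uri", mongo_uri)           # placeholder connection string
      .option("database", database_name)  # placeholder database name
      .option("collection", collection)   # placeholder collection name
      .option("pipeline", pipeline)
      .load())
df.count()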

The alternative, using filters and no pipeline, fetches all the data into Spark. That should not be the case, but I presume it has to do with the query used.

val readConfig: ReadConfig = ReadConfig(
  Map(
    "uri" -> getMongoURI(),
    "database" -> dataBaseName,
    "collection" -> collection
  )
)
// This one took 560 seconds
val df: DataFrame = MongoSpark.load(sparkSession, readConfig)
df.filter("data.account.status == 'ACTIVE' AND " +
      "data.account.activationDate>= '2020-05-13' AND data.account.activationDate <= '2021-06-05'"
    ).count()

I did some tests fetching 400K documents from a local MongoDB instance holding 1.4M documents in total:

  1. (RDD + pipeline), as per official documentation: 144 seconds
  2. (DF + pipeline), as per example above: 260 seconds
  3. (DF with filters), as per example above: 560 seconds
  4. (RDD + pipeline).toDF: 736 seconds

We finally went for the second option, because of some other high-level benefits of working with dataframes vs RDDs.

Finally, don't forget to create the correct indexes in MongoDB!
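
For example, a compound index covering the $match fields above could be created with pymongo; this is a sketch only, with placeholder connection string, database and collection names:

# Create a compound index supporting the status/activationDate $match above
# (placeholder connection string and names; adjust to your deployment).
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["mycoll"]
coll.create_index([("data.account.status", ASCENDING),
                   ("data.account.activationDate", ASCENDING)])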

Edit: I am using spark-sql 2.3.1, mongo-spark-connector 2.3.2 and mongo-java-driver 3.12.3.

user1485864
  • 409
  • 4
  • 16