
So, having spent many years in an object-oriented world where code reuse, design patterns and best practices were always taken into account, I find myself struggling somewhat with code organization and code reuse in the world of Spark.

If I try to write code in a reusable way, it nearly always comes with a performance cost and I end up rewriting it to whatever is optimal for my particular use case. This constant "write what is optimal for this particular use case" also affects code organization, because splitting code into different objects or modules is difficult when "it all really belongs together", and I thus end up with a few "God" objects containing long chains of complex transformations. In fact, I frequently think that if I had taken a look at most of the Spark code I'm writing now back when I was working in the object-oriented world, I would have winced and dismissed it as "spaghetti code".

I have surfed the internet trying to find some sort of equivalent to the best practices of the object-oriented world, but without much luck. I can find some "best practices" for functional programming, but Spark just adds an extra layer, because performance is such a major factor here.

So my question to you is, have any of you Spark gurus found some best practices for writing Spark code that you can recommend?

EDIT

As written in a comment, I did not actually expect anyone to post an answer on how to solve this problem, but rather I was hoping that someone in this community had come across some Martin Fowler type who had written some articles or blog posts somewhere on how to address problems with code organization in the world of Spark.

@DanielDarabos suggested that I might put in an example of a situation where code organization and performance are conflicting. While I frequently run into this in my everyday work, I find it a bit hard to boil it down to a good minimal example ;) but I will try.

In the object-oriented world, I'm a big fan of the Single Responsibility Principle (SRP), so I would make sure that my methods were only responsible for one thing. It makes them reusable and easily testable. So if I had to, say, calculate the sum of some numbers in a list (matching some criteria) and I had to calculate the average of the same numbers, I would most definitely create two methods - one that calculates the sum and one that calculates the average. Like this:

def main(args: Array[String]): Unit = {
  val list = List(("DK", 1.2), ("DK", 1.4), ("SE", 1.5))

  println("Summed weights for DK = " + summedWeights(list, "DK"))
  println("Averaged weights for DK = " + averagedWeights(list, "DK"))
}

def summedWeights(list: List[(String, Double)], country: String): Double = {
  list.filter(_._1 == country).map(_._2).sum
}

def averagedWeights(list: List[(String, Double)], country: String): Double = {
  val filteredByCountry = list.filter(_._1 == country)
  filteredByCountry.map(_._2).sum / filteredByCountry.length
}

I can of course continue to honor SRP in Spark:

def main(args: Array[String]): Unit = {
  // assumes a SQLContext named sqlContext is in scope and its implicits are imported, so toDF is available
  val df = List(("DK", 1.2), ("DK", 1.4), ("SE", 1.5)).toDF("country", "weight")

  println("Summed weights for DK = " + summedWeights(df, "DK", sqlContext))
  println("Averaged weights for DK = " + avgWeights(df, "DK", sqlContext))
}


def avgWeights(df: DataFrame, country: String, sqlContext: SQLContext): Double = {
  import org.apache.spark.sql.functions._
  import sqlContext.implicits._

  val countrySpecific = df.filter('country === country)
  val averagedWeight = countrySpecific.agg(avg('weight))

  averagedWeight.first().getDouble(0)
}

def summedWeights(df: DataFrame, country: String, sqlContext: SQLContext): Double = {
  import org.apache.spark.sql.functions._
  import sqlContext.implicits._

  val countrySpecific = df.filter('country === country)
  val summedWeight = countrySpecific.agg(sum('weight))

  summedWeight.first().getDouble(0)
}

But because my df may contain billions of rows I would rather not have to perform the filter twice. In fact, performance is directly coupled to EMR cost, so I REALLY don't want that. To overcome it, I thus decide to violate SRP and simply put the two functions in one and make sure I call persist on the country-filtered DataFrame, like this:

def summedAndAveragedWeights(df: DataFrame, country: String, sqlContext: SQLContext): (Double, Double) = {
  import org.apache.spark.sql.functions._
  import org.apache.spark.storage.StorageLevel
  import sqlContext.implicits._

  val countrySpecific = df.filter('country === country).persist(StorageLevel.MEMORY_AND_DISK_SER)
  val summedWeights = countrySpecific.agg(sum('weight)).first().getDouble(0)
  val averagedWeights = summedWeights / countrySpecific.count()

  (summedWeights, averagedWeights)
}

Now, this example is of course a huge simplification of what's encountered in real life. Here I could simply solve it by filtering and persisting df before handing it to the sum and avg functions (which would also be more SRP), but in real life there may be a number of intermediate calculations that are needed again and again. In other words, the filter function here is merely an attempt to make a simple example of something that will benefit from being persisted. In fact, I think the calls to persist are the key here. Calling persist will vastly speed up my job, but the cost is that I have to tightly couple all code that depends on the persisted DataFrame - even if the pieces are logically separate.
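Just to make that alternative explicit, here is a minimal sketch of what filtering and persisting up front, and then passing the pre-filtered DataFrame to small single-purpose functions, could look like. The names filterByCountry, summedWeights and averagedWeights below are purely illustrative, and this is not meant as a solution - it only moves the coupling to whoever owns the persist/unpersist lifecycle:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, sum}
import org.apache.spark.storage.StorageLevel

// Filtering lives in one place...
def filterByCountry(df: DataFrame, country: String): DataFrame =
  df.filter(col("country") === country)

// ...and each aggregation stays single-purpose and trivially testable.
def summedWeights(df: DataFrame): Double =
  df.agg(sum(col("weight"))).first().getDouble(0)

def averagedWeights(df: DataFrame): Double =
  df.agg(avg(col("weight"))).first().getDouble(0)

// The caller now owns the caching decision:
// val dk = filterByCountry(df, "DK").persist(StorageLevel.MEMORY_AND_DISK_SER)
// val (total, mean) = (summedWeights(dk), averagedWeights(dk))
// dk.unpersist()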

  • Any language in particular? I'm not a guru at all, but for Java and Scala I don't think there's a reason not to structure your code following their own standards. The Databricks Reference Apps (https://github.com/databricks/reference-apps/tree/master/timeseries) are a really good place to start structuring Spark projects. Hope it helps! – Marco Sep 25 '15 at 09:24
  • I explore different methods and have that spaghetti code you're talking about when I first get to know a dataset. Then I think about how to classify the data I'm working with. How does it mutate, etc. From there, the classical software design patterns tend to work for me. – Myles Baker Mar 07 '16 at 22:47
  • Also - I don't see an industry consensus on how to write efficient, scalable code in a distributed environment that is reusable. The patterns are typically highly coupled with data, so you have to work diligently to create data using agreed-upon standards. For some problems this will never be efficient enough. – Myles Baker Mar 07 '16 at 22:52
  • I totally understand the question, but I think it covers all Stack Overflow close reasons except Duplicate. What do you think of another way of putting it? Maybe show a minimal example of a case where code organization and performance are conflicting goals, and ask how that conflict could be resolved. I think it would work well as an addition to the current question content, and make it possible to give a specific answer. – Daniel Darabos Mar 09 '16 at 17:41
  • I think a lot of the mess of Spark comes from SQL itself. If you think about what you had to do to create these long, text-based SQL statements, that's a mess too. I'm no guru, but one thing I would recommend is to a) use `DataFrames`; and b) use the versions of `select`, `groupBy` etc. that take `Column*` args. I find it a lot cleaner to create a `Seq[Column]` on the fly than a text string that represents the same thing (a small sketch of this follows after these comments). So don't just `registerTempTable` and start throwing SQL strings at it. – David Griffin Mar 11 '16 at 04:55
  • @DavidGriffin I actually always use DataFrames, and I have never been a fan of spark-sql, so I don't use that - but the problem remains. I understand Daniel Darabos' comment, and to be honest I think the topic is worthy of a book, but when I posted the question, I was hoping that someone out there in the SO community had come across - or perhaps written themselves - some blog posts or articles on the subject that they could refer to. – Glennie Helles Sindholt Mar 11 '16 at 08:32
  • @DanielDarabos I understand your comment and I'm thinking of a good example to put with the question. – Glennie Helles Sindholt Mar 11 '16 at 08:34
  • Weakness of SO in that this was closed. – thebluephantom Mar 05 '19 at 23:06
  • In my opinion, the problem is that you are breaking the SRP by performing a filter in both methods; if you create a filter method and then pass the data subset to these methods, the problem is solved. – Óscar Andreu Mar 18 '19 at 08:16
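To illustrate @DavidGriffin's point above about passing Column* arguments rather than SQL strings, here is a minimal sketch. The column names and the groupBy are an assumed example for illustration, not taken from the question:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{avg, col, sum}

// Build the aggregation expressions as values instead of as a SQL string,
// so they can be composed, reused and unit-tested like any other value.
val weightAggregations: Seq[Column] = Seq(
  sum(col("weight")).as("total_weight"),
  avg(col("weight")).as("avg_weight")
)

def weightStatsByCountry(df: DataFrame): DataFrame =
  df.groupBy(col("country")).agg(weightAggregations.head, weightAggregations.tail: _*)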

1 Answer


I think you can subscribe to the Apache Spark and Databricks channels on YouTube, listen more and learn more, especially from the experiences and lessons of others.

Here are some recommended videos:

I've also posted them, and am still updating the list, on my GitHub and blog:

Hope this can help you ~

  • I want to thank you for the comprehensive list and I agree with you that there are lots of lessons to be learned from others (kind of why I posted the question in the first place ;-)). However, I actually feel that I have a fairly good understanding of how Spark works - it's more a matter of writing code in a way that is not only optimal with respect to performance but also allows the use of good coding practices, such as separation of concerns. For now at least, I find the two concepts are more or less mutually exclusive. – Glennie Helles Sindholt Apr 15 '16 at 07:54
  • @Glennie, Right now I am in the same situation as you were. It would be really helpful if you could share how you overcome these issues – Jeevan Oct 18 '18 at 15:04
  • I truly wish I could tell you that I have found a way to overcome them - I haven't :( I have just come to accept that code in the functional programming world is tightly coupled. The only upside is that with functional programming, I write much, MUCH less code, so having duplicated code is less problematic than in OOP... – Glennie Helles Sindholt Oct 19 '18 at 09:34
  • DWH and BI with say Informatica or BODS really only have reusable modules for LKPs, conversions like UDFs imho. I concur with your observations. – thebluephantom Mar 05 '19 at 22:38