Suppose I have a dataset:

+----+----+----+-----+----+
|col0|col1|col2| col3|col4|
+----+----+----+-----+----+
|  t0|  10| 100|cat26|30.9|
|  t1|  20| 200|cat13|22.1|
|  t2|  30| 300|cat26|30.9|
|  t3|  40| 400|cat26|30.9|
|  t4|  50| 500|cat15|15.3|
|  t5|  60| 600|cat13|22.1|
+----+----+----+-----+----+

I select a sub-dataset using where():

  Dataset<Row> subDf = dF.where("col3 = 'cat26'");

Yielding:

     +----+----+----+-----+----+
     |col0|col1|col2| col3|col4|
     +----+----+----+-----+----+
     |  t0|  10| 100|cat26|30.9|
     |  t2|  30| 300|cat26|30.9|
     |  t3|  40| 400|cat26|30.9|
     +----+----+----+-----+----+

I want to create the different combinations of these three rows, each combination forming its own dataset.

An example of a combination is as follows:

     +----+----+----+-----+----+
     |col0|col1|col2| col3|col4|
     +----+----+----+-----+----+
     |  t0|  10| 100|cat26|30.9|
     |  t2|  30| 300|cat26|30.9|
     +----+----+----+-----+----+

Another one is:

 +----+----+----+-----+----+
 |col0|col1|col2| col3|col4|
 +----+----+----+-----+----+
 |  t2|  30| 300|cat26|30.9|
 |  t3|  40| 400|cat26|30.9|
 +----+----+----+-----+----+

A third possible combination is as follows:

     +----+----+----+-----+----+
     |col0|col1|col2| col3|col4|
     +----+----+----+-----+----+
     |  t0|  10| 100|cat26|30.9|
     |  t3|  40| 400|cat26|30.9|
     +----+----+----+-----+----+

And the last one:

     +----+----+----+-----+----+
     |col0|col1|col2| col3|col4|
     +----+----+----+-----+----+
     |  t0|  10| 100|cat26|30.9|
     |  t2|  30| 300|cat26|30.9|
     |  t3|  40| 400|cat26|30.9|
     +----+----+----+-----+----+

PS: in the previous examples, each combination (a dataset) has at least two rows.

How can I achieve this in Java? Thank you.

1 Answer


First things first: the last combination, with the t0, t2, t3 rows, is the same as the output of the simple where query, so we only need to focus on the remaining unique pairs:

  • {t0, t2}
  • {t0, t3}
  • {t2, t3}

We basically need to query the given DataFrame with the where method once more for every combination, because SQL lets us match rows against one or more values of a single column, like this: WHERE col0 == "t0" OR col0 == "t2". That means we first need to isolate and retrieve the list of col0 values that remain after the where filter you used to drop everything not matching cat26 in col3 (the Java snippet is based on this answer by user12910640).

// Scala
val rowList = input.where("col3 == \"cat26\"")
          .select("col0")
          .rdd
          .map(row => row(0).toString)
          .collect()
          .toList

// Java (uses org.apache.spark.api.java.function.Function)
List<String> rowList = input.where("col3 == \"cat26\"")
          .select("col0")
          .toJavaRDD()
          .map(new Function<Row, String>() {
              public String call(Row row) { return row.getAs("col0").toString(); }
          })
          .collect();

rowList now has 3 elements in it: t0, t2, t3. We just need to iterate through this list in a way that avoids duplicate pairs (e.g. {t0, t2} and {t2, t0}) and pairs of the same element (e.g. {t0, t0}). To rule out both cases, the inner loop only scans the elements from the current position of the outer loop onward (you can understand it better by looking at the code), and we check that the two elements are not equal before we execute a query on the DataFrame.

// Scala
for(i <- 0 until rowList.size)
{
    for(j <- i until rowList.size)
    {
        if(!rowList(i).equals(rowList(j)))
            input.where("col0 == \"" + rowList(i) + "\" OR col0 == \"" + rowList(j) + "\"").show()
    }
}

// Java
for (int i = 0; i < rowList.size(); i++)
{
    for (int j = i; j < rowList.size(); j++)
    {
        if (!rowList.get(i).equals(rowList.get(j)))
            input.where("col0 == \"" + rowList.get(i) + "\" OR col0 == \"" + rowList.get(j) + "\"").show();
    }
}

Through this loop we can see the desired output DataFrames in the console:

+----+----+----+-----+----+
|col0|col1|col2| col3|col4|
+----+----+----+-----+----+
|  t0|  10| 100|cat26|30.9|
|  t2|  30| 300|cat26|30.9|
+----+----+----+-----+----+

+----+----+----+-----+----+
|col0|col1|col2| col3|col4|
+----+----+----+-----+----+
|  t0|  10| 100|cat26|30.9|
|  t3|  40| 400|cat26|30.9|
+----+----+----+-----+----+

+----+----+----+-----+----+
|col0|col1|col2| col3|col4|
+----+----+----+-----+----+
|  t2|  30| 300|cat26|30.9|
|  t3|  40| 400|cat26|30.9|
+----+----+----+-----+----+

You can alternatively store those outputs in a list of DataFrames, along with the first one that matches the t0, t2, t3 rows, instead of just showing them in the console.
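
For example, a minimal, untested sketch of that alternative could look like this, reusing the same `input` and `rowList` as above (`combinations` is just an illustrative name):

// Java — untested sketch; needs java.util.ArrayList, java.util.List,
// org.apache.spark.sql.Dataset and org.apache.spark.sql.Row
List<Dataset<Row>> combinations = new ArrayList<>();

// The full {t0, t2, t3} combination is simply the initial filter itself.
combinations.add(input.where("col3 == \"cat26\""));

// Every unique pair; starting j at i + 1 also removes the need for the equality check.
for (int i = 0; i < rowList.size(); i++) {
    for (int j = i + 1; j < rowList.size(); j++) {
        combinations.add(input.where(
                "col0 == \"" + rowList.get(i) + "\" OR col0 == \"" + rowList.get(j) + "\""));
    }
}

Keep in mind that `where` is lazy, so nothing is actually computed until you call `show()`, `collect()` or similar on each Dataset in `combinations`.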

Coursal
  • Does your code work if the original dataset were composed of 4 rows and we needed combinations of 3 rows, as well as one of 4 rows? – Martin Moore Apr 27 '21 at 14:49
  • In that case, all you have to change is adding another `for` loop to iterate over a third row, changing the `if` expression so it checks for three same elements per loop instead of two, and modifying the `where` expression accordingly. The logic stays the same, of course (see the sketch after this thread). – Coursal Apr 27 '21 at 16:13
  • Is there a Spark-native method that can do that for n rows (instead of the discussed 3 or 4 rows)? – Martin Moore Apr 27 '21 at 16:34
  • Maybe using a groupBy? – Martin Moore Apr 27 '21 at 17:01
  • I know for sure that Spark does have a way to do that kind of combination based on _columns_, but I doubt there's anything for doing it based on rows. It is also a heavy computation, since the time complexity for this example goes through the roof, and that can't scale well enough either way. Your best bet is to parameterize as much as you can, like having a check for how many rows are to be used for the combinations, but you will always need to specify them as written above in the last `where` command. You can create a new question about this to be sure, though. – Coursal Apr 27 '21 at 17:15
  • Rather than looping many times, why can't it be like this: get the records through a filter, then find combinations of the indexes of the rows, then select the rows? – Martin Moore Apr 27 '21 at 21:15
  • But that _is_ the logic here: we filter the records as you did, we extract the row values that serve as the indices here, and use them to select the rows. You can't bypass looping this many times, because that's the bad thing about all possible combinations without any duplicates; they scale badly. (Here are two answers with the math behind them, to have a better look at why nested loops are a bitter necessity: https://stackoverflow.com/a/37847498/5644037 and https://stackoverflow.com/a/18859475/5644037) – Coursal Apr 27 '21 at 21:38
  • I can apply a basic Java method to find combinations of an array of integers: the indexes of the rows. After that, iterate over the dataset to get those rows by index, or select them after collecting them as a list. – Martin Moore Apr 27 '21 at 21:44
  • Ah, now I get it. You are mainly talking about a standard single-node application, and I was talking about a more in-parallel approach. Yes, I believe you can do such a thing and it can work just fine. It will be a serial setback for parallel execution on a cluster, though, but you can experiment with that approach of yours. Just remember that Spark is designed to make the best out of many machines, so parallelism is the goal most of the time. – Coursal Apr 27 '21 at 21:50
  • Alright, thanks! – Martin Moore Apr 27 '21 at 21:54
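
Following up on the comments, an untested, single-node sketch of the approach Martin describes could look like this: enumerate the index combinations of every size in plain Java first, then issue one `where` query per combination. The `combine` helper is a hypothetical name, the recursion is just one standard way to enumerate combinations, and the condition string is built the same way as in the answer above.

// Java — hypothetical sketch; needs java.util.* and java.util.stream.Collectors
static void combine(List<String> values, int start, int size,
                    Deque<String> current, List<List<String>> result) {
    if (current.size() == size) {            // a full combination of the requested size
        result.add(new ArrayList<>(current));
        return;
    }
    for (int i = start; i < values.size(); i++) {
        current.addLast(values.get(i));      // choose this value...
        combine(values, i + 1, size, current, result);
        current.removeLast();                // ...then backtrack
    }
}

// Usage: every combination of at least two rows, as in the question.
for (int size = 2; size <= rowList.size(); size++) {
    List<List<String>> result = new ArrayList<>();
    combine(rowList, 0, size, new ArrayDeque<>(), result);
    for (List<String> combo : result) {
        String condition = combo.stream()
                .map(v -> "col0 == \"" + v + "\"")
                .collect(Collectors.joining(" OR "));
        input.where(condition).show();
    }
}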