Questions tagged [data-partitioning]

Data partitioning is the division of a collection of data into smaller collections for the purpose of faster processing, easier statistics gathering, and a smaller memory/persistence footprint.

277 questions
7
votes
3 answers

U-SQL Output in Azure Data Lake

Would it be possible to automatically split a table into several files based on column values if I don't know how many different key values the table contains? Is it possible to put the key value into the filename?
peterko
  • 403
  • 1
  • 5
  • 14
7
votes
6 answers

Algorithm to generate all unique permutations of fixed-length integer partitions?

I'm searching for an algorithm that generates all permutations of fixed-length partitions of an integer. Order does not matter. For example, for n=4 and length L=3: [(0, 2, 2), (2, 0, 2), (2, 2, 0), (2, 1, 1), (1, 2, 1), (1, 1, 2), (0, 1, 3), (0,…
deleted77
  • 149
  • 1
  • 5
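
One possible reading of the question above, assuming zeros are allowed as parts (the sample output suggests so): the result is every ordered tuple of L non-negative integers summing to n. The function name `weak_compositions` below is made up for this sketch; it is a minimal recursive illustration, not the accepted answer.

```python
from typing import Iterator, Tuple


def weak_compositions(n: int, length: int) -> Iterator[Tuple[int, ...]]:
    """Yield every tuple of `length` non-negative integers that sums to `n`."""
    if length == 0:
        # Only the empty tuple sums to 0; nothing sums to a positive n.
        if n == 0:
            yield ()
        return
    # Choose the first part, then recurse on the remainder.
    for first in range(n + 1):
        for rest in weak_compositions(n - first, length - 1):
            yield (first,) + rest


results = list(weak_compositions(4, 3))
print(len(results))  # 15 tuples for n=4, L=3
```

Each tuple is produced exactly once because the first part fixes the branch of the recursion, so no deduplication step is needed.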
7
votes
5 answers

3D clustering Algorithm

Problem Statement: I have the following problem: There are more than a billion points in 3D space. The goal is to find the top N points that have the largest number of neighbors within a given distance R. Another condition is that the distance between any…
6
votes
2 answers

Scala multi-partition a map - type mismatch; Found (A,B) => Boolean required (A,B) => Boolean?

I'm trying to multi-partition a map based on a list of predicates. I wrote the following function to do that: def multipartition[A,B](map : Map[A,B], list : List[(A,B) => Boolean]) : List[Map[A,B]] = list match { case Nil => …
6
votes
2 answers

Lomuto's Partition, stable or not?

I searched the web and my algorithms book to find out whether Lomuto's partition scheme for quicksort is stable (I know that Hoare's version is unstable), but I didn't find a precise answer. So I've tried to make some examples and it…
Gengiolo
  • 563
  • 5
  • 14
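
A small experiment in the spirit of the question above, assuming the common textbook form of Lomuto's scheme (last element as pivot, `<=` comparison); other variants may behave differently. The helper name and the sample records are made up for illustration.

```python
from typing import Any, Callable, List


def lomuto_partition(a: List[Any], lo: int, hi: int,
                     key: Callable[[Any], Any]) -> int:
    """Partition a[lo..hi] in place around a[hi]; return the pivot's final index."""
    pivot = key(a[hi])
    i = lo  # next slot for an element that is <= pivot
    for j in range(lo, hi):
        if key(a[j]) <= pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    # Move the pivot between the two regions.
    a[i], a[hi] = a[hi], a[i]
    return i


# Two records share key 2; the letter records their original order.
records = [(2, "a"), (2, "b"), (1, "c")]
pivot_index = lomuto_partition(records, 0, len(records) - 1, key=lambda r: r[0])
print(pivot_index, records)  # 0 [(1, 'c'), (2, 'b'), (2, 'a')]
```

In this run the final pivot swap moves (2, "a") behind (2, "b"), so equal keys do not keep their relative order; one counterexample like this is enough to show that this variant is not stable.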
6
votes
3 answers

Repository that supports querying by partition key without changing the interface

I am developing an application that uses IDocumentClient to query CosmosDB. My GenericRepository supports querying by Id and Predicate. I ran into trouble when changing the database from SqlServer to CosmosDb: in CosmosDb, we have a partition key.…
Tấn Sang
  • 1,375
  • 9
  • 21
6
votes
1 answer

What is the difference between partitioning and bucketing in Spark?

I'm trying to optimize a join query between two Spark dataframes, let's call them df1, df2 (join on common column "SaleId"). df1 is very small (5M) so I broadcast it among the nodes of the Spark cluster. df2 is very large (200M rows) so I tried to…
nofar mishraki
  • 364
  • 3
  • 11
6
votes
2 answers

Get Lead value over multiple partitions

I have a problem that I feel could be solved using lag/lead + partitions but I can't wrap my head around it. Clients are invited to participate in research projects every two years (approx.). A number of clients is selected for each project. Some…
Henrov
  • 1,493
  • 1
  • 17
  • 45
6
votes
1 answer

How does createDataPartition function from caret package split data?

From the documentation: For bootstrap samples, simple random sampling is used. For other data splitting, the random sampling is done within the levels of y when y is a factor in an attempt to balance the class distributions within the splits. For…
happy_sisyphus
  • 1,224
  • 1
  • 15
  • 24
6
votes
3 answers

How to partition a set of values (vector) in R

I'm programming in R. I've got a vector containing, let's say, 1000 values. Now let's say I want to partition these 1000 values randomly into two new sets, one containing 400 values and the other containing 600. How could I do this? I've thought…
Daniel Standage
  • 7,189
  • 14
  • 61
  • 101
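
The question above is about R, but the underlying idea (shuffle once, then cut at the desired size) is language-independent. Below is a minimal Python sketch of that idea; the function name `random_split` and the `seed` parameter are assumptions made for this example, not part of any library.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_split(values: Sequence[T], first_size: int,
                 seed: Optional[int] = None) -> Tuple[List[T], List[T]]:
    """Randomly split `values` into two disjoint lists of sizes
    `first_size` and `len(values) - first_size`."""
    if not 0 <= first_size <= len(values):
        raise ValueError("first_size must be between 0 and len(values)")
    rng = random.Random(seed)  # local generator, so global state is untouched
    shuffled = list(values)    # copy; the input is not modified
    rng.shuffle(shuffled)
    return shuffled[:first_size], shuffled[first_size:]


values = list(range(1000))
small, large = random_split(values, 400, seed=42)
print(len(small), len(large))  # 400 600
```

Shuffling a copy and slicing it guarantees that every value lands in exactly one of the two sets, which sampling each set independently would not.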
6
votes
2 answers

Creating data partition in R

With the caret package, when creating a data partition of 75% training and 25% test, we use: inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE). Note: the dataset is named spam and the target variable is named type. My question is, what is the purpose of…
Aiden
  • 71
  • 1
  • 1
  • 3
6
votes
6 answers

Need algorithm for fast storage and retrieval (search) of sets and subsets

I need a way of storing sets of arbitrary size for fast query later on. I'll be needing to query the resulting data structure for subsets or sets that are already stored. === Later edit: To clarify, an accepted answer to this question would be a…
Ed Rowlett-Barbu
  • 1,489
  • 9
  • 24
6
votes
4 answers

Iterator over all partitions into k groups?

Say I have a list L. How can I get an iterator over all partitions of K groups? Example: L = [ 2,3,5,7,11, 13], K = 3 List of all possible partitions of 3 groups: [ [ 2 ], [ 3, 5], [ 7,11,13] ] [ [ 2,3,5 ], [ 7, 11], [ 13] ] [ [ 3, 11 ], [ 5, 7], […
usual me
  • 7,130
  • 6
  • 40
  • 81
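
The examples in the question above include groups such as [3, 11], which suggests that groups need not be contiguous and that neither the order of groups nor the order inside a group matters. Under that assumption, a lazy generator can follow the usual recurrence for partitions into k blocks: the first element either forms a group on its own or joins one of the k groups built from the remaining elements. The name `partitions_into_k` is made up for this sketch.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def partitions_into_k(items: Sequence[T], k: int) -> Iterator[List[List[T]]]:
    """Lazily yield every split of `items` into exactly k non-empty groups."""
    n = len(items)
    if k < 1 or k > n:
        return  # no valid partition exists
    if k == 1:
        yield [list(items)]
        return
    if k == n:
        yield [[x] for x in items]
        return
    first, rest = items[0], items[1:]
    # Case 1: `first` forms a group on its own.
    for partition in partitions_into_k(rest, k - 1):
        yield [[first]] + partition
    # Case 2: `first` joins one of the k groups of a partition of the rest.
    for partition in partitions_into_k(rest, k):
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]


L = [2, 3, 5, 7, 11, 13]
all_partitions = list(partitions_into_k(L, 3))
print(len(all_partitions))  # 90
```

Because the function is a generator, callers can iterate without materializing all partitions; the `list(...)` call here is only for counting.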
5
votes
3 answers

Partition/split/section IEnumerable<T> into IEnumerable<IEnumerable<T>> based on a function using LINQ?

I'd like to split a sequence in C# to a sequence of sequences using LINQ. I've done some investigation, and the closest SO article I've found that is slightly related is this. However, this question only asks how to partition the original sequence…
a developer
  • 393
  • 1
  • 4
  • 12
5
votes
3 answers

Is it acceptable to have the same input multiple times in machine learning (with different output)?

I was wondering whether in machine learning it is acceptable to have a dataset that may contain the same input multiple times, but each time with another (valid!) output. For instance in the case of machine translation, an input sentence but each…
Bram Vanroy
  • 22,919
  • 16
  • 101
  • 195