Questions tagged [partitioning]

Partitioning is a performance strategy whereby you divide possibly very large groups of data into some number of smaller groups of data.

The expectation is that, for algorithms whose cost grows faster than linearly in N, the total time to process the smaller groups and combine the results is still less than the time it would take to process the single larger data set.
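As a rough illustration of that expectation, the sketch below counts the work done by a hypothetical all-pairs comparison (quadratic in N) on one large group versus on k smaller groups. The function names are made up for this example, and the cost of combining the partial results is deliberately left out, so the numbers are an upper bound on the benefit rather than a measurement.

```python
def pairwise_cost(n):
    """Number of comparisons an all-pairs algorithm performs on n items."""
    return n * (n - 1) // 2


def partitioned_cost(n, k):
    """Total comparisons when n items are split into k near-equal groups.

    Assumption: the groups are processed independently and the cost of
    merging their results is ignored.
    """
    base, extra = divmod(n, k)
    # The first `extra` groups get one additional item each.
    sizes = [base + 1] * extra + [base] * (k - extra)
    return sum(pairwise_cost(size) for size in sizes)


if __name__ == "__main__":
    n, k = 1000, 10
    print("one group: ", pairwise_cost(n))        # 499500
    print("ten groups:", partitioned_cost(n, k))  # 49500
```

For a quadratic algorithm, splitting into k equal groups cuts the comparison count by roughly a factor of k. Whether that saving survives in practice depends on how expensive the combine step is, which this sketch does not model.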

LIST partitioning is similar to range partitioning in many ways. As in partitioning by RANGE, each partition must be explicitly defined.

2621 questions
159
votes
13 answers

Is Zookeeper a must for Kafka?

In Kafka, I would like to use only a single broker, single topic and a single partition having one producer and multiple consumers (each consumer getting its own copy of data from the broker). Given this, I do not want the overhead of using…
Paaji
  • 1,999
  • 3
  • 12
  • 11
137
votes
5 answers

How to define partitioning of DataFrame?

I've started using Spark SQL and DataFrames in Spark 1.4.0. I'm wanting to define a custom partitioner on DataFrames, in Scala, but not seeing how to do this. One of the data tables I'm working with contains a list of transactions, by account,…
rake
  • 2,208
  • 3
  • 12
  • 11
82
votes
3 answers

How does HashPartitioner work?

I read up on the documentation of HashPartitioner. Unfortunately nothing much was explained except for the API calls. I am under the assumption that HashPartitioner partitions the distributed set based on the hash of the keys. For example if my data…
Sohaib
  • 4,058
  • 7
  • 35
  • 66
69
votes
5 answers

Pandas: Sampling a DataFrame

I'm trying to read a fairly large CSV file with Pandas and split it up into two random chunks, one of which being 10% of the data and the other being 90%. Here's my current attempt: rows = data.index row_count =…
Blender
  • 257,973
  • 46
  • 399
  • 459
68
votes
3 answers

What is MYSQL Partitioning?

I have read the documentation (http://dev.mysql.com/doc/refman/5.1/en/partitioning.html), but I would like, in your own words, what it is and why it is used. Is it mainly used for multiple servers so it doesn't drag down one server? So, part of…
TIMEX
  • 217,272
  • 324
  • 727
  • 1,038
63
votes
14 answers

Efficient way to divide a list into lists of n size

I have an ArrayList, which I want to divide into smaller List objects of n size, and perform an operation on each. My current method of doing this is implemented with ArrayList objects in Java. Any pseudocode will do. for (int i = 1; i <=…
Rowhawn
  • 1,309
  • 1
  • 15
  • 25
48
votes
8 answers

MySQL Partitioning / Sharding / Splitting - which way to go?

We have an InnoDB database that is about 70 GB and we expect it to grow to several hundred GB in the next 2 to 3 years. About 60 % of the data belong to a single table. Currently the database is working quite well as we have a server with 64 GB of…
sme
  • 5,563
  • 7
  • 30
  • 30
45
votes
7 answers

LINQ Partition List into Lists of 8 members

How would one take a List (using LINQ) and break it into a List of Lists partitioning the original list on every 8th entry? I imagine something like this would involve Skip and/or Take, but I'm still pretty new to LINQ. Edit: Using C# / .Net…
Pretzel
  • 7,699
  • 16
  • 52
  • 79
44
votes
2 answers

Handling very large data with mysql

Sorry for the long post! I have a database containing ~30 tables (InnoDB engine). Only two of these tables, namely, "transaction" and "shift" are quite large (the first one has 1.5 million rows and shift has 23k rows). Now everything works fine and…
mOna
  • 1,939
  • 9
  • 26
  • 51
35
votes
3 answers

How to partition and write DataFrame in Spark without deleting partitions with no new data?

I am trying to save a DataFrame to HDFS in Parquet format using DataFrameWriter, partitioned by three column values, like this: dataFrame.write.mode(SaveMode.Overwrite).partitionBy("eventdate", "hour", "processtime").parquet(path) As mentioned in…
jaywilson
  • 351
  • 1
  • 3
  • 5
34
votes
1 answer

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there is: The Spark Driver node (sparkDriverCount) The number of worker nodes available…
smeeb
  • 22,487
  • 41
  • 197
  • 389
33
votes
5 answers

Table with 80 million records and adding an index takes more than 18 hours (or forever)! Now what?

A short recap of what happened. I am working with 71 million records (not much compared to billions of records processed by others). On a different thread, someone suggested that the current setup of my cluster is not suitable for my need. My table…
Legend
  • 104,480
  • 109
  • 255
  • 385
32
votes
3 answers

How to understand the dynamic programming solution in linear partitioning?

I'm struggling to understand the dynamic programming solution to the linear partitioning problem. I am reading The Algorithm Design Manual and the problem is described in section 8.5. I've read the section countless times but I'm just not getting…
Benedict Cohen
  • 11,592
  • 7
  • 52
  • 65
29
votes
3 answers

Database - Designing an "Events" Table

After reading the tips from this great Nettuts+ article I've come up with a table schema that would separate highly volatile data from other tables subjected to heavy reads and at the same time lower the number of tables needed in the whole database…
Alix Axel
  • 141,486
  • 84
  • 375
  • 483
28
votes
4 answers

How many table partitions is too many in Postgres?

I'm partitioning a very large table that contains temporal data, and considering to what granularity I should make the partitions. The Postgres partition documentation claims that "large numbers of partitions are likely to increase query planning…
DNS
  • 34,791
  • 17
  • 84
  • 123