Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions tend to relate to infrastructure, algorithms, statistics, and data structures.

Several features set big data apart as a distinct concept:

Data

  • The data is too large to be processed on a single computer
  • Relationships between data elements are extremely complex

Algorithms

  • Local algorithms with worse-than-O(N) running time would likely take many years to finish
  • Fast distributed algorithms are used instead

Storage

  • The underlying data storage must be fault-tolerant and keep the data in a consistent state regardless of device failures
  • A single storage device cannot hold the entire data set

Eco-system

  • Big data also refers to the set of tools used to process huge amounts of data, known as the big data eco-system. Popular tools include HDFS, Spark, and MapReduce; a minimal Spark example is sketched below
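
As a minimal illustration of that eco-system, the sketch below uses PySpark to count words across files in HDFS; the input and output paths are hypothetical placeholders, and the same job could just as well read from S3 or a local file system.

    # Minimal PySpark sketch of a distributed aggregation (word count).
    # The HDFS paths are hypothetical; adjust to your own cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bigdata-wordcount").getOrCreate()

    lines = spark.sparkContext.textFile("hdfs:///data/logs/*.txt")
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # Write results back to distributed storage rather than collecting to the driver.
    counts.saveAsTextFile("hdfs:///data/wordcounts")
    spark.stop()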
7093 questions
82
votes
8 answers

Best way to delete millions of rows by ID

I need to delete about 2 million rows from my PG database. I have a list of IDs that I need to delete. However, any way I try to do this is taking days. I tried putting them in a table and doing it in batches of 100. 4 days later, this is still…
Anthony Greco
  • 2,745
  • 4
  • 23
  • 38
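
A pattern often suggested for the question above is to load the IDs into a temporary table and delete with a single set-based join instead of millions of per-row deletes. A minimal sketch using psycopg2 follows; the connection string, the table name big_table, and the column name id are assumptions, not the asker's actual schema.

    # Hedged sketch: bulk-delete rows matching a large ID list via a temp-table join.
    # Connection string and table/column names are hypothetical.
    import psycopg2
    from psycopg2.extras import execute_values

    ids = [101, 202, 303]  # placeholder for the ~2 million IDs to delete

    conn = psycopg2.connect("dbname=mydb")
    with conn, conn.cursor() as cur:
        cur.execute("CREATE TEMP TABLE ids_to_delete (id BIGINT PRIMARY KEY)")
        # execute_values batches the inserts so loading the ID list stays fast.
        execute_values(cur, "INSERT INTO ids_to_delete (id) VALUES %s",
                       [(i,) for i in ids], page_size=10000)
        # One set-based DELETE usually beats issuing millions of individual DELETEs.
        cur.execute("DELETE FROM big_table b USING ids_to_delete d WHERE b.id = d.id")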
70
votes
3 answers

Calculating and saving space in PostgreSQL

I have a table in pg like so: CREATE TABLE t ( a BIGSERIAL NOT NULL, -- 8 b b SMALLINT, -- 2 b c SMALLINT, -- 2 b d REAL, -- 4 b e REAL, …
punkish
  • 9,855
  • 20
  • 61
  • 86
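
The question above usually comes down to alignment padding: PostgreSQL pads each column to its type's alignment, so ordering columns from widest to narrowest alignment can shave bytes off every row. The sketch below is a deliberately simplified model (it ignores the tuple header and NULL bitmap), and the two column orders are hypothetical, not the asker's full table.

    # Simplified model of per-row data size under PostgreSQL alignment rules.
    # Ignores the ~23-byte tuple header and NULL bitmap; for comparison only.
    TYPES = {"bigserial": (8, 8), "smallint": (2, 2), "real": (4, 4)}  # (size, align)

    def row_size(columns):
        offset = 0
        for type_name in columns:
            size, align = TYPES[type_name]
            offset += (-offset) % align   # pad up to the column's alignment boundary
            offset += size
        return offset

    interleaved = ["bigserial", "smallint", "real", "smallint", "real"]
    grouped     = ["bigserial", "real", "real", "smallint", "smallint"]
    print(row_size(interleaved), row_size(grouped))   # 24 vs 20 bytes per row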
62
votes
1 answer

Working with big data in python and numpy, not enough ram, how to save partial results on disc?

I am trying to implement algorithms for 1000-dimensional data with 200k+ datapoints in Python. I want to use numpy, scipy, sklearn, networkx and other useful libraries. I want to perform operations such as pairwise distance between all of the…
Ekgren
  • 914
  • 1
  • 8
  • 12
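
A common answer to the question above is to back the arrays with np.memmap so partial results live on disk, and to compute pairwise distances block by block. A minimal sketch, assuming float32 data of shape (200000, 1000) and hypothetical file names:

    # Hedged sketch: disk-backed arrays plus blockwise pairwise distances.
    # Shapes, dtype, and file names are assumptions for illustration only.
    import numpy as np
    from scipy.spatial.distance import cdist

    n, d, block = 200_000, 1_000, 1_000
    X = np.memmap("data.dat", dtype="float32", mode="r", shape=(n, d))

    # Distances of one block of rows against the next; the result also lives on disk.
    D = np.memmap("dist_block.dat", dtype="float32", mode="w+", shape=(block, block))
    D[:] = cdist(X[:block], X[block:2 * block])
    D.flush()  # make sure the partial result is written out before moving on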
59
votes
5 answers

"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5-node Spark cluster on AWS EMR, each node sized m3.xlarge (1 master, 4 slaves). I successfully ran through a 146Mb bzip2 compressed CSV file and ended up with a perfectly aggregated result. Now I'm trying to process a ~5GB bzip2 CSV file on…
lauri108
  • 1,083
  • 1
  • 11
  • 16
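
The usual remedies for the error above are to raise the YARN memory overhead, keep the executor heap within the instance's limits, and repartition so each task handles less data. A hedged PySpark sketch; the values are illustrative, and spark.yarn.executor.memoryOverhead is the pre-2.3 property name (later spark.executor.memoryOverhead):

    # Hedged sketch: give YARN containers more off-heap headroom and spread work wider.
    # Memory sizes, overhead, partition count, and the S3 path are illustrative only.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("bzip2-aggregation")
             .config("spark.executor.memory", "8g")
             # Off-heap headroom that YARN accounts for; raise it when containers get killed.
             .config("spark.yarn.executor.memoryOverhead", "2048")
             .getOrCreate())

    df = spark.read.csv("s3://bucket/big-file.csv.bz2", header=True)
    df = df.repartition(400)   # more, smaller partitions -> smaller per-task footprint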
54
votes
4 answers

How to create a large pandas dataframe from an sql query without running out of memory?

I have trouble querying a table of > 5 million records from an MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting too much data into memory. This works: import pandas.io.sql as psql sql = "SELECT TOP…
slizb
  • 3,874
  • 3
  • 21
  • 22
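
For the question above, pandas can stream the result set instead of materializing it all at once by passing chunksize to read_sql. A minimal sketch; the SQLAlchemy connection string, query, and per-chunk work are assumptions:

    # Hedged sketch: stream a large MS SQL result set in chunks with pandas.
    # Connection string, query, and chunk size are illustrative assumptions.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("mssql+pyodbc://user:password@my_dsn")
    chunks = pd.read_sql("SELECT * FROM MyTable", engine, chunksize=100_000)

    total_rows = 0
    for chunk in chunks:
        total_rows += len(chunk)   # replace with whatever per-chunk processing you need
    print(total_rows)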
54
votes
12 answers

HBase quickly count number of rows

Right now I implement row count over ResultScanner like this: for (Result rs = scanner.next(); rs != null; rs = scanner.next()) { number++; } Once the data reaches millions of rows, the computation takes a long time. I want to compute the count in real time; I don't want to…
cldo
  • 1,685
  • 6
  • 21
  • 26
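
Answers to the question above usually point to the RowCounter MapReduce job or a coprocessor-based count; a lighter-weight alternative is a key-only scan that avoids transferring cell values. The sketch below uses happybase through the Thrift gateway; the host and table name are assumptions.

    # Hedged sketch: count rows with a key-only scan through the HBase Thrift gateway.
    # Host and table name are hypothetical; for very large tables the RowCounter
    # MapReduce job or a coprocessor aggregate is usually faster.
    import happybase

    connection = happybase.Connection("hbase-thrift-host")
    table = connection.table("my_table")

    count = 0
    for _ in table.scan(filter="FirstKeyOnlyFilter() AND KeyOnlyFilter()",
                        batch_size=10_000):
        count += 1
    print(count)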
45
votes
4 answers

Spark parquet partitioning: Large number of files

I am trying to leverage Spark partitioning. I was trying to do something like data.write.partitionBy("key").parquet("/location") The issue here is that each partition creates a huge number of parquet files, which results in slow reads if I am trying to read from…
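
A frequently suggested fix for the question above is to repartition by the same key before writing, so each on-disk partition directory receives one (or a few) large files instead of one file per in-memory partition. A hedged sketch; the input path is an assumption, while the key column and output location come from the question:

    # Hedged sketch: align in-memory partitioning with the on-disk partition key.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioned-write").getOrCreate()
    data = spark.read.parquet("/staging/input")   # hypothetical source

    # One shuffle so rows sharing a key land in the same partition before the write;
    # each key directory then gets few large parquet files instead of thousands of tiny ones.
    (data.repartition("key")
         .write
         .mode("overwrite")
         .partitionBy("key")
         .parquet("/location"))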
43
votes
3 answers

Is there something like Redis DB, but not limited with RAM size?

I'm looking for a database matching these criteria: it may be non-persistent; almost all keys need to be updated once every 3-6 hours (100M+ keys with a total size of 100 GB); and it must be able to quickly select data by key (or primary key). This needs to be a…
Andrey
  • 439
  • 1
  • 4
  • 5
42
votes
4 answers

sklearn and large datasets

I have a dataset of 22 GB. I would like to process it on my laptop. Of course I can't load it into memory. I use sklearn a lot, but for much smaller datasets. In this situation the classical approach would be something like: read only part of the…
Donbeo
  • 14,217
  • 30
  • 93
  • 162
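
The standard answer to the question above is out-of-core learning: stream the file in chunks and train an estimator that supports partial_fit. A minimal sketch; the file name, label column, and class labels are assumptions:

    # Hedged sketch: out-of-core training over a 22 GB CSV with partial_fit.
    # File name, column layout, and class labels are illustrative assumptions.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier()
    classes = np.array([0, 1])   # partial_fit needs the full label set on the first call

    for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
        X = chunk.drop(columns=["label"]).to_numpy()
        y = chunk["label"].to_numpy()
        clf.partial_fit(X, y, classes=classes)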
41
votes
2 answers

How to get started with Big Data Analysis

I've been a long time user of R and have recently started working with Python. Using conventional RDBMS systems for data warehousing, and R/Python for number-crunching, I feel the need now to get my hands dirty with Big Data Analysis. I'd like to…
harshsinghal
  • 3,600
  • 8
  • 31
  • 32
41
votes
5 answers

Recommended package for very large dataset processing and machine learning in R

It seems like R is really designed to handle datasets that it can pull entirely into memory. What R packages are recommended for signal processing and machine learning on very large datasets that can not be pulled into memory? If R is simply the…
user334911
38
votes
1 answer

How can I tell when my dataset in R is going to be too large?

I am going to be undertaking some logfile analyses in R (unless I can't do it in R), and I understand that my data needs to fit in RAM (unless I use some kind of fix like an interface to a keyval store, maybe?). So I am wondering how to tell ahead…
Heather Stark
  • 607
  • 7
  • 18
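
A language-agnostic rule of thumb for the question above is rows × columns × 8 bytes for double-precision data, plus several-fold headroom for the copies R makes during analysis; shown here as a quick Python calculation with hypothetical dimensions and a conservative 3x factor.

    # Back-of-the-envelope estimate of the in-memory size of a numeric table.
    # 8 bytes per double; the 3x headroom factor is a conservative assumption.
    rows, cols = 10_000_000, 20
    raw_bytes = rows * cols * 8
    print(f"raw: {raw_bytes / 1e9:.1f} GB, with working copies: ~{3 * raw_bytes / 1e9:.1f} GB")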
37
votes
1 answer

What methods can we use to reshape VERY large data sets?

When calculations on very large data take a long time and we therefore don't want them to crash, it is valuable to know beforehand which reshape method to use. Lately, methods for reshaping data have been further developed regarding…
jay.sf
  • 33,483
  • 5
  • 39
  • 75
37
votes
3 answers

Machine Learning & Big Data

To begin, I would like to describe my current position and the goal I would like to achieve. I am a researcher dealing with machine learning. So far I have gone through several theoretical courses covering machine learning algorithms and…
Niko Gamulin
  • 63,517
  • 91
  • 213
  • 274
34
votes
1 answer

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there is: The Spark Driver node (sparkDriverCount) The number of worker nodes available…
smeeb
  • 22,487
  • 41
  • 197
  • 389
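
A common heuristic for the question above is to target two to four tasks per core and roughly 100-200 MB of input per partition; both numbers are rules of thumb, not anything Spark enforces. A hedged sketch with a hypothetical input size and source path:

    # Hedged sketch: derive a partition count from cluster cores and data volume.
    # The 3-tasks-per-core and ~128 MB-per-partition targets are rule-of-thumb assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-sizing").getOrCreate()
    sc = spark.sparkContext

    total_cores = sc.defaultParallelism         # typically executors * cores per executor
    input_bytes = 50 * 1024 ** 3                # hypothetical 50 GB of input

    by_cores = total_cores * 3
    by_size = input_bytes // (128 * 1024 ** 2)  # ~128 MB per partition
    num_partitions = int(max(by_cores, by_size))

    df = spark.read.parquet("/data/events")     # hypothetical source
    df = df.repartition(num_partitions)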