Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions tend to relate to infrastructure, algorithms, statistics, and data structures.

Several features set big data apart as a distinct concept:

Data

  • The data is too large to be processed on a single computer
  • Relationships between data elements are extremely complex

Algorithms

  • Local algorithms with worse-than-O(N) running time would likely take many years to finish
  • Fast distributed algorithms are used instead

Storage

  • The underlying data storage must be fault-tolerant and keep the data in a consistent state regardless of device failures
  • A single storage device cannot hold the entire data set

Eco-system

  • Big data also refers to the set of tools used to process huge amounts of data, known as the big data eco-system. Popular tools include HDFS, Spark, and MapReduce; a minimal Spark example is sketched below
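
As a minimal illustration of that eco-system, the sketch below uses PySpark to count words across files in HDFS; the input and output paths are hypothetical placeholders, and the same job could just as well read from S3 or a local file system.

    # Minimal PySpark sketch of a distributed aggregation (word count).
    # The HDFS paths are hypothetical; adjust to your own cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bigdata-wordcount").getOrCreate()

    lines = spark.sparkContext.textFile("hdfs:///data/logs/*.txt")
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # Write results back to distributed storage rather than collecting to the driver.
    counts.saveAsTextFile("hdfs:///data/wordcounts")
    spark.stop()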
7093 questions
82
votes
8 answers

Best way to delete millions of rows by ID

I need to delete about 2 million rows from my PG database. I have a list of IDs that I need to delete. However, any way I try to do this is taking days. I tried putting them in a table and doing it in batches of 100. 4 days later, this is still…
Anthony Greco
  • 2,745
  • 4
  • 23
  • 38
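
A pattern often suggested for the question above is to load the IDs into a temporary table and delete with a single set-based join instead of millions of per-row deletes. A minimal sketch using psycopg2 follows; the connection string, the table name big_table, and the column name id are assumptions, not the asker's actual schema.

    # Hedged sketch: bulk-delete rows matching a large ID list via a temp-table join.
    # Connection string and table/column names are hypothetical.
    import psycopg2
    from psycopg2.extras import execute_values

    ids = [101, 202, 303]  # placeholder for the ~2 million IDs to delete

    conn = psycopg2.connect("dbname=mydb")
    with conn, conn.cursor() as cur:
        cur.execute("CREATE TEMP TABLE ids_to_delete (id BIGINT PRIMARY KEY)")
        # execute_values batches the inserts so loading the ID list stays fast.
        execute_values(cur, "INSERT INTO ids_to_delete (id) VALUES %s",
                       [(i,) for i in ids], page_size=10000)
        # One set-based DELETE usually beats issuing millions of individual DELETEs.
        cur.execute("DELETE FROM big_table b USING ids_to_delete d WHERE b.id = d.id")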
70
votes
3 answers

Calculating and saving space in PostgreSQL

I have a table in pg like so: CREATE TABLE t ( a BIGSERIAL NOT NULL, -- 8 b b SMALLINT, -- 2 b c SMALLINT, -- 2 b d REAL, -- 4 b e REAL, …
punkish
  • 9,855
  • 20
  • 61
  • 86
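
The question above usually comes down to alignment padding: PostgreSQL pads each column to its type's alignment, so ordering columns from widest to narrowest alignment can shave bytes off every row. The sketch below is a deliberately simplified model (it ignores the tuple header and NULL bitmap), and the two column orders are hypothetical, not the asker's full table.

    # Simplified model of per-row data size under PostgreSQL alignment rules.
    # Ignores the ~23-byte tuple header and NULL bitmap; for comparison only.
    TYPES = {"bigserial": (8, 8), "smallint": (2, 2), "real": (4, 4)}  # (size, align)

    def row_size(columns):
        offset = 0
        for type_name in columns:
            size, align = TYPES[type_name]
            offset += (-offset) % align   # pad up to the column's alignment boundary
            offset += size
        return offset

    interleaved = ["bigserial", "smallint", "real", "smallint", "real"]
    grouped     = ["bigserial", "real", "real", "smallint", "smallint"]
    print(row_size(interleaved), row_size(grouped))   # 24 vs 20 bytes per row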
62
votes
1 answer

Working with big data in python and numpy, not enough ram, how to save partial results on disc?

I am trying to implement algorithms for 1000-dimensional data with 200k+ datapoints in Python. I want to use numpy, scipy, sklearn, networkx and other useful libraries. I want to perform operations such as pairwise distance between all of the…
Ekgren
  • 914
  • 1
  • 8
  • 12
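
A common answer to the question above is to back the arrays with np.memmap so partial results live on disk, and to compute pairwise distances block by block. A minimal sketch, assuming float32 data of shape (200000, 1000) and hypothetical file names:

    # Hedged sketch: disk-backed arrays plus blockwise pairwise distances.
    # Shapes, dtype, and file names are assumptions for illustration only.
    import numpy as np
    from scipy.spatial.distance import cdist

    n, d, block = 200_000, 1_000, 1_000
    X = np.memmap("data.dat", dtype="float32", mode="r", shape=(n, d))

    # Distances of one block of rows against the next; the result also lives on disk.
    D = np.memmap("dist_block.dat", dtype="float32", mode="w+", shape=(block, block))
    D[:] = cdist(X[:block], X[block:2 * block])
    D.flush()  # make sure the partial result is written out before moving on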
59
votes
5 answers

"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5-node Spark cluster on AWS EMR, each node sized m3.xlarge (1 master, 4 slaves). I successfully ran through a 146Mb bzip2 compressed CSV file and ended up with a perfectly aggregated result. Now I'm trying to process a ~5GB bzip2 CSV file on…
lauri108
  • 1,083
  • 1
  • 11
  • 16
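
The usual remedies for the error above are to raise the YARN memory overhead, keep the executor heap within the instance's limits, and repartition so each task handles less data. A hedged PySpark sketch; the values are illustrative, and spark.yarn.executor.memoryOverhead is the pre-2.3 property name (later spark.executor.memoryOverhead):

    # Hedged sketch: give YARN containers more off-heap headroom and spread work wider.
    # Memory sizes, overhead, partition count, and the S3 path are illustrative only.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("bzip2-aggregation")
             .config("spark.executor.memory", "8g")
             # Off-heap headroom that YARN accounts for; raise it when containers get killed.
             .config("spark.yarn.executor.memoryOverhead", "2048")
             .getOrCreate())

    df = spark.read.csv("s3://bucket/big-file.csv.bz2", header=True)
    df = df.repartition(400)   # more, smaller partitions -> smaller per-task footprint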
54
votes
4 answers

How to create a large pandas dataframe from an sql query without running out of memory?

I have trouble querying a table of > 5 million records from an MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting too much data into memory. This works: import pandas.io.sql as psql sql = "SELECT TOP…
slizb
  • 3,874
  • 3
  • 21
  • 22
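
For the question above, pandas can stream the result set instead of materializing it all at once by passing chunksize to read_sql. A minimal sketch; the SQLAlchemy connection string, query, and per-chunk work are assumptions:

    # Hedged sketch: stream a large MS SQL result set in chunks with pandas.
    # Connection string, query, and chunk size are illustrative assumptions.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("mssql+pyodbc://user:password@my_dsn")
    chunks = pd.read_sql("SELECT * FROM MyTable", engine, chunksize=100_000)

    total_rows = 0
    for chunk in chunks:
        total_rows += len(chunk)   # replace with whatever per-chunk processing you need
    print(total_rows)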
54
votes
12 answers

HBase quickly count number of rows

Right now I implement row count over ResultScanner like this: for (Result rs = scanner.next(); rs != null; rs = scanner.next()) { number++; } Once the data reaches millions of rows, the computation takes a long time. I want to compute the count in real time; I don't want to…
cldo
  • 1,685
  • 6
  • 21
  • 26
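
Answers to the question above usually point to the RowCounter MapReduce job or a coprocessor-based count; a lighter-weight alternative is a key-only scan that avoids transferring cell values. The sketch below uses happybase through the Thrift gateway; the host and table name are assumptions.

    # Hedged sketch: count rows with a key-only scan through the HBase Thrift gateway.
    # Host and table name are hypothetical; for very large tables the RowCounter
    # MapReduce job or a coprocessor aggregate is usually faster.
    import happybase

    connection = happybase.Connection("hbase-thrift-host")
    table = connection.table("my_table")

    count = 0
    for _ in table.scan(filter="FirstKeyOnlyFilter() AND KeyOnlyFilter()",
                        batch_size=10_000):
        count += 1
    print(count)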
45
votes
4 answers

Spark parquet partitioning: Large number of files

I am trying to leverage Spark partitioning. I was trying to do something like data.write.partitionBy("key").parquet("/location") The issue here is that each partition creates a huge number of parquet files, which results in slow reads if I am trying to read from…
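
A frequently suggested fix for the question above is to repartition by the same key before writing, so each on-disk partition directory receives one (or a few) large files instead of one file per in-memory partition. A hedged sketch; the input path is an assumption, while the key column and output location come from the question:

    # Hedged sketch: align in-memory partitioning with the on-disk partition key.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioned-write").getOrCreate()
    data = spark.read.parquet("/staging/input")   # hypothetical source

    # One shuffle so rows sharing a key land in the same partition before the write;
    # each key directory then gets few large parquet files instead of thousands of tiny ones.
    (data.repartition("key")
         .write
         .mode("overwrite")
         .partitionBy("key")
         .parquet("/location"))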
43
votes
3 answers

Is there something like Redis DB, but not limited with RAM size?

I'm looking for a database matching these criteria: it may be non-persistent; almost all keys need to be updated once every 3-6 hours (100M+ keys with a total size of 100 GB); and it must be able to quickly select data by key (or primary key). This needs to be a…
Andrey
  • 439
  • 1
  • 4
  • 5
42
votes
4 answers

sklearn and large datasets

I have a dataset of 22 GB. I would like to process it on my laptop. Of course I can't load it into memory. I use sklearn a lot, but for much smaller datasets. In this situation the classical approach would be something like: read only part of the…
Donbeo
  • 14,217
  • 30
  • 93
  • 162
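
The standard answer to the question above is out-of-core learning: stream the file in chunks and train an estimator that supports partial_fit. A minimal sketch; the file name, label column, and class labels are assumptions:

    # Hedged sketch: out-of-core training over a 22 GB CSV with partial_fit.
    # File name, column layout, and class labels are illustrative assumptions.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier()
    classes = np.array([0, 1])   # partial_fit needs the full label set on the first call

    for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
        X = chunk.drop(columns=["label"]).to_numpy()
        y = chunk["label"].to_numpy()
        clf.partial_fit(X, y, classes=classes)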
41
votes
2 answers

How to get started with Big Data Analysis

I've been a long time user of R and have recently started working with Python. Using conventional RDBMS systems for data warehousing, and R/Python for number-crunching, I feel the need now to get my hands dirty with Big Data Analysis. I'd like to…
harshsinghal
  • 3,600
  • 8
  • 31
  • 32
41
votes
5 answers

Recommended package for very large dataset processing and machine learning in R

It seems like R is really designed to handle datasets that it can pull entirely into memory. What R packages are recommended for signal processing and machine learning on very large datasets that can not be pulled into memory? If R is simply the…
user334911
38
votes
1 answer

How can I tell when my dataset in R is going to be too large?

I am going to be undertaking some logfile analyses in R (unless I can't do it in R), and I understand that my data needs to fit in RAM (unless I use some kind of fix like an interface to a keyval store, maybe?). So I am wondering how to tell ahead…
Heather Stark
  • 607
  • 7
  • 18
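
A language-agnostic rule of thumb for the question above is rows × columns × 8 bytes for double-precision data, plus several-fold headroom for the copies R makes during analysis; shown here as a quick Python calculation with hypothetical dimensions and a conservative 3x factor.

    # Back-of-the-envelope estimate of the in-memory size of a numeric table.
    # 8 bytes per double; the 3x headroom factor is a conservative assumption.
    rows, cols = 10_000_000, 20
    raw_bytes = rows * cols * 8
    print(f"raw: {raw_bytes / 1e9:.1f} GB, with working copies: ~{3 * raw_bytes / 1e9:.1f} GB")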
37
votes
1 answer

What methods can we use to reshape VERY large data sets?

When calculations on very large data take a long time and we therefore don't want them to crash, it is valuable to know beforehand which reshape method to use. Lately, methods for reshaping data have been further developed regarding…
jay.sf
  • 33,483
  • 5
  • 39
  • 75
37
votes
3 answers

Machine Learning & Big Data

To begin, I would like to describe my current position and the goal I would like to achieve. I am a researcher dealing with machine learning. So far I have gone through several theoretical courses covering machine learning algorithms and…
Niko Gamulin
  • 63,517
  • 91
  • 213
  • 274
34
votes
1 answer

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there is: The Spark Driver node (sparkDriverCount) The number of worker nodes available…
smeeb
  • 22,487
  • 41
  • 197
  • 389
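
A common heuristic for the question above is to target two to four tasks per core and roughly 100-200 MB of input per partition; both numbers are rules of thumb, not anything Spark enforces. A hedged sketch with a hypothetical input size and source path:

    # Hedged sketch: derive a partition count from cluster cores and data volume.
    # The 3-tasks-per-core and ~128 MB-per-partition targets are rule-of-thumb assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-sizing").getOrCreate()
    sc = spark.sparkContext

    total_cores = sc.defaultParallelism         # typically executors * cores per executor
    input_bytes = 50 * 1024 ** 3                # hypothetical 50 GB of input

    by_cores = total_cores * 3
    by_size = input_bytes // (128 * 1024 ** 2)  # ~128 MB per partition
    num_partitions = int(max(by_cores, by_size))

    df = spark.read.parquet("/data/events")     # hypothetical source
    df = df.repartition(num_partitions)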