Questions tagged [bigdata]

Big data is a concept that deals with data sets of extreme volumes. Questions tend to relate to infrastructure, algorithms, statistics, and data structures.

Several features set big data apart as a distinct concept:

Data

  • Data is so large it cannot be processed on a single computer
  • Relationship between data elements is extremely complex

Algorithms

  • Local algorithms that take longer than O(N) to compute are likely to take many years to finish
  • Fast distributed algorithms are used instead
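
To make the "many years" claim concrete, here is a rough back-of-the-envelope sketch. The data set size (one billion elements) and the throughput (one billion simple operations per second on a single machine) are illustrative assumptions, not figures from the tag description.

```python
# Rough estimate of single-machine running time for O(N) vs O(N^2) work.
# ASSUMPTIONS (hypothetical): N = 10**9 elements and 10**9 simple
# operations per second; real workloads will differ.

SECONDS_PER_YEAR = 365 * 24 * 3600


def estimated_seconds(operations: float, ops_per_second: float = 1e9) -> float:
    """Return the time in seconds needed to run `operations` steps."""
    return operations / ops_per_second


n = 10**9
linear_seconds = estimated_seconds(n)                          # O(N) work
quadratic_years = estimated_seconds(n**2) / SECONDS_PER_YEAR   # O(N^2) work

print(f"O(N):   about {linear_seconds:.0f} second(s)")
print(f"O(N^2): about {quadratic_years:.1f} years")
```

Under these assumptions the linear pass finishes in about a second, while the quadratic one needs roughly three decades, which is why super-linear local algorithms are usually replaced by distributed ones.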

Storage

  • The underlying data storage must be fault-tolerant and keep data in a consistent state regardless of device failures
  • A single storage device cannot hold the entire data set

Eco-system

  • Big data is also synonymous with the set of tools used to process huge amounts of data, known as the big data ecosystem. Popular tools include HDFS, Spark, and MapReduce.
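
As an illustration of the programming model behind tools such as MapReduce, the following is a minimal single-process sketch of the map, shuffle, and reduce phases applied to word counting. It is not Hadoop's or Spark's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are hypothetical, and a real framework would run the phases across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(chunk: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one chunk of text."""
    for word in chunk.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as a framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped values to get one count per word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks: List[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over all chunks."""
    mapped = (pair for chunk in chunks for pair in map_phase(chunk))
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical input: each string stands in for a block of a large file.
chunks = ["big data big tools", "data tools data"]
counts = word_count(chunks)
print(counts)
```

Because each chunk is mapped independently and each key is reduced independently, both phases can be spread over many workers, which is what makes the pattern suitable for data that does not fit on one machine.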
7093 questions
2
votes
1 answer

Elastic search one index or multiple index for same data

I'm building an application which could greatly benefit from ElasticSearch. In my current version I'm using 1 single index: "messages" with just 1 type: "message". Messages are composed of the following format (averaging 10kb): messages - id -…
Floris
  • 279
  • 3
  • 16
2
votes
1 answer

The best way to filter large data sets

I have a query about how to filter relevant records from a large data set of financial transactions. We use Oracle 11g database and one of the requirements is to produce various end-of-day reports with all sorts of criteria. The relevant tables…
JVC
  • 23
  • 4
2
votes
2 answers

What kind of NoSQL storage should we use?

We are an IoT company that provides services for transportation and logistics companies. As an infrastructure service provider we offer GPS tracking devices to our clients. Although the format of GPS tracking data is very neat (gpsId, longitude,…
Wuaner
  • 833
  • 1
  • 13
  • 29
2
votes
2 answers

Filtering in pig by concatenating two column

I have two tables in the following format Table 1: com_Data #cc bb mm# 41 22 2563 42 24 3562 Table 2: #name cid# sasi 41-22-2563 soman 42-47-2562 I want to combine the three columns cc bb mm from table 1 and need to filter out all the column…
Anas A
  • 199
  • 3
  • 19
2
votes
1 answer

Grails App with Huge Tables

I'm trying to create a database from existing csv files that are about 20,000 columns wide and 700 rows deep. In grails I would like the 20,000 column domain to belongTo another simpler domain (about 200 columns). But upon compilation I get:…
janDro
  • 1,186
  • 1
  • 9
  • 23
2
votes
2 answers

Transform data in Google bigquery - extract text, split it into multiple columns and pivoting the data

I have some weblog data in BigQuery which I need to transform to make it easier to use and query. The data looks like: I want to extract and transform the data within the curly brackets after Results{…..} (colored blue). The data is of the form…
2
votes
2 answers

Subtract all pairs of values from two arrays

I have two vectors, v1 and v2. I'd like to subtract each value of v2 from each value of v1 and store the results in another vector. I also would like to work with very large vectors (e.g. 1e6 size), so I think I should be using numpy for…
jpcgandre
  • 1,427
  • 4
  • 29
  • 53
2
votes
1 answer

Neo4j Relationship Index - Search on relationship property

I've got a neo4j graph with the following structure. (Account) ---[Transaction]--- (Account) Transaction is a neo4j relationship and Account is a node. Various properties are set on each transaction, such as the transaction ID, amount, date,…
Imme22009
  • 3,474
  • 5
  • 27
  • 47
2
votes
2 answers

Function awfully slow

I was looking for historical data from our Brazilian stock market and found it at Bovespa's website. The problem is the format the data is in is terrible, it is mingled with all sorts of other information about any particular stock! So far so good!…
Luis Dos Reis
  • 363
  • 3
  • 15
2
votes
3 answers

Hadoop use-case scenario

I would like to have some expert views on the use of a Big Data platform like Hadoop in one of my project scenarios. I am a complete novice in this technology although I understand databases like MySQL well. We are creating a product which would be…
user1826116
  • 275
  • 1
  • 3
  • 14
2
votes
1 answer

How to append the output of Parallel Grep to a file?

I have a file of 500 MB, and a pattern file of 20 MB. Since it was taking too much time to grep the 1.2 million patterns from the file with 5 million lines, I split the pattern file into 100 parts. I tried to run grep in parallel with the multiple…
Rohit
  • 85
  • 1
  • 4
2
votes
1 answer

Please help what is the necessity of Shuffle and Sorting in Hadoop?

In a normal wordcount program in MapReduce, do we need to set any method for shuffle and sort, or will the framework take care of this?
shakti
  • 23
  • 3
2
votes
1 answer

How to train neural networks on big sample sets in Matlab?

I am trying to train a neural network on a big training set. The inputs consist of approx. 4 million columns and 128 rows, and the targets consist of 62 rows. hiddenLayerSize is 128. The script is as follows: net =…
Suzan Cioc
  • 26,725
  • 49
  • 190
  • 355
2
votes
1 answer

How to handle large amount of documents stored in a database?

I am working on an application where users can scan/upload documents. The application processes those documents and stores them in the database. We are using a MySQL database. Right now we have more than 200,000 documents in the database. So we are…
pan1490
  • 919
  • 1
  • 6
  • 22
2
votes
0 answers

Same data in two different databases to improve query performance

I'm storing social data in Elasticsearch, but it's difficult to query without any kind of joins. So I'm considering a possible approach: All docs in Elasticsearch. Complete docs with all info. All relations in neo4j. Only queryable data (dates,…
user3175226
  • 3,159
  • 5
  • 26
  • 41