Questions tagged [large-data-volumes]

290 questions
6 votes • 11 answers

Advice on handling large data volumes

So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once. Any advice on storing/loading the data? I've thought of converting…
Jake • 14,329 • 20 • 64 • 85
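A common answer to this one: convert the ASCII numbers to a packed binary format once, then stream the binary file on every later pass. A minimal sketch with numpy, assuming whitespace-separated floats; the file names are hypothetical, and for truly huge inputs the one-time conversion would itself need to be chunked:

    import numpy as np

    # One-time conversion: parse the ASCII numbers, dump them as raw float64.
    values = np.loadtxt("data.txt")              # whitespace-separated numbers
    values.astype(np.float64).tofile("data.bin")

    # Later passes: memory-map the binary file and walk it sequentially
    # without loading the whole array into RAM.
    data = np.memmap("data.bin", dtype=np.float64, mode="r")
    total = 0.0
    for start in range(0, len(data), 1_000_000):
        chunk = data[start:start + 1_000_000]
        total += chunk.sum()                     # stand-in for the real processing
    print(total)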
6 votes • 5 answers

How can I determine the difference between two large datasets?

I have large datasets with millions of records in XML format. These datasets are full data dumps of a database up to a certain point in time. Between two dumps new entries might have been added and existing ones might have been modified or deleted.…
NullUserException • 77,975 • 25 • 199 • 226
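The usual streaming approach: make one pass over each dump with an incremental parser, record a content hash per record id, and diff the two maps. A sketch with ElementTree's iterparse; the record tag and id attribute are assumptions about the XML layout:

    import hashlib
    import xml.etree.ElementTree as ET

    def fingerprints(path):
        """Map each record's id to a hash of its serialized form, streaming."""
        fp = {}
        for _, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "record":                  # hypothetical element name
                fp[elem.get("id")] = hashlib.md5(ET.tostring(elem)).hexdigest()
                elem.clear()                          # release memory as we go
        return fp

    old, new = fingerprints("dump_old.xml"), fingerprints("dump_new.xml")
    added    = new.keys() - old.keys()
    deleted  = old.keys() - new.keys()
    modified = {k for k in old.keys() & new.keys() if old[k] != new[k]}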
6 votes • 5 answers

mysql tables structure - one very large table or separate tables?

I'm working on a project which is similar in nature to website visitor analysis. It will be used by hundreds of websites, each averaging tens of thousands to hundreds of thousands of page views a day, so the data volume will be very large. Should I use a single table with…
Nir • 22,471 • 25 • 78 • 114
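The answer that usually wins here is a single table with a site-id column and a composite index, not a table per site. A sketch of that shape, using sqlite3 purely as a runnable stand-in for MySQL; the column names are hypothetical:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page_views (
            site_id   INTEGER NOT NULL,   -- which customer site the hit belongs to
            viewed_at TEXT    NOT NULL,   -- timestamp of the page view
            url       TEXT    NOT NULL
        );
        -- The composite index keeps per-site queries fast in one shared table.
        CREATE INDEX idx_site_time ON page_views (site_id, viewed_at);
    """)
    conn.execute("INSERT INTO page_views VALUES (?, ?, ?)",
                 (42, "2011-06-01T12:00:00", "/index.html"))
    print(conn.execute("SELECT COUNT(*) FROM page_views"
                       " WHERE site_id = 42").fetchone())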
6 votes • 5 answers

Practical size limitations for RDBMS

I am working on a project that must store very large datasets and associated reference data. I have never come across a project that required tables quite this large. I have proved that at least one development environment cannot cope at the…
grenade • 28,964 • 22 • 90 • 125
6 votes • 5 answers

Processing apache logs quickly

I'm currently running an awk script to process a large (8.1GB) access-log file, and it's taking forever to finish. In 20 minutes, it wrote 14MB of the (1000 ± 500)MB I expect it to write, and I wonder if I can process it much faster somehow. Here…
konr • 1,153 • 12 • 25
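Without seeing the script it's hard to say where the time goes (unbuffered output and regex field splitting are frequent culprits), but a streaming rewrite is one common escape hatch. A minimal sketch that tallies hits per URL from a common-log-format file; the field position is an assumption:

    from collections import Counter

    hits = Counter()
    with open("access.log", "rb") as f:    # bytes mode skips decoding overhead
        for line in f:
            parts = line.split(b" ")
            if len(parts) > 6:             # in common log format, field 7 is the URL
                hits[parts[6]] += 1

    for url, n in hits.most_common(10):
        print(n, url.decode("ascii", "replace"))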
5 votes • 5 answers

"Simulating" a 64-bit integer with two 32-bit integers

I'm writing a very computationally intense procedure for a mobile device and I'm limited to 32-bit CPUs. In essence, I'm performing dot products of huge sets of data (>12k signed 16-bit integers). Floating point operations are just too slow, so I've…
Phonon • 12,013 • 12 • 57 • 111
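The heart of the technique is adding the low 32-bit words, detecting the carry, and folding it into the high words. A sketch in Python with explicit masking to mimic 32-bit registers:

    MASK32 = 0xFFFFFFFF

    def add64(a_hi, a_lo, b_hi, b_lo):
        """Add two 64-bit values held as (hi, lo) pairs of unsigned 32-bit words."""
        lo = (a_lo + b_lo) & MASK32
        carry = 1 if lo < a_lo else 0      # the low word wrapped around
        hi = (a_hi + b_hi + carry) & MASK32
        return hi, lo

    # 0xFFFFFFFF + 1 must carry into the high word.
    print(add64(0, 0xFFFFFFFF, 0, 1))      # -> (1, 0)

For the dot-product case, each 16-bit by 16-bit product fits in 32 bits; it is only the running sum that needs the simulated 64-bit accumulator.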
5 votes • 4 answers

NTFS directory has 100K entries. How much performance boost if spread over 100 subdirectories?

Context: We have a homegrown filesystem-backed caching library. We currently have performance problems with one installation due to the large number of entries (e.g. up to 100,000). The problem: we store all fs entries in one "cache directory". Very…
user331465 • 2,864 • 11 • 42 • 67
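The standard fix is to fan entries out over subdirectories chosen by a stable hash of the key, so no single directory grows unbounded. A sketch; the fanout of 100 matches the question, and the paths are hypothetical:

    import hashlib
    import os

    def cache_path(root, key, fanout=100):
        """Spread cache entries over `fanout` subdirectories via a stable hash."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        subdir = str(int(digest[:8], 16) % fanout)
        os.makedirs(os.path.join(root, subdir), exist_ok=True)
        return os.path.join(root, subdir, digest)

    print(cache_path("cache", "http://example.com/page1"))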
5 votes • 7 answers

Large primary key: 1+ billion rows MySQL + InnoDB?

I was wondering whether InnoDB would be the best way to format the table. The table contains one field, the primary key, and it will get 816k rows a day (est.). This will get very large very quickly! I'm working on a file storage way (would this be…
James Hartig • 997 • 1 • 9 • 20
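Back-of-the-envelope arithmetic on the quoted insert rate shows why the storage engine choice matters; at 816k rows a day the table crosses a billion rows in under four years:

    rows_per_day = 816_000
    for years in (1, 2, 5):
        print(years, "year(s):", f"{rows_per_day * 365 * years:,}", "rows")
    # 1 year(s): 297,840,000 rows
    # 2 year(s): 595,680,000 rows
    # 5 year(s): 1,489,200,000 rows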
5 votes • 1 answer

How to pick a chunksize for python multiprocessing with large datasets

I am attempting to use Python to gain some performance on a task that can be highly parallelized using http://docs.python.org/library/multiprocessing. Looking at the library, the docs say to use a chunksize for very long iterables. Now, my…
Sandro • 2,091 • 4 • 24 • 41
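For reference, when chunksize is omitted, Pool.map picks one itself: roughly four chunks per worker process. Reproducing that heuristic explicitly is a reasonable starting point before tuning:

    import multiprocessing as mp

    def work(x):
        return x * x

    if __name__ == "__main__":
        data = range(1_000_000)
        nproc = mp.cpu_count()
        # Mirrors the stdlib default: ~4 chunks per worker, so dispatch and
        # pickling costs are amortized over many items per task.
        chunksize, extra = divmod(len(data), nproc * 4)
        if extra:
            chunksize += 1
        with mp.Pool(nproc) as pool:
            results = pool.map(work, data, chunksize=chunksize)
        print(len(results))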
5 votes • 1 answer

MySql: Operate on Many Rows Using Long List of Composite PKs

What's a good way to work with many rows in MySql, given that I have a long list of keys in a client application that is connecting with ODBC? Note: my experience is largely SQL Server, so I know a bit, just not MySQL specifically. The task is to…
ErikE • 43,574 • 19 • 137 • 181
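The usual pattern for a long key list is to bulk-insert the keys into a temporary table and join against it, rather than building an enormous IN (...) clause. A sketch against a DB-API cursor; `conn`, the table names, and the columns are all hypothetical:

    def update_by_composite_keys(conn, key_pairs):
        """key_pairs is a list of (part_a, part_b) composite-key tuples."""
        cur = conn.cursor()
        cur.execute("""CREATE TEMPORARY TABLE wanted (
                           part_a INT NOT NULL,
                           part_b INT NOT NULL,
                           PRIMARY KEY (part_a, part_b))""")
        # Batch the key list into the temp table in one round trip per batch.
        cur.executemany("INSERT INTO wanted VALUES (%s, %s)", key_pairs)
        # One set-based statement instead of thousands of per-row queries.
        cur.execute("""UPDATE target t
                       JOIN wanted w ON w.part_a = t.part_a
                                    AND w.part_b = t.part_b
                       SET t.processed = 1""")
        conn.commit()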
5 votes • 4 answers

How to design a Real Time Alerting System?

I have a requirement to send alerts when a record in the db has not been updated/changed within a specified interval. For example, if a received purchase order isn't processed within one hour, a reminder should be sent to the delivery…
Sivasubramaniam Arunachalam • 6,984 • 15 • 71 • 126
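The simplest design is a periodic sweep that selects rows past their deadline and fires an alert for each; a queue with delayed messages is the heavier-weight alternative. A polling sketch with sqlite3 standing in for the real database; the schema and interval are assumptions:

    import sqlite3
    import time

    POLL_SECONDS = 60                     # how often to look for stale orders

    def check_stale(conn):
        # Orders received more than an hour ago and still unprocessed.
        rows = conn.execute("""SELECT id FROM purchase_orders
                               WHERE processed = 0
                                 AND received_at < datetime('now', '-1 hour')
                            """).fetchall()
        for (order_id,) in rows:
            print("ALERT: order", order_id, "not processed within one hour")

    conn = sqlite3.connect("orders.db")   # hypothetical database file
    while True:
        check_stale(conn)
        time.sleep(POLL_SECONDS)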
4 votes • 3 answers

Optimizing MySQL Aggregation Query

I've got a very large table (~100 million records) in MySQL that contains information about files. One of the pieces of information is the modified date of each file. I need to write a query that will count the number of files that fit into specified…
Zenshai • 9,197 • 2 • 17 • 18
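One common fix for this shape of query is a single pass that buckets every row with CASE and groups on the bucket, instead of one COUNT(*) query per date range. A sketch; the table and column names are hypothetical:

    QUERY = """
        SELECT CASE
                 WHEN modified >= NOW() - INTERVAL 1 DAY  THEN 'last day'
                 WHEN modified >= NOW() - INTERVAL 7 DAY  THEN 'last week'
                 WHEN modified >= NOW() - INTERVAL 30 DAY THEN 'last month'
                 ELSE 'older'
               END AS bucket,
               COUNT(*) AS files
        FROM file_info
        GROUP BY bucket
    """

    def count_by_age(cur):
        """cur is any DB-API cursor connected to the MySQL database."""
        cur.execute(QUERY)
        return dict(cur.fetchall())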
4 votes • 2 answers

Trivial task - complex solution?

There is a trivial problem: assign a uniqueidentifier to any externalId; do not overwrite the uniqueidentifier once it is assigned, just return the existing one. Imagine a table: ExternalId | Guid …
Piotr • 767 • 6 • 21
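uniqueidentifier suggests SQL Server, where this is typically an INSERT guarded by NOT EXISTS (or a MERGE) followed by a SELECT. The sketch below shows the same get-or-assign idea with sqlite3's INSERT OR IGNORE, purely as a runnable illustration:

    import sqlite3
    import uuid

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE id_map (
                        external_id TEXT PRIMARY KEY,
                        guid        TEXT NOT NULL)""")

    def get_or_assign(conn, external_id):
        # The insert only wins if the key is new; an existing mapping
        # is never overwritten.
        conn.execute("INSERT OR IGNORE INTO id_map VALUES (?, ?)",
                     (external_id, str(uuid.uuid4())))
        return conn.execute("SELECT guid FROM id_map WHERE external_id = ?",
                            (external_id,)).fetchone()[0]

    assert get_or_assign(conn, "ext-1") == get_or_assign(conn, "ext-1")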
4 votes • 1 answer

Python - Search for items in hundreds of large, gzipped files

Unfortunately, I'm working with an extremely large corpus spread across hundreds of .gz files -- 24 gigabytes (packed) worth, in fact. Python is really my native language (hah) but I was wondering if I haven't run up against a problem that…
Georgina • 291 • 4 • 11
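gzip's file objects iterate line by line without decompressing anything to disk, so a plain streaming scan works and parallelizes naturally per file. A sketch; the glob pattern and search term are placeholders:

    import glob
    import gzip

    def search_corpus(pattern, needle):
        """Yield (path, line number, line) for every match across the corpus."""
        for path in glob.glob(pattern):
            with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
                for lineno, line in enumerate(f, 1):
                    if needle in line:
                        yield path, lineno, line.rstrip()

    for hit in search_corpus("corpus/*.gz", "some term"):
        print(hit)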
4 votes • 1 answer

Storing Large Number of Graph Data Structures in a Database

This question asks about storing a single graph in a relational database. The solution is clear in that case: one table for nodes, one table for edges. I have a graph data structure that evolves over time, so I would like to store "snapshots" of…
Alan Turing • 11,403 • 14 • 66 • 114
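A straightforward extension of the nodes-and-edges answer is to add a snapshot table and stamp every node and edge row with the snapshot it belongs to, so each version of the graph is a filter on snapshot_id. A sqlite3 sketch of that schema; the names are hypothetical:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE snapshots (id INTEGER PRIMARY KEY, taken_at TEXT);
        -- Every node and edge row is stamped with its snapshot, so one
        -- version of the graph is just WHERE snapshot_id = ?.
        CREATE TABLE nodes (snapshot_id INTEGER, node_id INTEGER,
                            PRIMARY KEY (snapshot_id, node_id));
        CREATE TABLE edges (snapshot_id INTEGER, src INTEGER, dst INTEGER,
                            PRIMARY KEY (snapshot_id, src, dst));
    """)
    conn.execute("INSERT INTO snapshots VALUES (1, '2012-01-01')")
    conn.executemany("INSERT INTO nodes VALUES (?, ?)", [(1, 1), (1, 2)])
    conn.execute("INSERT INTO edges VALUES (1, 1, 2)")
    print(conn.execute("SELECT COUNT(*) FROM edges"
                       " WHERE snapshot_id = 1").fetchone())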