Questions tagged [large-data-volumes]
290 questions
72 votes · 10 answers
Designing a web crawler
I have come across the interview question "If you were designing a web crawler, how would you avoid getting into infinite loops?" and I am trying to answer it.
How does it all begin in the first place?
Say Google started with some hub pages, say…
xyz (7,885 rep · 15 gold · 61 silver · 88 bronze)
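The usual answer to the loop question is to canonicalize every URL and keep a visited set, so two spellings of the same page cannot be crawled twice. A minimal sketch, assuming an in-memory map stands in for the web and the class and method names (`CrawlSketch`, `canonical`, `crawl`) are made up for illustration:

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: loop avoidance via URL canonicalization + a visited set.
public class CrawlSketch {
    // Canonicalize so trivially different spellings of the same page collide.
    // Toy version: drops the fragment, query string, and port entirely.
    static String canonical(String url) {
        URI u = URI.create(url).normalize();          // resolves ./ and ../
        String scheme = u.getScheme().toLowerCase();
        String host = u.getHost().toLowerCase();
        String path = u.getPath().isEmpty() ? "/" : u.getPath();
        return scheme + "://" + host + path;
    }

    // Breadth-first crawl over an in-memory "web" (URL -> its outgoing links).
    static List<String> crawl(String seed, Map<String, List<String>> web) {
        Set<String> visited = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        List<String> order = new ArrayList<>();
        String start = canonical(seed);
        visited.add(start);
        frontier.add(start);
        while (!frontier.isEmpty()) {
            String page = frontier.poll();
            order.add(page);
            for (String link : web.getOrDefault(page, List.of())) {
                String c = canonical(link);
                if (visited.add(c)) frontier.add(c);  // add() is false if seen before
            }
        }
        return order;
    }
}
```

Even with a cycle (two pages linking to each other), the `visited.add` check guarantees termination; real crawlers add politeness limits and depth caps on top of this.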
57 votes · 12 answers
Using Hibernate's ScrollableResults to slowly read 90 million records
I simply need to read each row in a table in my MySQL database using Hibernate and write a file based on it. But there are 90 million rows and they are pretty big. So it seemed like the following would be appropriate:
ScrollableResults results =…
at. (45,606 rep · 92 gold · 271 silver · 433 bronze)
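As far as I recall, true server-side streaming through Hibernate on MySQL also needs a streaming fetch-size hint (Connector/J treats `Integer.MIN_VALUE` specially). Independent of the driver, the underlying pattern is to read one bounded chunk at a time so memory stays flat. A sketch of keyset ("seek") pagination, where `ChunkSource` and the `long[] {id, value}` row layout are stand-ins invented for illustration:

```java
import java.util.List;
import java.util.function.Consumer;

// Sketch of keyset pagination: read rows in id order, one bounded chunk at a
// time, so at most chunkSize rows are ever held in memory. fetchChunk stands
// in for a real query such as:
//   SELECT * FROM t WHERE id > ? ORDER BY id LIMIT ?
public class ChunkedReader {
    public interface ChunkSource { List<long[]> fetchChunk(long afterId, int limit); }

    static long readAll(ChunkSource source, int chunkSize, Consumer<long[]> rowHandler) {
        long lastId = Long.MIN_VALUE;
        long total = 0;
        while (true) {
            List<long[]> chunk = source.fetchChunk(lastId, chunkSize);
            if (chunk.isEmpty()) break;
            for (long[] row : chunk) {
                rowHandler.accept(row);   // write the file line here
                lastId = row[0];          // remember the last id seen
            }
            total += chunk.size();
        }
        return total;
    }
}
```

Unlike OFFSET-based paging, each chunk query is cheap even deep into 90 million rows, because it seeks directly to `id > lastId` on the primary key.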
33 votes · 8 answers
Is it possible to change argv or do I need to create an adjusted copy of it?
My application potentially has a huge number of arguments passed in, and I want to avoid the memory hit of duplicating the arguments into a filtered list. I would like to filter them in place, but I am pretty sure that messing with the argv array itself,…
ojblass (19,963 rep · 22 gold · 75 silver · 124 bronze)
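The question itself is about C's `argv`, but the in-place idea is the classic two-pointer compaction, shown here in Java (the document's running language) with a made-up helper name: kept entries are written over the front of the same array, so no second array is allocated.

```java
import java.util.function.Predicate;

// Hypothetical sketch: filter an argument array in place via two-pointer
// compaction. Returns the new logical length; args[0..len) are the kept args.
public class ArgFilter {
    static int filterInPlace(String[] args, Predicate<String> keep) {
        int write = 0;
        for (int read = 0; read < args.length; read++) {
            if (keep.test(args[read])) args[write++] = args[read];
        }
        for (int i = write; i < args.length; i++) args[i] = null; // drop refs
        return write;
    }
}
```

In C the same loop works on `argv` with the caveat the asker raises: the pointers can be rearranged safely, but the memory behind them should be treated as read-only.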
32 votes · 8 answers
Large amounts of data in many text files - how to process?
I have large amounts of data (a few terabytes) and accumulating... They are contained in many tab-delimited flat text files (each about 30MB). Most of the task involves reading the data and aggregating (summing/averaging + additional…
hatmatrix (36,897 rep · 38 gold · 126 silver · 217 bronze)
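For summing/averaging over terabytes of tab-delimited files, a single streaming pass per file keeps memory flat regardless of input size, because only the per-key running totals are held. A minimal sketch, assuming a made-up column layout (key in column 0, numeric value in column 1):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

// Sketch: one streaming pass; memory use is proportional to the number of
// distinct keys, not the number of rows.
public class TabAggregator {
    static Map<String, double[]> aggregate(Reader input) throws IOException {
        Map<String, double[]> totals = new HashMap<>(); // key -> {sum, count}
        try (BufferedReader br = new BufferedReader(input)) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] cols = line.split("\t");
                double[] t = totals.computeIfAbsent(cols[0], k -> new double[2]);
                t[0] += Double.parseDouble(cols[1]);   // running sum
                t[1] += 1;                             // running count
            }
        }
        return totals; // average for a key is sum / count
    }
}
```

Since each 30 MB file is independent, the same pass parallelizes trivially across files, with the per-file maps merged at the end.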
28 votes · 8 answers
Plotting of very large data sets in R
How can I plot a very large data set in R?
I'd like to use a boxplot, violin plot, or similar. The data cannot all fit in memory. Can I incrementally read it in and compute the summaries needed to make these plots? If so, how?
Daniel Arndt (1,787 rep · 1 gold · 11 silver · 18 bronze)
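Exact quantiles need all the data, but a fixed-size uniform random sample is usually enough to drive a boxplot or violin plot, and it can be drawn in one streaming pass with reservoir sampling (Algorithm R). A sketch in Java to match the other examples here; in R itself one would feed the resulting sample to `boxplot()`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.DoubleStream;

// Sketch: keep a uniform random sample of size k from a stream of unknown
// length, using O(k) memory. After i items, each item has probability k/i
// of being in the reservoir.
public class Reservoir {
    static List<Double> sample(DoubleStream data, int k, long seed) {
        Random rng = new Random(seed);
        List<Double> res = new ArrayList<>(k);
        long[] seen = {0};
        data.forEach(x -> {
            seen[0]++;
            if (res.size() < k) {
                res.add(x);                               // fill phase
            } else {
                long j = (long) (rng.nextDouble() * seen[0]); // uniform in [0, seen)
                if (j < k) res.set((int) j, x);           // replace with prob k/seen
            }
        });
        return res;
    }
}
```

For boxplots specifically, streaming quantile sketches (e.g. the P² algorithm) are an alternative when even sampling error is unacceptable.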
23 votes · 7 answers
Efficiently storing 7,300,000,000 rows
How would you tackle the following storage and retrieval problem?
Roughly 2,000,000 rows will be added each day (365 days/year), with the following information per row:
id (unique row identifier)
entity_id (takes on values between 1 and 2,000,000…
knorv (45,461 rep · 71 gold · 205 silver · 289 bronze)
23 votes · 2 answers
JDBC Batch Insert OutOfMemoryError
I have written a method insert() in which I am trying to use JDBC Batch for inserting half a million records into a MySQL database:
public void insert(int nameListId, String[] names) {
String sql = "INSERT INTO name_list_subscribers…
craftsman (13,785 rep · 17 gold · 61 silver · 82 bronze)
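The usual fix for batch-insert OOM is to call `executeBatch()` (and, depending on the driver, commit) every few thousand rows instead of queuing all half a million. A sketch of just the chunking logic, where `BatchSink` is a made-up stand-in for `PreparedStatement.addBatch()` / `executeBatch()` so the flow can be shown without a live database:

```java
import java.util.List;

// Sketch: flush the batch every flushEvery rows so the driver never buffers
// the whole half-million-row batch in memory.
public class ChunkedBatchInsert {
    public interface BatchSink {
        void add(String row);   // stands in for ps.setString(...); ps.addBatch();
        void flush();           // stands in for ps.executeBatch(); conn.commit();
    }

    static void insertAll(List<String> rows, int flushEvery, BatchSink sink) {
        int pending = 0;
        for (String row : rows) {
            sink.add(row);
            if (++pending == flushEvery) {
                sink.flush();
                pending = 0;
            }
        }
        if (pending > 0) sink.flush(); // final partial batch
    }
}
```

With MySQL it is also worth knowing that the Connector/J option `rewriteBatchedStatements=true` collapses a batch into multi-row INSERT statements, which typically helps throughput as well as memory.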
21 votes · 2 answers
Docker Data Volume Container - Can I share across swarm
I know how to create and mount a data volume container to multiple other containers using --volumes-from, but I do have a few questions regarding its usage and limitations:
Situation: I am looking to use a data volume container to store user…
deankarn (442 rep · 2 gold · 5 silver · 16 bronze)
21 votes · 4 answers
What changes when your input is giga/terabyte sized?
I just took my first baby step into real scientific computing today when I was shown a data set where the smallest file is 48000 fields by 1600 rows (haplotypes for several people, for chromosome 22). And this is considered tiny.
I write…
Wang (2,987 rep · 1 gold · 18 silver · 31 bronze)
18 votes · 6 answers
How to avoid OOM (Out of memory) error when retrieving all records from huge table?
I am given a task to convert a huge table to custom XML file. I will be using Java for this job.
If I simply issue a "SELECT * FROM customer", it may return a huge amount of data, eventually causing an OOM error. I wonder, is there a way I can process the…
janetsmith (7,972 rep · 11 gold · 48 silver · 72 bronze)
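For table-to-XML conversion specifically, the output side matters as much as the query side: each element should be written as soon as its row is read, so neither the result set nor the XML document is ever fully in memory. StAX (`XMLStreamWriter`) is the standard JDK API for this; in the sketch below an `Iterator` stands in for a streamed `ResultSet`, and the element names are invented:

```java
import java.io.StringWriter;
import java.util.Iterator;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Sketch: stream rows straight into XML with StAX. Memory use is constant
// in the number of rows; only the writer's small buffer is held.
public class XmlStreamDump {
    static String dump(Iterator<String> names) throws XMLStreamException {
        StringWriter out = new StringWriter(); // a FileWriter in the real job
        XMLStreamWriter xml =
            XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeStartElement("customers");
        while (names.hasNext()) {              // one element per fetched row
            xml.writeStartElement("customer");
            xml.writeCharacters(names.next());
            xml.writeEndElement();
        }
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.close();
        return out.toString();
    }
}
```

Paired with a streaming or chunked SELECT on the JDBC side, this keeps the whole conversion at a small, fixed memory footprint however large the table is.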
17 votes · 4 answers
How to do page navigation for many, many pages? Logarithmic page navigation
What's the best way of displaying page navigation for many, many pages?
(Initially this was posted as a how-to tip with my answer included in the question. I've now split my answer off into the "answers" section below).
To be more…
Doin (6,230 rep · 3 gold · 31 silver · 31 bronze)
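The "logarithmic" idea is to show the current page's close neighbours plus pages at exponentially growing distances (±1, ±2, ±4, ±8, …), so even tens of thousands of pages yield a short link bar. A sketch, where the exact spacing and the `LogPager` name are just one possible design:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Sketch: build the sorted, de-duplicated set of page numbers to link to.
// Roughly 2*log2(last) links regardless of how many pages exist.
public class LogPager {
    static List<Integer> pageLinks(int current, int last) {
        TreeSet<Integer> pages = new TreeSet<>();
        pages.add(1);
        pages.add(last);
        pages.add(current);
        for (int step = 1; step <= last; step *= 2) {
            pages.add(current - step);      // doubling distances to the left
            pages.add(current + step);      // and to the right
        }
        pages.removeIf(p -> p < 1 || p > last);
        return new ArrayList<>(pages);      // TreeSet keeps them sorted
    }
}
```

For page 500 of 10,000 this yields around two dozen links instead of ten thousand, while still allowing any page to be reached in a logarithmic number of clicks.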
16 votes · 5 answers
Transferring large payloads of data (Serialized Objects) using wsHttp in WCF with message security
I have a case where I need to transfer large amounts of serialized object graphs (via NetDataContractSerializer) using WCF using wsHttp. I'm using message security and would like to continue to do so. Using this setup I would like to transfer…
jpierson (13,736 rep · 10 gold · 94 silver · 137 bronze)
16 votes · 2 answers
Bad idea to transfer large payload using web services?
I gather that there basically isn't a limit to the amount of data that can be sent when using REST via a POST or GET. While I haven't used REST or web services it seems that most services involve transferring limited amounts of data. If you want…
Marcus Leon (50,921 rep · 112 gold · 279 silver · 413 bronze)
12 votes · 10 answers
Fastest way to search a 1GB+ string of data for the first occurrence of a pattern in Python
There's a 1 Gigabyte string of arbitrary data which you can assume to be equivalent to something like:
1_gb_string=os.urandom(1*gigabyte)
We will be searching this string, 1_gb_string, for an infinite number of fixed width, 1 kilobyte patterns,…
user213060 (1,249 rep · 3 gold · 18 silver · 25 bronze)
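The question is Python-specific, but the standard approach for many fixed-width patterns is language-agnostic: Rabin-Karp. Hash each pattern once, then one rolling-hash pass over the big buffer checks every pattern at every offset in amortized O(1) per byte. A minimal sketch in Java, to match the other examples here (constants and names are illustrative choices):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of Rabin-Karp multi-pattern search over fixed-width patterns.
public class MultiPatternSearch {
    static final long BASE = 257, MOD = 1_000_000_007L;

    static long hash(byte[] data, int from, int len) {
        long h = 0;
        for (int i = from; i < from + len; i++) h = (h * BASE + (data[i] & 0xFF)) % MOD;
        return h;
    }

    // First offset where any of the fixed-width patterns occurs, or -1.
    static int firstMatch(byte[] haystack, List<byte[]> patterns) {
        int w = patterns.get(0).length;          // all patterns share width w
        if (haystack.length < w) return -1;
        Map<Long, List<byte[]>> byHash = new HashMap<>();
        for (byte[] p : patterns)
            byHash.computeIfAbsent(hash(p, 0, w), k -> new ArrayList<>()).add(p);
        long pow = 1;                            // BASE^(w-1) mod MOD
        for (int i = 1; i < w; i++) pow = (pow * BASE) % MOD;
        long h = hash(haystack, 0, w);
        for (int i = 0; i + w <= haystack.length; i++) {
            List<byte[]> cands = byHash.get(h);
            if (cands != null)                   // verify to rule out collisions
                for (byte[] p : cands)
                    if (Arrays.equals(haystack, i, i + w, p, 0, w)) return i;
            if (i + w < haystack.length) {       // roll the window one byte right
                long drop = ((haystack[i] & 0xFF) * pow) % MOD;
                h = ((h - drop + MOD) % MOD * BASE + (haystack[i + w] & 0xFF)) % MOD;
            }
        }
        return -1;
    }
}
```

With 1 KB patterns the per-offset verification only fires on hash hits, so a pass over a 1 GB buffer does roughly one multiply-add per byte plus the rare full comparison.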
11 votes · 7 answers
Fastest way to insert a very large number of records into a table in SQL
The problem is, we have a huge number of records (more than a million) to be inserted into a single table from a Java application. The records are created by the Java code, it's not a move from another table, so INSERT/SELECT won't help.
Currently,…
Iravanchi (4,969 rep · 9 gold · 38 silver · 54 bronze)