Questions tagged [large-data-volumes]

290 questions

72 votes • 10 answers

Designing a web crawler

I have come across an interview question: "If you were designing a web crawler, how would you avoid getting into infinite loops?" and I am trying to answer it. How does it all begin? Say Google started with some hub pages, say…
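
A common answer is to canonicalize every URL and keep a set of already-visited URLs, so the same page can never be enqueued twice even when pages link in cycles. A minimal Java sketch with illustrative names (the fetch/parse step is left as a stub):

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Minimal sketch: breadth-first crawl that avoids infinite loops by
// normalizing every URL and refusing to enqueue one it has already seen.
public class Crawler {
    private final Set<String> visited = new HashSet<>();
    private final Queue<String> frontier = new ArrayDeque<>();

    public void crawl(String seed) {
        enqueue(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            for (String link : fetchAndExtractLinks(url)) {
                enqueue(link);
            }
        }
    }

    private void enqueue(String url) {
        String canonical = normalize(url);
        if (canonical != null && visited.add(canonical)) { // add() returns false if already seen
            frontier.add(canonical);
        }
    }

    // Strip the fragment and lower-case the host (dropping port/userinfo for
    // simplicity) so trivially different spellings of a page don't create a cycle.
    private String normalize(String url) {
        try {
            URI u = URI.create(url).normalize();
            String host = u.getHost() == null ? null : u.getHost().toLowerCase();
            return new URI(u.getScheme(), host, u.getPath(), u.getQuery(), null).toString();
        } catch (Exception e) {
            return null; // unparseable link: skip it
        }
    }

    // Placeholder: a real crawler would download the page and parse its links.
    private Iterable<String> fetchAndExtractLinks(String url) {
        return java.util.List.of();
    }
}
```

A depth or page-count limit per host is the usual second safeguard against crawler traps that generate unlimited distinct URLs.
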
57 votes • 12 answers

Using Hibernate's ScrollableResults to slowly read 90 million records

I simply need to read each row in a table in my MySQL database using Hibernate and write a file based on it. But there are 90 million rows and they are pretty big. So it seemed like the following would be appropriate: ScrollableResults results =…
at. • 45,606
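
The usual fix for this pattern is a forward-only scroll combined with periodically clearing the Hibernate session, so the first-level cache cannot grow with the row count; with MySQL's Connector/J specifically, the result set is only streamed when the fetch size is Integer.MIN_VALUE (otherwise the driver buffers everything client-side). A rough sketch with illustrative entity and method names:

```java
import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.Session;

public class Exporter {
    // Scroll-and-clear: iterate with a forward-only cursor and periodically
    // clear the session so the first-level cache does not grow without bound.
    // "Row" and writeToFile(...) are illustrative names, not from the question.
    void export(Session session) {
        ScrollableResults results = session
                .createQuery("from Row")
                .setReadOnly(true)
                .setFetchSize(Integer.MIN_VALUE) // tell MySQL Connector/J to stream, not buffer
                .scroll(ScrollMode.FORWARD_ONLY);
        int count = 0;
        while (results.next()) {
            Object row = results.get(0);
            writeToFile(row);
            if (++count % 1000 == 0) {
                session.clear();                 // evict managed entities to keep memory flat
            }
        }
        results.close();
    }

    void writeToFile(Object row) { /* append one line to the output file */ }
}
```
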
33 votes • 8 answers

Is it possible to change argv or do I need to create an adjusted copy of it?

My application potentially has a huge number of arguments passed in, and I want to avoid the memory hit of duplicating the arguments into a filtered list. I would like to filter them in place, but I am pretty sure that messing with the argv array itself,…
ojblass • 19,963

32 votes • 8 answers

large amount of data in many text files - how to process?

I have a large amount of data (a few terabytes) and it is accumulating... It is contained in many tab-delimited flat text files (each about 30MB). Most of the task involves reading the data and aggregating (summing/averaging + additional…
hatmatrix • 36,897
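
Whatever tooling ends up doing the heavy lifting, the key property is a single streaming pass per file that keeps only running aggregates, never the rows themselves. A small Java sketch, assuming the value to sum sits in a known tab-separated column:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// One pass per ~30 MB file, constant memory: keep only a running sum and
// row count. The directory argument and column index (2) are assumptions.
public class Aggregate {
    public static void main(String[] args) throws IOException {
        double sum = 0;
        long rows = 0;
        try (DirectoryStream<Path> files =
                     Files.newDirectoryStream(Paths.get(args[0]), "*.txt")) {
            for (Path file : files) {
                try (BufferedReader in = Files.newBufferedReader(file)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] cols = line.split("\t");
                        sum += Double.parseDouble(cols[2]);
                        rows++;
                    }
                }
            }
        }
        System.out.printf("rows=%d sum=%f mean=%f%n", rows, sum, sum / rows);
    }
}
```

Because each file is independent, this kind of pass also parallelizes trivially across files or machines once a single-threaded version works.
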
28 votes • 8 answers

Plotting of very large data sets in R

How can I plot a very large data set in R? I'd like to use a boxplot, violin plot, or similar. The data cannot all fit in memory. Can I incrementally read it in and calculate the summaries needed to make these plots? If so, how?
Daniel Arndt • 1,787

23 votes • 7 answers

Efficiently storing 7,300,000,000 rows

How would you tackle the following storage and retrieval problem? Roughly 2,000,000 rows will be added each day (365 days/year) with the following information per row: id (unique row identifier) entity_id (takes on values between 1 and 2,000,000…
knorv • 45,461

23 votes • 2 answers

JDBC Batch Insert OutOfMemoryError

I have written a method insert() in which I am trying to use JDBC Batch for inserting half a million records into a MySQL database: public void insert(int nameListId, String[] names) { String sql = "INSERT INTO name_list_subscribers…
craftsman • 13,785
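
The OutOfMemoryError here usually comes from accumulating all half-million parameter sets in a single batch before calling executeBatch(). Flushing the batch every few thousand rows keeps memory flat; a sketch loosely following the fragment in the question (the column names and batch size are assumptions):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class Subscribers {
    // Execute the batch every few thousand rows instead of holding all
    // 500k parameter sets in memory, which is what typically triggers the OOM.
    public void insert(Connection conn, int nameListId, String[] names) throws SQLException {
        String sql = "INSERT INTO name_list_subscribers (name_list_id, name) VALUES (?, ?)";
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int inBatch = 0;
            for (String name : names) {
                ps.setInt(1, nameListId);
                ps.setString(2, name);
                ps.addBatch();
                if (++inBatch % 5000 == 0) {
                    ps.executeBatch();   // flush the accumulated batch to the server
                    ps.clearBatch();     // and discard it from the driver's memory
                }
            }
            ps.executeBatch();           // remainder
            conn.commit();
        }
    }
}
```
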
21 votes • 2 answers

Docker Data Volume Container - Can I share across swarm

I know how to create and mount a data volume container to multiple other containers using --volumes-from, but I do have a few questions regarding its usage and limitations: Situation: I am looking to use a data volume container to store user…
deankarn • 442

21 votes • 4 answers

what changes when your input is giga/terabyte sized?

I just took my first baby step into real scientific computing today when I was shown a data set where the smallest file is 48000 fields by 1600 rows (haplotypes for several people, for chromosome 22). And this is considered tiny. I write…
Wang • 2,987

18 votes • 6 answers

How to avoid OOM (Out of memory) error when retrieving all records from huge table?

I am given a task to convert a huge table to a custom XML file. I will be using Java for this job. If I simply issue a "SELECT * FROM customer", it may return a huge amount of data, eventually causing an OOM. I wonder, is there a way I can process the…
janetsmith • 7,972
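
The standard way around this is to stream the result set instead of materializing it: a forward-only, read-only statement with a fetch-size hint, writing each row's XML and letting it go. (If the database is MySQL, Connector/J only streams when the fetch size is Integer.MIN_VALUE.) A hedged sketch, with the XML-writing step left as a stub:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class CustomerExporter {
    // Stream rows instead of loading them all: a forward-only, read-only
    // statement plus a fetch-size hint lets the driver pull rows in chunks.
    public void export(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            st.setFetchSize(1000);                 // use Integer.MIN_VALUE for MySQL Connector/J
            try (ResultSet rs = st.executeQuery("SELECT * FROM customer")) {
                while (rs.next()) {
                    writeXmlFragment(rs);          // handle one row, then let it go
                }
            }
        }
    }

    private void writeXmlFragment(ResultSet row) throws SQLException {
        // append one element to the output stream (illustrative stub)
    }
}
```
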
17 votes • 4 answers

How to do page navigation for many, many pages? Logarithmic page navigation

What's the best way of displaying page navigation for many, many pages? (Initially this was posted as a how-to tip with my answer included in the question. I've now split my answer off into the "answers" section below). To be more…
Doin • 6,230
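
The asker's own scheme is in the answers below; as a flavour of the idea, one way to get a logarithmic pager is to always show the first, last, and current pages plus pages whose distance from the current one doubles at each step, so even millions of pages collapse to a short navigation bar. A small illustrative sketch (not necessarily the asker's exact scheme):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class LogNav {
    // Logarithmic navigation: first, last and current pages, plus pages at
    // distances 1, 2, 4, 8, ... from the current page in both directions.
    public static List<Integer> links(int current, int totalPages) {
        TreeSet<Integer> pages = new TreeSet<>();
        pages.add(1);
        pages.add(totalPages);
        pages.add(current);
        for (int step = 1; step < totalPages; step *= 2) {
            if (current - step >= 1) pages.add(current - step);
            if (current + step <= totalPages) pages.add(current + step);
        }
        return new ArrayList<>(pages);
    }

    public static void main(String[] args) {
        // e.g. page 500 of 10000: prints a short, roughly log-spaced list
        // of page numbers spanning 1..10000.
        System.out.println(links(500, 10000));
    }
}
```
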
16 votes • 5 answers

Transferring large payloads of data (Serialized Objects) using wsHttp in WCF with message security

I have a case where I need to transfer large amounts of serialized object graphs (via NetDataContractSerializer) using WCF using wsHttp. I'm using message security and would like to continue to do so. Using this setup I would like to transfer…

16 votes • 2 answers

Bad idea to transfer large payload using web services?

I gather that there basically isn't a limit to the amount of data that can be sent when using REST via a POST or GET. While I haven't used REST or web services it seems that most services involve transferring limited amounts of data. If you want…
Marcus Leon • 50,921

12 votes • 10 answers

Fastest way to search a 1GB+ string of data for the first occurrence of a pattern in Python

There's a 1 Gigabyte string of arbitrary data which you can assume to be equivalent to something like: 1_gb_string=os.urandom(1*gigabyte) We will be searching this string, 1_gb_string, for an infinite number of fixed width, 1 kilobyte patterns,…
user213060 • 1,249
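
The question is about Python, but one standard approach is language-independent: a rolling (Rabin-Karp) hash, so each of the roughly 10^9 candidate positions costs O(1) instead of a 1 KB byte comparison. Sketched in Java here for consistency with the other examples:

```java
// Rabin-Karp: hash the pattern once, then slide a rolling hash over the data;
// only positions whose hash matches are verified byte-by-byte.
public class RabinKarp {
    private static final long BASE = 256, MOD = 1_000_000_007L;

    public static int indexOf(byte[] data, byte[] pattern) {
        int m = pattern.length;
        if (m == 0 || m > data.length) return -1;

        long pow = 1;                           // BASE^(m-1) mod MOD
        for (int i = 1; i < m; i++) pow = pow * BASE % MOD;

        long target = 0, rolling = 0;           // hash of pattern / current window
        for (int i = 0; i < m; i++) {
            target  = (target  * BASE + (pattern[i] & 0xFF)) % MOD;
            rolling = (rolling * BASE + (data[i]    & 0xFF)) % MOD;
        }
        for (int i = 0; ; i++) {
            if (rolling == target && matches(data, i, pattern)) return i;
            if (i + m >= data.length) return -1;
            // slide the window: drop data[i], append data[i + m]
            rolling = (rolling - (data[i] & 0xFF) * pow % MOD + MOD) % MOD;
            rolling = (rolling * BASE + (data[i + m] & 0xFF)) % MOD;
        }
    }

    private static boolean matches(byte[] data, int from, byte[] pattern) {
        for (int j = 0; j < pattern.length; j++)
            if (data[from + j] != pattern[j]) return false;
        return true;
    }
}
```

If many patterns must be searched against the same data, precomputing an index (for example a suffix array) over the 1 GB once can beat repeated scans, at the cost of extra memory.
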
11 votes • 7 answers

Fastest way for inserting very large number of records into a Table in SQL

The problem is, we have a huge number of records (more than a million) to be inserted into a single table from a Java application. The records are created by the Java code; it's not a move from another table, so INSERT/SELECT won't help. Currently,…
Iravanchi • 4,969