-3

Find the median of all numbers in a 500 GB file whose name is given at the command prompt.

File format, e.g.:

12 
4
98
3

with one number on each line (numbers can be repeated). Can anyone please help with how to approach this in Java? If the file has to be split, how can the median then be calculated? I have come across several posts on computing the median, but couldn't find the best approach for such a huge file.

Siri
  • 4
    Possible duplicate of [Finding median of large set of numbers too big to fit into memory](https://stackoverflow.com/questions/3888036/finding-median-of-large-set-of-numbers-too-big-to-fit-into-memory) – juzraai Aug 02 '18 at 08:19
  • @juzraai I checked this link before posting here, but I could not find the correct answer there – Siri Aug 02 '18 at 08:22
  • Huh? Then what is the correct answer, or what do you expect from the correct answer? – user85421 Aug 02 '18 at 08:37
  • @CarlosHeuberger I am looking for an approach in Java that I can understand, not in Scala, as I am new to Java coding and am thinking about the memory aspect – Siri Aug 02 '18 at 08:44
  • Do I understand it right, that you can read the numbers only once from the command prompt? So it's not a file you can read multiple times? – Ridcully Aug 02 '18 at 08:44
  • 2
    Scala? I could not find any reference to Scala in that question or its answers. The answers are about how to do it, that is, the algorithm! – user85421 Aug 02 '18 at 08:49
  • @Ridcully The name of the 500 GB file, which contains one number on each line, is given at the command prompt – Siri Aug 02 '18 at 08:57
  • Well, median is different from average, so does it have to filter out the extremes? Or do you really mean average? – coladict Aug 02 '18 at 09:06
  • 1
    @Siri this has nothing to do with Scala or Java. If you don't know the algorithm, then the language doesn't matter. Computing the median of BIG datasets can be tricky (as already commented: don't you need the mean/average?). If you know the algorithm, then what specific problem can't you solve? – gusto2 Aug 02 '18 at 09:08
  • What do you know about the **range** of the numbers? – MBo Aug 02 '18 at 09:08

2 Answers

0

This doesn't cover the calculation itself, but here is how you can read the file in small parts (one line at a time), so that you don't run out of memory.

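import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

try (
    InputStream fis = Files.newInputStream(Paths.get(fileName), StandardOpenOption.READ);
    BufferedReader book = new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8));
) {
    String line;
    long cnt = 0;
    while ((line = book.readLine()) != null) {
        cnt++;
        // trim() because a line may carry trailing spaces, which BigInteger rejects
        BigInteger data = new BigInteger(line.trim());
        // ... handle the data
        if (cnt % 500 == 0) System.gc(); // request a garbage collection (only a hint to the JVM)
    }
}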

I recently needed to import a 50 MB file that gave me out-of-memory errors with a 2 GB memory limit, just because of all the extra metadata that Java keeps for each object, and this method helped me get through it.

coladict
0

A 500 GB file with one number on each line [not necessarily unique numbers, represented as strings of decimal digits]:
that's 250_000_000_000L numbers at most, each with no more than twice that many digits; whether signs occur is not specified.

Assuming you can allocate 1 GB of long counters, you can, in a first pass, count how many numbers there are of each length below 25 million digits, as well as the total number of numbers.
From these counts, determine the (sign and) length of the digit string that represents your median.
In subsequent passes, narrow down the range for your median, starting with the numbers whose representation has that same (sign and) length.
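
The passes described above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not a tuned implementation: every line is assumed to be a valid decimal integer with an optional sign, the file can be read as often as needed, and for an even count the lower of the two middle values is returned. The names `DigitLengthMedian`, `normalize`, `signedLength`, `median` and `linesOf` are made up for this example. A `TreeMap` stands in for the array of long counters, and each later pass fixes just one more digit of the median, so a real run over 500 GB would want to narrow the range by several digits per pass.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;
import java.util.stream.Stream;

/** Sketch: lower median of decimal integers of any length, using several passes. */
final class DigitLengthMedian {

    /** Normalises a line: trims, drops '+' and leading zeros, maps "-0" to "0". */
    static String normalize(String line) {
        String s = line.trim();
        boolean negative = s.startsWith("-");
        if (negative || s.startsWith("+")) s = s.substring(1);
        int i = 0;
        while (i < s.length() - 1 && s.charAt(i) == '0') i++;
        s = s.substring(i);
        return (negative && !s.equals("0")) ? "-" + s : s;
    }

    /** Signed digit count: negative for negative numbers, so keys sort like the values. */
    static int signedLength(String n) {
        return n.startsWith("-") ? -(n.length() - 1) : n.length();
    }

    /** Returns the element at 0-based rank (n - 1) / 2 of the sorted input. */
    static String median(Supplier<Stream<String>> source) {
        // Pass 1: count the numbers per (sign, length) bucket.
        TreeMap<Integer, Long> perLength = new TreeMap<>();
        try (Stream<String> lines = source.get()) {
            lines.filter(l -> !l.trim().isEmpty())
                 .forEach(l -> perLength.merge(signedLength(normalize(l)), 1L, Long::sum));
        }
        long total = perLength.values().stream().mapToLong(Long::longValue).sum();
        if (total == 0) throw new IllegalArgumentException("no numbers");

        // Locate the bucket holding the median and the rank inside that bucket.
        long rank = (total - 1) / 2;
        int bucket = 0;
        for (Map.Entry<Integer, Long> e : perLength.entrySet()) {
            if (rank < e.getValue()) { bucket = e.getKey(); break; }
            rank -= e.getValue();
        }

        // Later passes: fix one more digit of the median per pass.
        boolean negative = bucket < 0;
        String prefix = negative ? "-" : "";
        int targetLength = Math.abs(bucket) + prefix.length();
        while (prefix.length() < targetLength) {
            long[] byDigit = new long[10];
            final String p = prefix;
            final int wanted = bucket;
            try (Stream<String> lines = source.get()) {
                lines.filter(l -> !l.trim().isEmpty())
                     .map(DigitLengthMedian::normalize)
                     .filter(n -> signedLength(n) == wanted && n.startsWith(p))
                     .forEach(n -> byDigit[n.charAt(p.length()) - '0']++);
            }
            // For negative numbers a larger digit means a smaller value.
            for (int step = 0; step < 10; step++) {
                int d = negative ? 9 - step : step;
                if (rank < byDigit[d]) { prefix = prefix + d; break; }
                rank -= byDigit[d];
            }
        }
        return prefix;
    }

    /** Re-openable line source for a file, so that each pass starts from the top. */
    static Supplier<Stream<String>> linesOf(Path file) {
        return () -> {
            try { return Files.lines(file); }
            catch (IOException e) { throw new UncheckedIOException(e); }
        };
    }

    public static void main(String[] args) {
        System.out.println(median(linesOf(Paths.get(args[0]))));
    }
}
```

Run as `java DigitLengthMedian <file>`; memory use depends on the number of distinct digit lengths rather than on the number of lines, since only counters are kept between passes.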

greybeard