
Problem Statement

I'm given a very large list of numbers, one at a time, and I need to print the "median number".

To be more precise: there can be 125,000,000 numbers, and each number is guaranteed to be less than 1e18.

It's for a contest, so there is a memory limit (20 MB at most) and a time limit (5 seconds at most).

"The median number" is the one that is in the middle of sorted numbers.
For instance if this is the list of numbers:

23
8
16
42
15
4
108

After sorting numbers:

1) 4
2) 8
3) 15
4) 16
5) 23
6) 42
7) 108

"The median number" would be 16;

I searched the Internet for this, but I couldn't find any answer that passes these limits.


Approach

My approach was to get all the numbers, save them to a text file, sort them, and then get the median number.


Ideas

  1. I know I can read all the numbers from the file, put them in a vector, and easily sort them.
    But this would exceed the memory limit.

  2. So I came up with the idea of keeping the numbers sorted as I write them to the text file.
    There would be a loop that, after getting the next number from the console, reads the file line by line and, when it reaches the right place, inserts the number there without touching the other numbers.
    But the problem is that I can't insert a line in the middle of a text file, because it would overwrite the numbers that follow.

  3. So I created two files: one holds the numbers that have already been entered, and the other reads the first file and copies its numbers into itself until it reaches the right place, then inserts the newly given number and continues copying the remaining numbers.
    But this takes too much time, so it exceeds the time limit.

Request

So I want to either optimize one of these ideas so that it passes the limits, or find a new idea that does.


Preference

I would prefer to use the second idea because, unlike the other two, it could pass the limits; but I can't implement it, because I don't know how to insert a line in the middle of a text file. If I learn how to do that, the rest of the process will be easy.


Attempted Solution

This is a function that receives a number and, by reading through a file, finds the best place for it and puts it there.
As a matter of fact, it implements my third idea.
It works (I tested it with lots of inputs), but the problem, as I mentioned before, is the time limit.

// Implements idea 3: copies the currently active file into the other file,
// writing the new number at its sorted position along the way.
// Relies on two things defined elsewhere:
//   bool isFirstCombinedFileActive;                 // which file is current
//   long long characterArrayToDecimal(const char*); // parses a line into a number
void insertNewCombinedNumber ( long long combinedNumber ) // long long: values approach 1e18, too big for int
{
    char combinedNumberCharacterArray[ 20 ]; // up to 19 digits plus the terminating '\0'
    bool isInserted = false;

    const char *firstFileName  = "Combined Numbers - File 01.txt";
    const char *secondFileName = "Combined Numbers - File 02.txt";

    // Write to the active file, read from the file written on the previous call.
    const char *outputFileName = isFirstCombinedFileActive ? firstFileName  : secondFileName;
    const char *inputFileName  = isFirstCombinedFileActive ? secondFileName : firstFileName;

    ofstream combinedNumbersOutputFile ( outputFileName );
    ifstream combinedNumbersInputFile ( inputFileName );

    // On the very first call the input file doesn't exist yet: create it empty.
    if ( !combinedNumbersInputFile )
    {
        combinedNumbersInputFile.close ();

        ofstream combinedNumbersInputCreateFile ( inputFileName );
        combinedNumbersInputCreateFile.close ();

        combinedNumbersInputFile.open ( inputFileName );
    }

    // Copy existing numbers across, inserting the new number before the
    // first value that is greater than or equal to it.
    while ( combinedNumbersInputFile.getline ( combinedNumberCharacterArray , 20 ) )
    {
        if ( !isInserted && combinedNumber <= characterArrayToDecimal ( combinedNumberCharacterArray ) )
        {
            combinedNumbersOutputFile << combinedNumber << '\n';
            isInserted = true;
        }

        combinedNumbersOutputFile << combinedNumberCharacterArray << '\n';
    }

    // The new number is larger than everything seen so far: append it.
    if ( !isInserted )
        combinedNumbersOutputFile << combinedNumber << '\n';

    // Swap the roles of the two files for the next call.
    isFirstCombinedFileActive = !isFirstCombinedFileActive;

    combinedNumbersOutputFile.close ();
    combinedNumbersInputFile.close ();
}
  • Read 20 MB of numbers at a time, sort each subset, write each subset to a different file, merge the files. – jxh Jul 09 '18 at 17:32
  • If I was faced with this problem, I would research different sorting algorithms (there are plenty) and see if any could be used to help deal with these constraints. And as @NathanOliver pointed out, this [middle number](https://en.wikipedia.org/wiki/Median) might be available without fully sorting... – Max Langhof Jul 09 '18 at 17:32
  • @Bijan Wait... are you directly given a file, or are you just given the numbers one at a time? – Justin Jul 09 '18 at 18:38
  • @Justin As I said, it is part of a contest. There are 500 numbers that the user passes to the program; then the program has to generate all possible numbers using the function below, and finally it has to print the number in the middle of the sorted numbers. f ( x , y , z ) = z * ( y * ( x + 1 ) + 1 ). – BijanProgrammer Jul 09 '18 at 18:43
  • @Bijan So do you even have to write the file, or can you just compute the middle number? – Justin Jul 09 '18 at 18:44
  • @Justin It isn't necessary to write the file, but if I store all the numbers in memory it will exceed the memory limit. – BijanProgrammer Jul 09 '18 at 18:47
  • @Justin And I also have to print the most frequent "combinedNumber". – BijanProgrammer Jul 09 '18 at 18:48
  • @Bijan FYI, these are called *median* and *mode*. My point was that maybe you don't have to store the numbers in memory. You might be able to use the pattern caused by `f(x, y, z)` to compute the median and mode more easily. – Justin Jul 09 '18 at 18:50
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/174674/discussion-between-bijan-and-justin). – BijanProgrammer Jul 09 '18 at 18:56
  • If you have `125,000,000 * 8-byte` numbers you have a 1GB dataset and are limited to 20MB of memory (which can hold at most `2,500,000` 8-byte values). So you will need a minimum of 50 subsets in a divide and conquer approach. If you have a Linux based system, you can `mmap` the areas to hold the subsets and sort your data 20M at a time and then merge from the mapped areas. It is a bit of a C solution. – David C. Rankin Jul 09 '18 at 19:46
  • @DavidC.Rankin I'm trying to put this solution in my code, I will let you know if I succeeded. Thank you for your help. – BijanProgrammer Jul 09 '18 at 21:10
  • How many CPUs are you allowed to use? – jxh Jul 10 '18 at 01:37

4 Answers


Assumptions:

I will assume the list of numbers is already in binary form (because we will need multiple passes through the data, and converting text to binary each time would cost extra processing time). That would be a file of 1GB (125M * 64-bit).

It is also not clear whether the OS disk cache for that file counts toward the memory limit. I will assume it doesn't, because reading a 1GB file cold from disk multiple times would already take more than 5 seconds.

Solution:

So let's start with a simple example of how this could be done (we will optimize and adjust this later):

  • First create a histogram of ranges of numbers (say groups of 1 million as an example, but this won't work yet - see below).
  • That is, create a uint32 array of size max value / 1 million (too big for now) in which we store the count for each bucket (0-999999, 1000000-1999999, and so on).
  • Loop through the list of numbers and each time increment the n-th value of the array (the bucket the number belongs to).
  • Now that we have an array of counts, we can easily calculate in which bucket (or range) the median falls.
  • Loop through the list again, and this time store only the numbers that fall in that range in an array.
  • Sort that array and calculate which item is the median (also using the counts of all the buckets).

Of course, we need to adjust the above a bit.

First, instead of using ranges of 1 million, it's better to use a power of two. That way we can simply use a bitwise AND with a mask (or a shift) to get the position in the bucket/count list, instead of a more expensive division.

Second, with bucket ranges of 1 million we would have to create an array that is way too big.

So the best option is to do 3 passes: first with large ranges (on the order of 1e12), and then, for the range the median is in, a second counting pass with ranges on the order of 1e6 (in both cases using powers of two in practice).

This way you only have to sort the numbers belonging to one small bucket instead of the whole set of 125 million. Sorting takes O(n log n).
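
For illustration, here is a minimal C++ sketch of this approach, under assumptions of mine that are not part of the answer: the numbers are raw 64-bit values in a file called "numbers.bin", the coarse buckets are 2^40 wide and the fine buckets 2^20 wide (every value fits below 2^60, since all are less than 1e18), and the final bucket is small enough to sort in memory.

#include <algorithm>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

// One counting pass: histogram of the values inside [lo, hi), with the bucket
// index computed by a cheap shift instead of a division. Returns the total
// number of values in the file.
uint64_t countPass(const char* path, std::vector<uint32_t>& counts,
                   uint64_t lo, uint64_t hi, int shift)
{
    std::ifstream in(path, std::ios::binary);
    uint64_t x, total = 0;
    while (in.read(reinterpret_cast<char*>(&x), sizeof x)) {
        if (x >= lo && x < hi)
            ++counts[(x - lo) >> shift];
        ++total;
    }
    return total;
}

int main()
{
    const char* path = "numbers.bin"; // hypothetical input file

    // Pass 1: 2^20 coarse buckets of width 2^40 covering [0, 2^60).
    std::vector<uint32_t> counts(1u << 20, 0); // 4MB of counts
    uint64_t n = countPass(path, counts, 0, 1ull << 60, 40);
    uint64_t target = (n - 1) / 2; // 0-based index of the median

    // Find the coarse bucket containing the median.
    uint64_t before = 0;
    size_t b = 0;
    while (before + counts[b] <= target) before += counts[b++];
    uint64_t lo = uint64_t(b) << 40, hi = lo + (1ull << 40);

    // Pass 2: 2^20 fine buckets of width 2^20 inside the median's coarse bucket.
    std::fill(counts.begin(), counts.end(), 0); // reuse the same 4MB array
    countPass(path, counts, lo, hi, 20);
    size_t f = 0;
    while (before + counts[f] <= target) before += counts[f++];
    lo += uint64_t(f) << 20;
    hi = lo + (1ull << 20);

    // Pass 3: collect only the median's fine bucket, sort it, pick the median.
    std::vector<uint64_t> bucket;
    std::ifstream in(path, std::ios::binary);
    for (uint64_t x; in.read(reinterpret_cast<char*>(&x), sizeof x); )
        if (x >= lo && x < hi) bucket.push_back(x);
    std::sort(bucket.begin(), bucket.end());
    std::cout << bucket[target - before] << '\n';
}

The count array is 4MB (reused between passes), so the sketch stays well under the 20MB limit as long as the final bucket is small; for a skewed distribution where one fine bucket is still huge, a further splitting pass would be needed.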


An example with the numbers given in the question:

23
8
16
42
15
4
108

Use buckets/ranges of 16 - first pass:

array_pos   count
0 (0-15)      3
1 (16-31)     2
2 (32-47)     1
3 (48-63)     0
4 (64-79)     0
5 (80-95)     0
6 (96-111)    1

We can now calculate that the median must be in the bucket at array_pos 1.

remember/store these values:
Count before bucket 16-31: 3
Count  after bucket 16-31: 2

Second pass - read the values for bucket (16-31) (again, if the bucket sizes are a power of two, we can use some bit masking to quickly check whether a number is in the range):

23
16

Sort this small array and calculate the position of the median using the 2 counts (before and after):

count
3           (numbers before this bucket)
    16  -> median (4th of 7)
    23
2           (numbers after this bucket)
– Danny_ds
  • I actually think this will work. I will try this out. – BijanProgrammer Jul 09 '18 at 21:18
  • @Bijan The only problem could be the I/O being the bottleneck. Reading 1GB cold from disk on a 100MB/s hard disk would already take 10 seconds. - But once that data is cached it should take less than a second to go through it (assuming the numbers are stored in binary form). – Danny_ds Jul 09 '18 at 21:32
  • Title does say *huge **text** file*. – jxh Jul 10 '18 at 05:28
  • @jxh Yes, I provided an answer for a text file as well. But it looks like the OP is generating the file (to stay below the 20MB memory limit), so a better option would be to save those numbers as binary instead of converting them to and from text (5 sec. limit). – Danny_ds Jul 10 '18 at 13:17

What you really need is a divide and conquer algorithm for this kind of problem. Have a look at the external merge sort and distribution sort sections of External Sorting.

The idea is to sort the data in multiple chunks and then merge those chunks using a divide and conquer approach.

It has a time complexity of O(n log n), which I think will pass the time limit.

These algorithms are quite famous, and you can just google them to get the implementation details.
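
Since the answer only points at the technique, here is a minimal external merge sort sketch under assumptions of mine (not from the answer): the input is a binary file of 64-bit values named "numbers.bin", chunks of 2 million values (16MB, under the 20MB limit) are sorted in memory and spilled to temporary files, and the sorted chunks are merged with a min-heap.

#include <algorithm>
#include <cstdint>
#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Phase 1 helper: sort the in-memory chunk and spill it to its own file.
void spillChunk(std::vector<uint64_t>& buf, std::vector<std::string>& names)
{
    std::sort(buf.begin(), buf.end());
    std::string name = "chunk" + std::to_string(names.size()) + ".bin";
    std::ofstream out(name, std::ios::binary);
    out.write(reinterpret_cast<const char*>(buf.data()),
              buf.size() * sizeof(uint64_t));
    names.push_back(name);
    buf.clear();
}

int main()
{
    const size_t CHUNK = 2'000'000; // 2M values * 8 bytes = 16MB per chunk

    // Phase 1: read fixed-size chunks, sort each, write each to a temp file.
    std::ifstream in("numbers.bin", std::ios::binary);
    std::vector<std::string> names;
    std::vector<uint64_t> buf;
    buf.reserve(CHUNK);
    for (uint64_t x; in.read(reinterpret_cast<char*>(&x), sizeof x); ) {
        buf.push_back(x);
        if (buf.size() == CHUNK) spillChunk(buf, names);
    }
    if (!buf.empty()) spillChunk(buf, names);

    // Phase 2: k-way merge - a min-heap holds one (value, source) pair per chunk.
    using Item = std::pair<uint64_t, size_t>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    std::vector<std::ifstream> srcs(names.size());
    for (size_t i = 0; i < names.size(); ++i) {
        srcs[i].open(names[i], std::ios::binary);
        uint64_t v;
        if (srcs[i].read(reinterpret_cast<char*>(&v), sizeof v))
            heap.push({v, i});
    }
    std::ofstream out("sorted.bin", std::ios::binary);
    while (!heap.empty()) {
        auto [v, i] = heap.top(); // smallest remaining value across all chunks
        heap.pop();
        out.write(reinterpret_cast<const char*>(&v), sizeof v);
        uint64_t next;
        if (srcs[i].read(reinterpret_cast<char*>(&next), sizeof next))
            heap.push({next, i});
    }
    // The median is then the value at byte offset ((n - 1) / 2) * 8 in "sorted.bin".
}

With 125M values this produces about 63 chunks, so the merge phase keeps only 63 open streams and one value per stream in memory.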

– HariUserX

In my first answer I gave a solution for finding the median in a list or set of binary numbers (with the memory restriction), without having to sort the whole set.

Just for fun, let's look at a solution where the file contains the numbers as text separated by newlines, and let's do it without converting the text to binary numbers (which can be expensive, and which we can't hold in memory anyway).

Again, we'll use buckets (or bucket counts), but this time we start by grouping by number of digits.

Sample set:

1265
12
6548122
21516
6548455
516831213
2155
21158699
54866

First pass - group by number of digits (array_pos is number of digits this time):

array_pos  count
0            0
1            0
2            1
3            0
4            2
5            2
6            0
7            2
8            1
9            1

So, the median must have 5 digits (before: 3 - after:4).

Second pass (assuming all the 5-digit numbers wouldn't fit in the 20MB): read all 5-digit numbers and group (count) them by the first digit (or the first 2, 3 or 4 digits, depending on the counts):

first_digit  count
1              0
2              1
3              0
4              0
5              1

(Actually, this second pass could just as well be done within the first pass, because the arrays will be small in this case (depending on the number of digits we group on) - we would just have to create an array for each number of digits.)

Locate the group containing the median:

count  first_digit
3                      (numbers with fewer than 5 digits)
    1        2
    1        5  -> median (5th of 9)
4                      (numbers with more than 5 digits)

Last pass - read all the 5-digit numbers having 5 as the first digit, sort them (alphabetical order works here, so still no need for conversion) and locate the median (again, we only have to sort a small subset of the data).

In the small example above there's only one such number, but we still have to fetch it from the file, since we didn't store the values due to the memory restriction.

For performance reasons, functions like getline() or formatted stream extraction should be avoided here - instead, the file should be opened in binary mode. That way we can loop directly over the bytes and just reset the digit count whenever a newline is encountered.
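
A rough sketch of that first pass, with the file name and buffer size being my assumptions, could look like this:

#include <cstdint>
#include <cstdio>
#include <fstream>
#include <vector>

// First pass: count how many numbers have 1, 2, ..., 19 digits by scanning
// raw bytes in large blocks ("numbers.txt" is a hypothetical file name).
int main()
{
    std::ifstream in("numbers.txt", std::ios::binary);
    std::vector<char> buf(1 << 20); // 1MB read buffer
    uint64_t digitCount[20] = {0};  // index = digit count (values < 1e18 have at most 19)
    int digits = 0;

    while (in.read(buf.data(), buf.size()) || in.gcount() > 0) {
        std::streamsize got = in.gcount();
        for (std::streamsize i = 0; i < got; ++i) {
            if (buf[i] == '\n') {   // end of one number: record its length
                if (digits > 0) ++digitCount[digits];
                digits = 0;
            } else if (buf[i] >= '0' && buf[i] <= '9') {
                ++digits;           // one more digit of the current number
            }
        }
    }
    if (digits > 0) ++digitCount[digits]; // last line may lack a trailing '\n'

    for (int d = 1; d <= 19; ++d)
        if (digitCount[d] > 0)
            std::printf("%2d digits: %llu\n", d, (unsigned long long)digitCount[d]);
}

The resulting digit-length histogram tells us which length class contains the median, exactly as in the first table above.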

Even better would be to use memory mapping, but I guess that would be cheating in this case (given the 20MB limit).

– Danny_ds

You can try the median of medians algorithm. It is an in-place selection algorithm with a time complexity of O(n).

  1. Read here
  2. Wikipedia Article
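
The answer doesn't include code; as a quick way to experiment with linear-time selection in C++ (a sketch of mine, not from the answer), std::nth_element does the same job - typically via introselect rather than pure median of medians - though note that it needs the whole dataset in memory, which conflicts with the 20MB limit here:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Selection without a full sort: std::nth_element rearranges the vector so
// that the element at the given position is exactly the one a full sort
// would put there, in average linear time.
int main()
{
    std::vector<uint64_t> v = {23, 8, 16, 42, 15, 4, 108}; // sample from the question
    auto mid = v.begin() + (v.size() - 1) / 2;             // position of the median
    std::nth_element(v.begin(), mid, v.end());
    std::cout << "median = " << *mid << '\n';              // prints: median = 16
}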

– SynAck